CN110162556A

CN110162556A - A method for effectively realizing the value of data

Info

Publication number: CN110162556A
Application number: CN201811116595.1A
Authority: CN
Inventors: 王伟
Original assignee: Shaanxi Aishanwulian Technology Co ltd
Current assignee: Shaanxi Aishanwulian Technology Co ltd
Priority date: 2018-02-11
Filing date: 2018-09-25
Publication date: 2019-08-23

Abstract

This scheme provides a method for effectively realizing the value of data. Following a method or model with specific normative requirements, different functional units (also called services) are connected through well-defined interfaces or contracts to form a system. The functional units (services) can correspond to the individual stages of data processing and application. The method addresses the twofold challenge that data (especially big data) pose in both technology and application, so that the collection, management, analysis, mining, and application of data can be carried out better, more easily, and more efficiently, thereby realizing the value of data more effectively.

Description

A method for effectively realizing the value of data

Technical field

The present invention relates to a method for effectively realizing the value of data, and more particularly to techniques and application methods for massive data (especially big data).

Background technique

Data are collections of numbers, symbols, letters, and text of all kinds, and may also be images, sound, computer code, and so on. Data are a form of expression of facts, concepts, instructions, or content, and can be processed manually or by automated equipment; once interpreted and given meaning, data become information. Put simply, data are physical symbols, arranged and combined according to certain rules, that carry or record information. In recent years the world has come to have more than 4.6 billion mobile phones and 2 billion Internet users, and the volume of data and information exchanged is greater than ever before. China has a large population; by 2013 it had more than 500 million Internet users, who continuously generate massive amounts of data.

Much of the information in the physical world is being digitized. Sina Weibo digitizes teahouse-style chatter (information generated through weak ties), chat between friends is digitized (information generated through strong ties), and video surveillance cameras digitize images. In the Yahoo era the web was essentially read-only: only Yahoo's editors performed write operations. In the Web 2.0 era the number of users grew sharply and users actively submitted records of their own behavior, ushering in the social era. In the mobile phone era, the appearance of large numbers of mobile devices means that users not only submit their own behavior but also interact with their social circles in real time, so massive amounts of highly propagable data are generated. Data also need to be retained: the bridges of San Francisco, for example, have kept a century of historical data, which gains value over that time span. Many websites paid too little attention to data in their early days (partly because storage devices were expensive and keeping data was costly). Storage is now cheap and the value of data is recognized, so more and more data are valued and retained over the long term, which in turn produces large volumes of data.

The mobile Internet ties data to the scene in which they arise and to people. Information has to be obtained from people and from the physical world, and this offline information is gradually becoming more important than that of the online world, so online services extend offline and into traditional industries. Smart hardware, autonomous driving, and robots, for example, will play increasingly important roles in the future; every button may become a networked device, and sensors will be everywhere. As more and more devices (things) are networked, the result is a sustained explosion of data. Building on data and networks, every industry is actively creating Internet of Things applications so that its products, equipment, or services feed data back during use, and uses those data to upgrade its services or products and, in turn, to adjust its future strategy and business model.

The significance of data is that they can convey information. Receiving information begins with receiving data, and information can be obtained only by interpreting the data against their background. The data background is the prior knowledge the recipient holds about the specific data: when the recipient understands the rules governing the sequence of physical symbols and knows what each symbol and combination of symbols refers to or means, the recipient can obtain the information carried by a set of data. The conversion of data into information can therefore be expressed by the formula "data + background = information". Service means doing things for others so that they benefit, whether for payment or free of charge; it meets some particular need of others by providing labor rather than a physical good. Providing a service may involve: an activity performed on a tangible product supplied by the customer (such as repairing a car); an activity performed on an intangible product supplied by the customer (such as preparing the income statement needed for a tax return); the delivery of an intangible product (such as providing information in the course of teaching); or creating an ambience for the customer (as in hotels and restaurants). DaaS (Data-as-a-Service) is a newer service concept following IaaS, PaaS, and SaaS, and is referred to as data service: data act as a service by conveying useful information that helps others in their activities. Data about a car's composition and damage, for example, help a mechanic repair it; and when we look up data online, the useful information in those data influences our activities, which is also a kind of service.
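The formula "data + background = information" can be illustrated with a minimal sketch. The byte layout, the field names `patient_id` and `temperature_c`, and the helper `interpret` are all hypothetical and chosen only for illustration: the same bytes yield information only once the recipient knows the rules of the symbol sequence and the meaning of each field.

```python
import struct

# Hypothetical raw data: 8 bytes that carry no meaning on their own.
raw = struct.pack("<if", 42, 36.5)

def interpret(data: bytes, background: dict) -> dict:
    """Turn data into information by applying the recipient's background,
    here a binary layout plus the meaning (name) of each field."""
    values = struct.unpack(background["layout"], data)
    return dict(zip(background["fields"], values))

# Background assumed for illustration: a little-endian int followed by a float.
background = {"layout": "<if", "fields": ("patient_id", "temperature_c")}
info = interpret(raw, background)
print(info)  # {'patient_id': 42, 'temperature_c': 36.5}

# The same data read against a different background give different "information".
other = interpret(raw, {"layout": "<ii", "fields": ("a", "b")})
print(other)
```

With the intended background the bytes read as a patient identifier and a temperature; with the second, mistaken background the second field comes out as an unrelated integer.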

With the continuous progress of science and technology such as computers and networks, society has entered a stage of rapid development in which technology is advanced, life is convenient, information circulates quickly, and people communicate closely. Data are generated in large quantities, and not only from what people post on the Internet. Industrial equipment, automobiles, and electricity meters around the world carry countless digital sensors that continuously measure and transmit changes in position, motion, vibration, temperature, humidity, and even the chemical composition of the air. Broadly speaking, the sources or carriers of data include: the mobile Internet, mobile phones, tablets, PCs, smart devices, and sensors in every corner of the earth; RFID, sensor networks, social networks, the Internet of Things, cloud computing, robots, the Internet of Vehicles, and manufacturing; society, enterprises and public institutions, government and commerce; agriculture, forestry, animal husbandry, and fishery; Internet text and documents and click behavior, Internet search indexes, web logs, communication records, transaction data, and media content (such as text, voice, images, and video); astronomy, atmospheric science, genomics, biology, geochemistry, and other complex and/or interdisciplinary scientific research; military reconnaissance, medical records, photo and video archives, e-commerce, entertainment and travel, daily necessities (clothing, food, housing, and transport), geospatial data, and sports and entertainment (such as individual athletes and the playing field in a basketball game). Every industry generates enormous quantities of data, or data fragments, every day. Such data have always existed; what is new in the network era is that they have come online, in ever greater quantity. Ride-hailing data for transport are one example: if such data were not online they would be hard to use, and Taobao's data are valuable precisely because they are online. Being online makes data collection very easy, so data can quickly contribute to society; a ride-hailing app may affect taxi drivers more than a taxi company does. As society develops rapidly, technology advances, information flows faster, people communicate ever more closely, life becomes more convenient, and more and more data come online. The US Internet Data Center has pointed out that data on the Internet will grow by 50% per year, that is, double roughly every two years, and that more than 90% of the world's data have been generated in the last few years. Units of data measurement have grown from Byte, KB, MB, GB, and TB to PB, EB, ZB, and YB, and even BB, NB, and DB. As data keep growing by orders of magnitude, specific technologies are needed to manage and process them well and to extract insight from massive data that better guides life, business, and production.
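The growth figures and measurement units above can be checked with a short sketch. The function names `humanize`, `doubling_time`, and `grow` are hypothetical, and the sketch assumes binary steps of 1024 and stops at YB, since BB, NB, and DB are not standardized units.

```python
import math

# Units named in the text, up to YB; each step is assumed to be a factor of 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(n_bytes: float) -> str:
    """Express a byte count in the largest listed unit that keeps the value below 1024."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

def doubling_time(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

def grow(start: float, annual_growth: float, years: int) -> float:
    """Volume after the given number of years of compound growth."""
    return start * (1 + annual_growth) ** years

print(humanize(35 * 1024**7))          # 35.00 ZB
print(round(doubling_time(0.5), 2))    # 1.71
print(grow(1.0, 0.5, 2))               # 2.25
```

At 50% per year the volume reaches 2.25 times its starting value after two years, so "doubling every two years" is a slightly conservative reading of the same growth rate.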

Since about 2009, "big data" has been a buzzword in the Internet information technology industry. The first organization to declare that the era of big data had arrived was McKinsey, the world-renowned consulting firm. McKinsey pointed out in a research report that data have penetrated every industry and business function and are becoming an important factor of production, and that the use of massive data heralds a new wave of productivity growth and consumer surplus. After the McKinsey report was published, big data quickly became a popular concept that the computer industry rushed to embrace, and it also drew close attention from the financial sector. According to IDC's forecast, the world will hold a total of 35 ZB of data by 2020. Most of the world's giant enterprises have recognized the importance of data in the big data era and have been acquiring big data vendors in order to integrate the technology. Jack Ma, the founder of Alibaba, has also said that the coming era will not be the era of information technology (IT) but the era of DT (Data Technology).

Big data development has also been established as a national strategy: implement the "Internet Plus" action plan, develop the sharing economy, implement the national big data strategy, and vigorously develop industrial big data and big data for emerging industries. Using big data to drive the deep integration of informatization and industrialization, and thereby make manufacturing networked and intelligent, has been a development hotspot in industrial technology in recent years. Industrial enterprises need to master the industrial data of the high-end equipment sector, take hold of the industrial big data of China's manufacturing industry, and use industrial big data effectively. In the Internet of Things plan issued by the Ministry of Industry and Information Technology, information processing technology is put forward as one of four key technological innovation projects; it includes massive data storage, data mining, and intelligent image and video analysis, all of which are important components of big data. Information sensing technology, information transmission technology, and information security technology are likewise closely related to big data.

Big data is a feature of the current stage of Internet development. Against the backdrop of technological innovation represented by cloud computing, data that were once hard to collect and use are becoming easy to exploit. With the arrival of the cloud era, big data has attracted more and more attention. The term is commonly used to describe the large volumes of unstructured and semi-structured data that a company creates, data that would take too much time and money to load into a relational database for analysis. Google's search engine, for example, faces massive web data; Google first put forward the concept of cloud computing in 2006 and uses cloud computing servers to support its various big data applications. Big data is in essence comprehensive and mixed, and is characterized by large volume, fast input and processing, diverse data, and low value density. Only with cloud computing servers does big data acquire application value.

One explanation holds that the core of big data is to have enough data, to tolerate inaccurate information in the data, and to seek not why something happens but what will happen. Wikipedia's explanation is: big data, also called huge data or massive data, refers to data whose volume is so large that they cannot be captured, managed, processed, and organized manually within a reasonable time into information that humans can interpret. Baidu Baike's definition is: big data, or huge data, refers to data whose volume is so large that they cannot be acquired, managed, processed, and organized with current mainstream software tools within a reasonable time into information that helps enterprises make better business decisions. Some experts hold that big data is inseparable from information and refers to the statistical and technical handling of huge amounts of information. 360 Baike defines big data, or huge data, as massive, high-growth, and diversified information assets that require new processing models in order to deliver stronger decision-making power, insight, and process optimization capability.

With the continuous development of Internet technology, data, and big data in particular, have themselves become assets; the industry has reached a consensus on this point. Big data is essentially a by-product of increasingly widespread human activity on the network. Its core lies in mining the value contained in the data, and it has received great attention from the international community and from domestic government departments and enterprises. Big data technology has the potential to reach a great many enterprises, and the enterprise that can obtain valuable information from massive data first has the greatest advantage. Capturing data that contain producers' true intentions and preferences and that have non-traditional structure and meaning, and "refining" useful information from massive data, pose real challenges for network architecture and data processing capacity. After several years of criticism, doubt, discussion, and hype, big data technology is gradually maturing. Research on big data application models and business models for different fields has become the key to the healthy development of the big data industry.

Big data has five "V" characteristics: Volume, Variety, Value, Velocity, and Veracity. In other words, it has five aspects. First, the volume of data is huge, rising from the TB level to the PB level. Second, the types of data are varied, such as web logs, video, pictures, and geolocation information. Third, the value density is low: in continuous, uninterrupted video surveillance, for example, the useful data may amount to only a second or two. Fourth, processing is fast, as expressed by the "one-second rule". Fifth, veracity concerns the accuracy and trustworthiness of the data, that is, data quality. Before "big data", academia spoke of "large data"; in English "large" refers to volume, whereas "big" also covers weight and magnitude of value. Big data is not a mere piling up of quantity; it is strongly correlated and structured. Structured data have great research value and great commercial value; only multidimensional, strongly structured data count as big data and have value, and otherwise they are merely large-scale data. The scale of big data must be large, larger even than that of large-scale data. Some prediction models need a great deal of data and training corpus; if the data are not big enough, many mining tasks, such as click-through rate prediction, are hard to do. If you know a user's long-term whereabouts, online behavior, and read and write operations, you can predict that person almost exactly and make all kinds of recommendations very precise.

With the development of technology and the Internet, every industry generates enormous quantities of data fragments every day. Every industry has big data, and units of measurement have grown from Byte, KB, MB, GB, and TB to PB, EB, ZB, and YB, and even BB, NB, and DB. Mining and processing big data must rely on cloud computing. Cloud computing is the product of the development and convergence of traditional computing and network technologies such as distributed computing, parallel computing, utility computing, network storage technologies, virtualization, load balancing, and high availability (hot-standby redundancy). The core of big data technology is storing and analyzing massive data; compared with other existing technologies it is "cheap, fast, and optimized", and its overall cost is the best. Big data can be applied in every industry: the huge volume of collected data is analyzed and organized so that the information is used effectively. For example, to find major genes related to milk yield at the level of the dairy cow genome, we can first scan the whole genome. Although we then hold all the phenotype and genotype information, the volume of data is so large that big data techniques are needed to analyze and compare it and to uncover the major genes.

Changes in the total volume of data have changed the form that information takes. The disciplines that first experienced the information explosion, such as astronomy and genetics, coined the term "big data", and the concept is now applied in almost every field of human endeavor. Several clear trends have also emerged. Since 2015, the market for wearable, smart, and data-enabled devices and their early adopters has boomed, and Internet of Things applications have become widespread. Text analysis will be used more widely: most of the data we store and analyze today are unstructured, text analysis has grown increasingly sophisticated over the past few years, and this trend will continue. Computers will become more adept at "reading" articles (or converting speech to text) and at understanding an article's subject and sentiment, which means that such articles can be classified and analyzed like structured data. Data visualization tools will dominate the market: professional software for visualizing data is already available and makes it easier to find patterns and causal relationships; this software will become more sophisticated and more widely used, and its market will grow 2.5 times as fast as that of other business intelligence software products. The public will become deeply anxious about privacy: as with the breaches suffered in recent years by users of Apple, Sony, and Snapchat, major security breaches have so far not changed the public's habit of sharing the private details of daily life on social media and the web. Hackers are now able to threaten even the most secure systems, and while governments and law enforcement try to prevent data leaks, bringing offenders to justice is slow. The possibility of a catastrophic hack or information leak will be enough to change people's attitudes and revive their awareness of protecting personal data. Companies and institutions will compete for data talent: the number of practitioners working directly in big data analysis may reach 4.4 million next year, but even that is not enough. According to market surveys, by 2015 70% of US companies will be implementing an appropriate data policy or drawing up a data strategy in the near future; although the number of universities offering courses related to big data analysis keeps increasing, people with the skills that will be needed remain in short supply. Big data will provide the key to many mysteries of the universe: the Large Hadron Collider is currently being upgraded and is expected to return to service early next year. Inside it, 600 million high-speed proton collisions occur per second, and the data recorded each year amount to 30 PB. The data are analyzed by a network of 170 computing facilities spread across 36 countries, making this the largest scientific big data experiment to date. It has already found a particle consistent with the theory of the Higgs boson, which many take to mean that our understanding of the origin and workings of the universe is heading in the right direction. After the upgrade the collider will perform twice as well as before, and new discoveries are sure to follow once it is back in service. Finally, the scope of data and big data is widening and their value is growing, and value will gradually shift from functional value to data value.

Big data is not a specific method, nor even a specific research discipline; it is a description of a class of problems, or of the data to be processed. It is commonly defined as data sets "beyond the ability of commonly used software tools to capture, manage, and process". Big data typically draws on many technologies and methods and combines information from multiple channels and different times. Many things that could not previously be measured, stored, analyzed, or shared have been digitized, and having large amounts of data, including more data that are less precise, opens a new door to understanding the world. Scientific progress is increasingly driven by data. Massive data bring opportunities for data analysis and also pose new challenges, and meeting them calls for new ideas and methods. The big data era requires the computing model to shift from a "process" core to a "data" core: the fact that databases and record stores can yield deep information shows that data really are more important than process, and thinking about and solving problems with a data-centered mindset reflects the change now under way in the IT industry. The most prominent feature of big data thinking is the shift from traditional causal thinking to correlational thinking. Causal thinking tells us that we must find a cause in order to derive a result. Big data does not need to find the cause, and does not need scientific means to prove that there is a necessary connection, causal relationship, or basis between one event and another. In the high-speed information age, where timely information and real-time prediction are required, it is enough to find correlations using fast big data analysis in order to predict user behavior and give fast decision-making a head start. Data, once interpreted, become information; information, once it becomes common understanding, becomes knowledge; so data interpretation and data analysis both generate immeasurable value. Big data analysis is the process of examining massive data for patterns, correlations, and other useful information, in order to help enterprises adapt better to change and make wiser decisions.
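The correlational thinking described above can be made concrete with a minimal sketch of the Pearson correlation coefficient, one common way to measure how strongly two series move together without any claim about cause. The function name `pearson` and the sample figures (daily page views and purchases) are hypothetical and serve only as illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two series around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical observations: daily page views and purchases of one product.
page_views = [120, 150, 170, 200, 260]
purchases = [10, 14, 15, 19, 25]
print(pearson(page_views, purchases))
```

A value close to 1 indicates that the two series rise together, which is enough to support a prediction even though it says nothing about why they do.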

Technically, big data comes down to four key problems: how to store massive data effectively; how to compute over massive data quickly; how to query massive data quickly; and how to mine hidden knowledge from massive data. The "Google File System" paper discusses how to store massive data effectively on ordinary machines, "Google MapReduce" discusses how to compute over massive data quickly, and "Google BigTable" discusses how to query massive data quickly. After these three Google papers appeared, Apache, which is dedicated to open source, developed and open-sourced the corresponding Hadoop. Hadoop comprises three parts, HDFS, MapReduce, and HBase: HDFS solves the storage problem of big data, MapReduce solves the computation problem, and HBase solves the problem of querying large volumes of data. Building on highly scalable, high-performance distributed processing frameworks such as MapReduce, the Apache Hadoop open source project team cloned and released the Hadoop/YARN system, which has been widely recognized and adopted by academia and industry and has incubated numerous sub-projects (such as Hive, ZooKeeper, and Mahout). It has continued to branch and evolve, and has gradually formed an ecosystem that is easy to deploy and develop for, feature-complete, and high-performing. Within it, storage is largely dominated by HDFS, while computation and query technologies keep evolving. To accommodate traditional data processing habits and lower the difficulty of big data processing, SQL-on-Hadoop tools such as Hive, Pig, and Impala offer easier programming. Spark aims to address all big data computation problems; stream computing technologies such as Spark Streaming, Storm, and S4 enable real-time computation on data; and Apache released Flink in the hope of unifying stream and batch computation. On this basis, the Spark/BDAS technology implements an in-memory distributed processing model, so that users need not concern themselves with complex internal mechanisms or have distributed systems knowledge and extensive development experience in order to deploy large-scale distributed systems and process big data in parallel.
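The MapReduce computation model mentioned above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases on a word count, not Hadoop's actual API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(word_count(["big data", "data value"]))  # {'big': 1, 'data': 2, 'value': 1}
```

In a real cluster the map and reduce calls would run in parallel on many nodes, with the input and output held in a distributed file system such as HDFS; the sketch keeps only the data flow.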

Previously, only a small number of companies had the technical strength and funds to store and mine massive data for insight. The open-sourcing of Apache Hadoop in 2009 disrupted this situation: by using clusters of commodity servers, it greatly lowered the threshold for processing massive data. Many industries (such as health care, infrastructure, finance, insurance, telematics, consumer, retail, marketing, e-commerce, media, manufacturing, and entertainment) therefore began their Hadoop journey and set out to extract value from massive data. Hadoop mainly provides two functions: by scaling out across commodity hosts, HDFS provides an inexpensive way to store massive data fault-tolerantly; and the MapReduce computing paradigm provides a simple programming model for mining data and gaining insight. In a MapReduce data processing flow, the output of one Map-Reduce step typically serves as the input of the next Hadoop job, and intermediate results are passed through disk, so jobs made up of many Map-Reduce steps tend to be I/O-bound. For use cases such as ETL, data integration, and data cleansing, the I/O constraint has little impact, because these scenarios usually do not place high demands on processing time. Many real-world use cases, however, have stricter latency requirements, such as processing streaming data for near-real-time analysis, or analyzing clickstream data to make video recommendations that improve user engagement; in these use cases developers must strike a balance between accuracy and latency.
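The way chained steps pass intermediate results through disk can be sketched as follows. This is a local simulation under stated assumptions, not Hadoop code: the JSON-lines file format, the helper names `run_step`, `count_clicks`, `keep_popular`, and `run_pipeline`, and the sample click records are all hypothetical.

```python
import json
import os
import tempfile
from collections import Counter

def run_step(input_path, output_path, step):
    """Read records from disk, apply one processing step, write the result back to disk."""
    with open(input_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src]
    with open(output_path, "w", encoding="utf-8") as dst:
        for record in step(records):
            dst.write(json.dumps(record) + "\n")

def count_clicks(records):
    """Step 1: count clicks per video."""
    counts = Counter(record["video"] for record in records)
    return [{"video": video, "clicks": n} for video, n in sorted(counts.items())]

def keep_popular(records):
    """Step 2: keep videos with at least two clicks."""
    return [record for record in records if record["clicks"] >= 2]

def run_pipeline(clicks):
    """Chain two steps; every hand-over between steps goes through a file."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"stage{i}.jsonl") for i in range(3)]
        with open(paths[0], "w", encoding="utf-8") as f:
            for click in clicks:
                f.write(json.dumps(click) + "\n")
        run_step(paths[0], paths[1], count_clicks)   # output of step 1 ...
        run_step(paths[1], paths[2], keep_popular)   # ... is the input of step 2
        with open(paths[2], encoding="utf-8") as f:
            return [json.loads(line) for line in f]

print(run_pipeline([{"video": "a"}, {"video": "b"}, {"video": "a"}]))
```

Each additional step adds one full write and one full read of the intermediate data, which is why long chains of such steps are limited by I/O rather than by computation.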

Hadoop is a software framework for the distributed processing of massive data, a distributed computing platform that processes data in a reliable, efficient, and scalable way. It relies on community servers, so its cost is relatively low and anyone can use it. Users can easily develop and run applications that process massive data on Hadoop. Its main advantages are as follows. High reliability: Hadoop's ability to store and process data bit by bit is trusted. High scalability: Hadoop distributes data and completes computing tasks across the available computer clusters, which can easily be expanded to thousands of nodes. High efficiency: Hadoop can move data dynamically between nodes and keeps the nodes in dynamic balance, so processing is very fast. High fault tolerance: Hadoop automatically keeps multiple copies of the data and automatically reassigns failed tasks. Hadoop's framework is written in Java, so it is well suited to running on Linux production platforms; applications on Hadoop can also be written in other languages, such as C++.
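The fault tolerance described above, keeping multiple copies and reassigning failed tasks, can be sketched with a toy model. The round-robin placement below is a simplifying assumption and is not HDFS's actual placement policy; the names `place_replicas` and `run_with_retry`, the node labels, and the replication factor of 3 used as the default are all chosen for illustration.

```python
def place_replicas(block_id, nodes, replication=3):
    """Choose `replication` distinct nodes for one block, round-robin from a start index."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

def run_with_retry(task, replicas, failed_nodes):
    """Run the task on the first live node holding a replica, skipping failed nodes."""
    for node in replicas:
        if node in failed_nodes:
            continue  # reassign the task to the next replica holder
        return node, task(node)
    raise RuntimeError("no live replica available for this block")

nodes = ["n1", "n2", "n3", "n4"]
replicas = place_replicas(1, nodes)
print(replicas)  # ['n2', 'n3', 'n4']

# Node n2 is assumed to have failed, so the task is reassigned to n3.
print(run_with_retry(lambda node: f"processed on {node}", replicas, {"n2"}))
```

Because every block lives on several nodes, the loss of one node costs neither the data nor the task, only the time needed to rerun it elsewhere.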

Storm is free, open source software: a distributed, fault-tolerant real-time computation system that processes huge data streams very reliably, doing for stream processing what Hadoop does for batch processing. Storm is simple, supports many programming languages, and is fun to use. It was open-sourced by Twitter, and well-known users include Groupon, Taobao, Alipay, Alibaba, Happy Elements, and Admaster. Storm has many application areas: real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call, a protocol for requesting a service from a program on a remote computer over the network), ETL (Extraction-Transformation-Loading, that is, data extraction, transformation, and loading), and so on. Storm's processing speed is impressive: in tests, each node processed one million tuples per second. Storm is scalable and fault-tolerant, and is easy to set up and operate.
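The tuple-at-a-time style of stream processing described above can be sketched with Python generators. The sketch borrows Storm's terms "spout" (a source of tuples) and "bolt" (a processing step), which the text does not itself use, and it is a single-process illustration rather than Storm's API; the function names and sample sentences are hypothetical.

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: emit one tuple per incoming sentence."""
    for sentence in sentences:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keep a running count and emit the updated count as each word arrives."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

# Wire the steps into a small topology and consume results as they are produced.
topology = count_bolt(split_bolt(sentence_spout(["a b", "a"])))
for result in topology:
    print(result)
# ('a', 1)
# ('b', 1)
# ('a', 2)
```

Unlike a batch job, each result is available as soon as its input tuple has passed through the chain, without waiting for the whole data set.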

To help enterprise users find more effective ways to speed up Hadoop data queries, the Apache Software Foundation launched an open source project named "Drill". Apache Drill is an implementation of Google's Dremel. According to Tomer Shiran, product manager at the Hadoop vendor MapR Technologies, Drill operates as an Apache Incubator project and will continue to be promoted to software engineers worldwide. The project will create an open source version of Google's Dremel tool for Hadoop (Google uses the tool to speed up the Internet applications of Hadoop data analysis tools), and Drill will help Hadoop users query massive data sets faster. The Drill project in fact draws its inspiration from Google's Dremel project, which helps Google analyze massive data sets, including analyzing crawled web documents, tracking data for applications installed from Android Market, analyzing spam, and analyzing test results from Google's distributed build system. By developing the Apache Drill open source project, the organization hopes to establish Drill's own API and a flexible, powerful architecture that supports a wide range of data sources, data formats, and query languages.

RapidMiner is a world-leading data mining solution built to a large extent on advanced technology. Its data mining tasks cover a wide range, including a variety of data techniques, and it simplifies the design and evaluation of data mining processes. Functions and features: free data mining technology and libraries; 100% Java code (so it can run on common operating systems); a simple, powerful and intuitive data mining process; an internal XML representation that guarantees a standardized format for exchanging data mining processes; automation of large-scale processes with a simple scripting language; multi-level data views that keep data valid and transparent; interactive prototyping through a graphical user interface; a command line (batch mode) for automated large-scale application; a Java API (application programming interface); a simple plug-in and extension mechanism; a powerful visualization engine with visual modeling of many kinds of cutting-edge high-dimensional data; and support from more than 400 data mining operators. RapidMiner, formerly known as YALE, has been successfully applied in many different fields, including text mining, multimedia mining, function design, data stream mining, integrated development methods and distributed data mining.

The Pentaho BI platform differs from traditional BI products. It is a process-centric, solution-oriented framework whose purpose is to integrate a range of enterprise-level BI products, open source software, APIs and other components so as to ease the development of business intelligence applications. Its emergence allows a series of independent business-intelligence-oriented products, such as JFree and Quartz, to be integrated into complex, complete business intelligence solutions. The Pentaho BI platform, the core architecture and foundation of the Pentaho Open BI suite, is process-centric because its central controller is a workflow engine. The workflow engine uses process definitions to define the business intelligence processes executed on the BI platform; processes can easily be customized and new processes can be added, and the BI platform includes components and reports for analyzing the performance of these processes. At present, Pentaho's main components cover report generation, analysis, data mining and workflow management; these components are integrated into the Pentaho platform through technologies such as J2EE, WebService, SOAP, HTTP, Java, JavaScript and Portals. Pentaho is distributed mainly in the form of the Pentaho SDK, which contains five parts: the Pentaho platform, the Pentaho sample database, a standalone Pentaho platform, Pentaho solution examples, and a preconfigured Pentaho web server. The Pentaho platform is the most important part and contains the main body of the Pentaho platform source code. The Pentaho database provides the data services needed for the platform's normal operation, including configuration information and solution-related information; it is not strictly required by the platform and can be replaced by other database services through configuration. The standalone Pentaho platform is an example of the platform's independent operating mode, demonstrating how the platform can run without the support of an application server. The Pentaho solution example is an Eclipse project that demonstrates how to develop business intelligence solutions for the Pentaho platform. The Pentaho BI platform is built on a foundation of servers, engines and components, which provide the system's J2EE server, security, portal, workflow, rule engine, charting, collaboration, content management, data integration, analysis and modeling functions. Most of these components are standards-based and can be replaced with other products.

A network service (Web service) can be understood as a WEB-based, seamlessly integrated module with independent functionality that is independent of platform and language. How to apply Web services, a very popular technology, within an enterprise's IT systems and business processes so as to bring the enterprise direct economic benefit has always received close attention and high regard from company managers at home and abroad. SOA (Service-oriented architecture), which has emerged in recent years and is described as the architecture of next-generation Web services, organizes a number of services over the Internet to meet different demands, and can also improve a system's robustness, maintainability, scalability and portability. Gartner first proposed SOA in 1996. In December 2002, Gartner stated that SOA was "the most important topic in modern application development". Many IT organizations have successfully built and run SOA application software for years, and vendors such as BEA and IBM followed one after another once they saw its value. BEA's CIO Rhonda proposed converting BEA's IT infrastructure to SOA as early as June 2001, and achieved good results in control over the whole enterprise architecture, higher development efficiency, faster development, and lower investment in customization and staff skills. Opinions still differ on what SOA actually is: some call it an architecture, some a methodology, some a way of thinking. Which specific technology is used to implement a service is not the key point; the key is how to turn each module into a service that other modules can call, so that modules can call one another. In addition, SOA and frameworks can be considered separately, but they complement and reinforce each other when combined.

SOA is a specification for designing, developing, applying and managing distributed logical (service) units in a computing environment; this definition accounts for SOA's breadth. The goal of SOA is to make IT more flexible so that it can respond quickly to the demands of business units and realize the Real-Time Enterprise (the vision that Gartner describes for SOA). SOA is not only a development methodology; it also covers management: once SOA is adopted, managers can conveniently manage the applications built on the service platform rather than managing individual application modules.

A central idea of SOA is to free enterprise applications from the constraints of technology-oriented solutions so that they can easily cope with changes in, and the growth of, the enterprise's business services. A single application cannot cover the varied demands of service users in an enterprise environment; even a large ERP solution cannot close the continually widening and shifting gap. To react quickly to the market, business users can only strain to support their existing business needs by continually developing new applications and extending existing ones. By focusing attention on services, applications can be brought together to provide richer, more purposeful business processes, and an SOA-based enterprise application system usually reflects its alignment with the business model more faithfully. Services look at technology from the perspective of the business process, the opposite of the usual view that starts from the available technology and drives the business. The advantage is that services can be tied to business processes, represent the business model, and support business processes better and more accurately, whereas an application-centric enterprise model forces service users to limit their capabilities to those of the application.

Traditional application integration methods (point-to-point integration, enterprise message bus or middleware integration (EAI), and business-process-based integration) are all complex, expensive and inflexible, and they struggle to adapt quickly to the demands that constantly arise from changes in modern enterprise business. Application development and integration based on service-oriented architecture (SOA) can solve many of these problems. SOA describes a complete set of development patterns that help client applications connect to services; these patterns define a series of mechanisms for describing services, advertising and discovering services, and communicating with services. Unlike traditional application integration methods, all of the patterns built around services in SOA are implemented with standards-based technology. Most communication middleware systems, such as RPC, CORBA, DCOM, EJB and RMI, pursue the same goal, but their implementations are imperfect and always run into problems in balancing interoperability against acceptable standard customization. SOA attempts to eliminate these defects: almost every communication middleware system has a fixed processing model (functions for RPC, objects for CORBA, and so on), whereas a service can be exposed as a function and, at the same time, as an object, an application and so on. It can therefore adapt to any existing system, and integration does not require systems to follow any particular customization. SOA helps enterprise information systems move to a "leave-and-layer" architecture, meaning that a system can expose Web service interfaces without any change to the existing business system: the Web service interface wraps the application layer in one more layer of encapsulation, so systems and applications can be turned into services quickly without modifying the existing system architecture. SOA covers not only the information in packaged applications, custom applications and legacy systems, but also the functions and data in the IT infrastructure, such as security, content management and search. Because SOA-based applications can easily add functions from these infrastructure services, they can respond quickly to market changes and let business units design and develop new functional applications. The main difference between service-integration-based enterprise applications and a traditional enterprise information integration architecture is that the former uses standards-based services and includes process and data services, orchestration and composition. Standards-based services form the integration points between applications, and service orchestration and composition increase the flexibility, reusability and integration of services. Each service is independent within the system, so a fault in one service does not affect the rest of the system, which also improves robustness, maintainability and scalability. Because services are independent of platform and language, portability is good and the development cycle is shortened. For example, if a website needs to show the local weather, there is no need to write the weather code ourselves; an existing service can provide the function, greatly improving development efficiency.
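
The weather example above can be made concrete with a minimal sketch of a service contract. Everything named here (`WeatherService`, `StaticWeatherService`, `render_weather_banner`, the canned data) is hypothetical and chosen for illustration; a real deployment would more likely expose the contract as a Web service interface than as an in-process class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class WeatherReport:
    city: str
    temperature_c: float
    summary: str


class WeatherService(ABC):
    """The contract: callers depend on this interface, not on an implementation."""

    @abstractmethod
    def current_weather(self, city: str) -> WeatherReport:
        ...


class StaticWeatherService(WeatherService):
    """Hypothetical provider with canned data; a real one might call a remote endpoint."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def current_weather(self, city: str) -> WeatherReport:
        temperature_c, summary = self._data[city]
        return WeatherReport(city, temperature_c, summary)


def render_weather_banner(service: WeatherService, city: str) -> str:
    """Website code: uses the service without knowing how the weather is obtained."""
    report = service.current_weather(city)
    return f"{report.city}: {report.temperature_c:.1f} C, {report.summary}"


banner = render_weather_banner(StaticWeatherService({"Xi'an": (21.5, "clear")}), "Xi'an")
```

The website function depends only on the contract, so the provider can be swapped for another implementation without touching the calling code, which is the loose coupling the paragraph attributes to services.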

An application programming interface (API) is a set of predefined functions whose purpose is to give applications and developers access to a set of routines based on some software or hardware, without having to access the source code or understand the details of the internal mechanism. The term sometimes also refers specifically to the documentation describing an API, also called the help documentation. An API can likewise be understood as an application program interface: a collection of definitions, programs and protocols. One major function is to provide a common set of functions; programmers develop applications by calling API functions, which lightens the programming task. An API is also a kind of middleware that provides data sharing across different platforms, and communication between pieces of computer software can be implemented through API interfaces. According to how data is shared between different software applications on a single or distributed platform, APIs can be divided into four types: remote procedure call (RPC), which implements communication between programs through procedures (or tasks) acting on a shared data buffer; Structured Query Language (SQL), the standard query language for accessing data, which shares data between applications through a database; file transfer, which shares data between applications by sending formatted files; and message delivery, which refers to small formatted messages between loosely or tightly coupled applications and shares data through direct communication between programs. Standards currently applied to APIs include the ANSI standard SQL API; standards for the other types are still being formulated.
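
Of the four types listed above, SQL-based sharing is the easiest to show in a few lines. The sketch below uses Python's standard `sqlite3` module as a stand-in for a shared database; the two "applications" (`producer_app`, `consumer_app`), the `readings` table and its contents are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def producer_app(db_path: str) -> None:
    """First application: writes readings through the SQL interface."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL)")
        conn.executemany(
            "INSERT INTO readings (sensor, value) VALUES (?, ?)",
            [("a", 1.5), ("a", 2.5), ("b", 4.0)],
        )
        conn.commit()


def consumer_app(db_path: str) -> list:
    """Second application: reads the same data, knowing only the table schema."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
        ).fetchall()


# The two functions share no code or objects, only the database and its schema.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.db")
    producer_app(path)
    averages = consumer_app(path)
```

Neither side calls the other; the database and the query language are the whole interface between them.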

APIs can be applied to all computer platforms and operating systems and connect data in different formats (such as shared data buffers, database structures and file frameworks). Each data format requires different data commands and parameters to achieve correct data communication, and each also produces different kinds of errors. Therefore, besides the knowledge needed to carry out the data-sharing task, these types of API must also handle many network parameters and possible error conditions; that is, each application must be able to tell whether its communication counterpart has robust supporting programs. By contrast, because a message delivery API handles only one message format, it exposes only a small subset of commands, network parameters and error conditions. For this reason the message delivery approach greatly reduces system complexity, so when applications need to share data across multiple platforms, the message delivery type of API is the preferable choice. An API differs clearly from a graphical user interface (GUI) or a command interface: an API is an operating system or program interface, whereas the latter two are end-user interfaces.

Some companies publish their APIs as open systems and define their own system interface standards. When operations such as system integration, customization and application programming are needed, all members of the company can call the source code through the interface standard, which is called an open API. Open API (OpenAPI) is also a common practice for service-oriented websites: the service provider packages the website's services into a series of APIs and opens them to third-party developers. This is called opening the website's API, and the opened APIs are called OpenAPI. Once a website offers an open API, it can attract third-party developers to build commercial applications on the platform. As the development foundation of online Internet services, OpenAPI has become the inevitable choice for more and more Internet enterprises developing services.

With ever more virtual resources, fitting all types of data optimally across these models is one of the great challenges SOA faces; each SOA model has its own advantages, choices and options for managing data. SOA's three data-centric models are the data-as-a-service (DaaS) model, the physical hierarchy model and the infrastructure component model. The DaaS data access model describes how data is supplied to SOA components; the physical model describes how data is stored and how the storage hierarchy maps onto SOA data storage; the infrastructure component model describes the relationships among data, data management services and SOA components. An enterprise's data requirements may be expressible in terms of a relational database management system (RDBMS); such an enterprise may use database appliances directly, or connect dedicated database servers and existing query services to SOA components (query as a service, or QaaS). This design was already accepted five or more years ago. It succeeds because it balances the three models above: the QaaS service model is not tied mechanically to storage but goes through a single framework, the RDBMS, and data deduplication and integrity are easy to manage within a single framework. The example of big data makes it clear why this simple approach cannot handle data on a larger scale. Most big data is non-relational, non-transactional, unstructured and may never be updated; abstracting data that lacks structure into a query service is not easy; data from multiple sources and in multiple forms is seldom stored in order; and defining integrity and deduplication for the underlying data requires rules of its own. When big data is introduced into SOA applications, the most important of the three models to define is the infrastructure component model.

The architecture model for data relationships in SOA offers two choices: horizontal and vertical. In the horizontally integrated data model, data collection is hidden behind a set of abstract data services that have one or more interfaces to applications and also provide all integrity and data management functions. Components cannot access data directly; they obtain it in the form of a service, just as in the simple case above where the enterprise's data requirements fit a pure RDBMS model. Application components are essentially shielded from the differences in data management between an RDBMS and big data. Although, for the reasons given above, this approach cannot create a simple RDBMS query model, it at least replicates the simple RDBMS model described earlier. The vertically integrated data model connects to data services in a more application-specific way. It largely separates the application data of customer relationship management, enterprise resource planning or dynamic data authentication from one another at the service level, and this separation extends down to the data infrastructure. In some cases these applications may have SOA components that can access storage or data services directly. To provide more uniform data integrity and management, a management service can act as an SOA component that operates the various database systems and performs common tasks, such as deduplication and integrity checking, in a database-specific way. This approach adapts more easily to legacy applications and data structures, but it can break SOA's as-a-service principle in how data is accessed and may cause consistency problems in data management.
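
The horizontally integrated model can be sketched as a single data service that hides both a relational store and an unstructured store from the components that use it. The class names (`RelationalStore`, `LogStore`, `CustomerDataService`) and the sample records are hypothetical; an in-memory SQLite table stands in for the RDBMS and a list of text lines stands in for a big data source.

```python
import sqlite3
from typing import Dict, List


class RelationalStore:
    """Structured side: an in-memory SQLite table stands in for the RDBMS."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")

    def add(self, customer_id: str, name: str) -> None:
        self._conn.execute("INSERT INTO customers VALUES (?, ?)", (customer_id, name))

    def name_of(self, customer_id: str) -> str:
        row = self._conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        return row[0] if row else ""


class LogStore:
    """Unstructured side: raw text lines stand in for a big data source."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines

    def lines_mentioning(self, token: str) -> List[str]:
        # Deduplicate while preserving order, since raw sources may repeat records.
        return list(dict.fromkeys(line for line in self._lines if token in line))


class CustomerDataService:
    """The abstract data service: the only interface that components are given."""

    def __init__(self, relational: RelationalStore, logs: LogStore) -> None:
        self._relational = relational
        self._logs = logs

    def profile(self, customer_id: str) -> Dict[str, object]:
        return {
            "id": customer_id,
            "name": self._relational.name_of(customer_id),
            "events": self._logs.lines_mentioning(customer_id),
        }


store = RelationalStore()
store.add("c1", "Alice")
logs = LogStore(["c1 login", "c2 login", "c1 login", "c1 purchase"])
profile = CustomerDataService(store, logs).profile("c1")
```

Deduplication lives inside the service here, which mirrors the statement that the abstract data service also provides the integrity and data management functions.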

Blockchain is a new application model of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. In the narrow sense, a blockchain is a chained data structure that links data blocks in chronological order, together with a distributed ledger that cryptography makes tamper-proof and unforgeable. In the broad sense, blockchain technology is a new distributed infrastructure and computing paradigm that uses the chained block data structure to verify and store data, distributed node consensus algorithms to generate and update data, cryptography to secure data transmission and access, and smart contracts composed of automated script code to program and operate on data. The consensus mechanism is the mathematical algorithm by which different nodes in a blockchain system establish trust and obtain rights and interests. As a machine for building trust, blockchain may change the way value is transferred across human society. Blockchain has several basic characteristics. Decentralization: the network has no central ruler, and the system relies on the fairness constraints of multiple participants, so the rights and obligations of all nodes are equal; each node can store all the data on the chain, so even if one node is damaged or attacked, the ledger is not threatened. Irreversibility: information on the blockchain cannot be withdrawn or arbitrarily destroyed; the system is open source, so the whole system is open and transparent, and once a transaction has been broadcast to the whole network and has reached 6 or more confirmations, it is safely recorded, irreversible and irrevocable. Tamper resistance: information and contracts cannot be forged. If the ledger were held by one person or a few people, the chance of fraud would be high; but because everyone holds a copy, any tampering is invalid unless more than 51% of the participants in the whole system change the same entry, which is the advantage of collective maintenance and supervision. Anonymity: the identity information of each node does not need to be disclosed or verified, and information is transferred anonymously.
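
The narrow-sense definition above, blocks linked in order and protected by cryptography, can be sketched with a hash chain. This is a single-process illustration only: it leaves out consensus, signatures and networking, and the block layout and function names (`block_hash`, `append_block`, `is_valid`) are assumptions made for the example.

```python
import hashlib
import json
from typing import List

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, prev_hash: str, data: str) -> str:
    """SHA-256 over the block's contents, including the previous block's hash."""
    payload = json.dumps({"index": index, "prev": prev_hash, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[dict], data: str) -> None:
    """Link a new block to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append({"index": index, "prev": prev_hash, "data": data,
                  "hash": block_hash(index, prev_hash, data)})


def is_valid(chain: List[dict]) -> bool:
    """Recompute every hash and check that each block points at its predecessor."""
    expected_prev = GENESIS_PREV
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev"] != expected_prev:
            return False
        if block["hash"] != block_hash(block["index"], block["prev"], block["data"]):
            return False
        expected_prev = block["hash"]
    return True


ledger: List[dict] = []
for entry in ["A pays B 5", "B pays C 2", "C pays A 1"]:
    append_block(ledger, entry)

valid_before = is_valid(ledger)
ledger[1]["data"] = "B pays C 200"  # tamper with a block in the middle of the chain
valid_after = is_valid(ledger)
```

Changing one block invalidates its stored hash, and recomputing that hash would break the link from the next block, so a forger would have to rewrite every later block on a majority of the copies.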

Hadoop is designed to scan large data sets with a highly scalable distributed batch processing system and produce results. The Hadoop project consists of three parts: the Hadoop Distributed File System (HDFS), the Hadoop MapReduce programming model and Hadoop Common. The Hadoop platform is a powerful tool for working with very large data sets. To abstract away some of the complexity of the Hadoop programming model, several application development languages that run on Hadoop have appeared, of which Pig, Hive and Jaql are representative. Besides Java, map and reduce functions can also be written in other languages and invoked through an API called Hadoop Streaming (Streaming for short). Hadoop Streaming is a Hadoop utility that helps users create and run a special kind of map/reduce job in which executable files or script files act as the mapper or reducer. The mapper and reducer read data from standard input (line by line) and write their results to standard output. The Streaming utility creates a Map/Reduce job, submits it to a suitable cluster, and monitors the job's entire execution. As for streams: from a technical standpoint, a stream is a graph of nodes connected by edges. Each node in the graph is an "operator" or an "adapter" that can process the data in the stream to some degree. A node may have no inputs or outputs, or several of each; the output of one node is connected to the input of one or more other nodes. The edges of the graph bind these nodes together and represent the data flowing between operators. Streams here means IBM InfoSphere Streams (Streams for short). In Streams, data flows through operators capable of manipulating the data stream (which may carry millions of events per second), and dynamic analysis is then performed on the data. This analysis can trigger a large number of events, allowing the enterprise to act in real time on immediate intelligence and ultimately improve business results. After data has flowed through these analysis components, Streams provides operators that store the data in various locations, or discard data that the dynamic analysis judges to be of no value. Streams may look very similar to a complex event processing (CEP) system, but Streams is designed to be more scalable and supports far larger data volumes than other systems. Streams also has stronger enterprise-level features, including high availability, a rich application development toolkit and advanced scheduling.
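
The Hadoop Streaming contract described above (a mapper and a reducer that read lines and write lines) can be sketched as a word count. The functions below take and return iterables of lines so that they can be run locally; in an actual Streaming job each would be a script reading `sys.stdin` and printing to standard output, with the framework sorting the mapper output by key in between. The tab-separated key/value layout follows the usual Streaming convention, and `run_locally` is a hypothetical stand-in for the framework.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word, reading the input line by line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each key; assumes the lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines: Iterable[str]) -> List[str]:
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["big data big value", "data value"])
```

Because the reducer relies only on sorted input, the same logic works whether it sees the whole data set or just the partition of keys assigned to it.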

After several years of development, the rich tool set of the Hadoop ecosystem is very popular, but the overall efficiency of Hadoop (for which Google has gradually adopted new internal alternatives) is low. The Hadoop MapReduce platform, for example, incurs heavy network and disk read/write overhead and struggles to implement parallel algorithms that need many iterations efficiently; it suits data-intensive scenarios with loose time requirements, such as long-running, large-scale offline log analysis. Algorithms such as machine learning, however, involve many steps of iterative computation, and can stop only after repeated iterations yield a sufficiently small error or sufficient convergence. If Hadoop's MapReduce framework is used, every iteration reads and writes disk and starts new tasks, causing very high I/O and CPU consumption, so Hadoop falls well short for iterative algorithms, machine learning in particular. In addition, each use case on Hadoop needs several different technology stacks, and across different usage scenarios the large number of solutions is often hard to manage. Organizations usually need to master several technologies in production, many of which have version compatibility problems and cannot share data quickly when working in parallel. Apache Spark is an open source cluster computing platform compatible with Hadoop. It is a general-purpose, Hadoop-MapReduce-like parallel framework developed by the UC Berkeley AMP Lab (the Algorithms, Machines, and People Lab of the University of California, Berkeley) and can be used to build large-scale, low-latency applications such as data analysis and machine learning. As part of the Berkeley Data Analytics Stack (BDAS), it is backed by the big data company Databricks and is a top-level Apache project.

In recent years, parallel algorithms for big data machine learning and data mining have become a research hotspot in science and technology. Many data processing frameworks have appeared, such as Apache Samza, Apache Storm and Apache Spark. Distributed computing can also be implemented in many ways: MPI, for example, is a good choice for CPU-intensive distributed scenarios, and many companies design their own distributed frameworks for their own business scenarios. The Parameter Server project, which Li Mu worked on while at Baidu, is one such good distributed framework, built to solve distributed computing problems in machine learning. Researchers and industry at home and abroad have concentrated on designing parallel algorithms on the Hadoop platform. Spark and Hadoop are both computing frameworks created to improve the efficiency of distributed computing, and they suit different application scenarios. Spark is an open source cluster computing environment similar to Hadoop; it supports iterative jobs on distributed data sets, complements Hadoop, and is a member of the Hadoop ecosystem, and it can run in parallel on the Hadoop file system through a third-party cluster framework named Mesos. In terms of communication, the JobTracker and TaskTracker of Hadoop's MapReduce framework communicate and pass data by heartbeat, which makes execution very slow. Spark, with its fully in-memory RDD computing model, has the excellent and efficient Akka and Netty communication systems and communicates efficiently. These differences make Spark perform better on certain workloads.

Apache Spark has the advantages of Hadoop MapReduce and adds in-memory distributed data sets. Besides supporting interactive queries (when analyzing large data sets interactively, data scientists can run ad-hoc queries on the data), it can also optimize iterative workloads: during processing, data is loaded into the distributed memory of the cluster without reading and writing HDFS, and can be transformed iteratively at speed and cached for frequent later access. This gives Apache Spark higher speed and performance and support for interactive computing and complex algorithms. (The official Spark home page shows a comparison of the Logistic Regression algorithm on Spark and Hadoop; in that scenario Spark is more than 100 times faster than Hadoop.) Apache Spark provides a number of libraries, including SQL, DataFrames, MLlib, GraphX and Spark Streaming, and can serve as a single general-purpose engine for all kinds of operations, including SQL queries, text processing and machine learning (before Spark, several different engines usually had to be learned to handle these needs separately). Developers can combine these libraries seamlessly in one application, and can build applications for any industry and scenario on the standard APIs for Java, Scala, Python and SQL (for interactive queries). Spark provides more than 80 high-level operators for ease of use and supports several resource managers, including Hadoop YARN, Apache Mesos and its own standalone cluster manager; it is compatible with the existing Hadoop v1 (SIMR) and 2.x (YARN) ecosystems, forming the Spark ecosystem and allowing organizations to migrate seamlessly. It is easy to download and install, and its convenient shell (REPL: Read-Eval-Print-Loop) allows the API to be learned interactively. The high-level architecture improves productivity so that effort can go into the computation: the high-level API removes the need to attend to the cluster itself, and Spark application developers can concentrate on the computation the application has to perform. Spark is implemented in Scala, and Scala makes application framework code very concise; Spark and Scala integrate tightly, so Scala can operate on distributed data sets as easily as on local collection objects. Spark has also gained increasingly good support from Mahout and, by version 1.5, had largely become a full platform. The Spark community has done much work to meet the needs of data analysts and algorithm engineers, such as support for Python and R (DataFrame was borrowed from R to make things convenient for R data scientists).
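
The benefit of keeping data in memory for iterative workloads can be illustrated without Spark. In the sketch below, `CountingSource` is a hypothetical stand-in for slow storage such as HDFS, `cached` plays the role of holding a data set in memory after the first load, and the iterative algorithm is a simple gradient descent that estimates a mean. None of these names come from Spark's API.

```python
from typing import Callable, List


class CountingSource:
    """Hypothetical slow storage that counts how often it is read."""

    def __init__(self, values: List[float]) -> None:
        self._values = values
        self.reads = 0

    def load(self) -> List[float]:
        self.reads += 1
        return list(self._values)


def cached(load: Callable[[], List[float]]) -> Callable[[], List[float]]:
    """Load once, then serve every later call from memory."""
    memo: List[List[float]] = []

    def wrapper() -> List[float]:
        if not memo:
            memo.append(load())
        return memo[0]

    return wrapper


def estimate_mean(load: Callable[[], List[float]], iterations: int, rate: float = 0.5) -> float:
    """Gradient descent on half the mean squared error; each step needs all the data."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= rate * gradient
    return estimate


uncached_source = CountingSource([1.0, 2.0, 3.0, 6.0])
mean_uncached = estimate_mean(uncached_source.load, iterations=30)

cached_source = CountingSource([1.0, 2.0, 3.0, 6.0])
mean_cached = estimate_mean(cached(cached_source.load), iterations=30)
```

Both runs reach the same estimate, but the uncached run reads storage once per iteration while the cached run reads it once in total, which is the saving the paragraph describes for iterative algorithms.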

Apache Spark has a natural advantage in both batch processing and real-time stream processing of data, and also contains many components for operating on data: GraphX supports built-in graph algorithms and is particularly suitable for datasets with many connected nodes; in terms of computation, Bagel is Pregel on Spark and can perform graph computation with Spark; it is a very useful small project whose bundled example implements Google's PageRank algorithm; Spark also ships with a Web UI: when a Spark application is running, the Web UI listens on port 4040 by default, and the user can view details and statistics about tasks and executors there, and can even see how much time a task spends in each execution stage, which helps the user optimize performance further. So that data analysts in general application fields can complete data analysis on the Spark platform with the R language they already know, Spark provides a programming interface called SparkR, which makes Spark's parallel programming interface and powerful computing capability easy to use from within the R environment; SparkR is an R package that provides a lightweight Spark front end for R; it provides a distributed data frame data structure, which removes the bottleneck that a data frame in R can only be used on a single machine; like the data frame in R, it supports many operations, such as select, filter and aggregate (similar to the functions in the dplyr package), which solves R's bottleneck at big-data scale well; SparkR also supports distributed machine learning algorithms, for example through the MLlib machine learning library; SparkR brings the vitality of the R language community to Spark and has attracted a large number of data scientists to start analyzing data directly on the Spark platform.
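The PageRank computation mentioned above for Bagel's bundled example can be illustrated with a minimal pure-Python sketch. It is not Bagel or GraphX code; the function name `pagerank`, the damping factor of 0.85, the iteration count and the four-page graph are all assumptions chosen for illustration.

```python
# Pure-Python PageRank by power iteration (illustrative only; not Bagel/GraphX code).
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for t in pages:
                    new_ranks[t] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this sketch the ranks always sum to 1, the heavily linked page `c` ends up with the highest rank, and page `d`, which nothing links to, ends up with the lowest.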

Spark-Shell is a great innovation (under the earlier shells, Python could only be used for single-machine work), and the advantages of the Scala language make writing a Spark program as easy as writing an ordinary shell script (or a similar Python program), while every line of code written can be run on a scale of a hundred or even thousands of machines; earlier statistics and machine learning relied on data sampling; from a statistical point of view, a sufficiently random sample reflects the result of the complete set very accurately, but in practice true randomness is often hard to achieve, so the result is usually quite inaccurate; it also often happens that an algorithm looks good when run on a single machine over a small dataset, yet behaves differently once it is run on the full data; big data now solves this problem by using the full data, and full data requires powerful processing capability; Spark not only has this processing capability, but Spark-Shell also brings the long-promised ad-hoc query, so that the result is seen at the moment the code is written and run in spark-shell; for example, for the millions of CSDN blog posts, the full data can be run directly to see the effect; sampling in Spark is also very convenient: the sample function can take as much or as little as desired; for blog post data of several tens of GB, for example, a count completes within a few seconds after caching, and a complete count without caching takes only a little over ten seconds; Spark also provides algorithms, the most commonly used being Bayes, Word2vec and linear regression; spark-shell is the first choice of algorithm engineers and analysts, and it can be said that Docker has overturned deployment while spark-shell has overturned the routine work of algorithm engineers.
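The contrast drawn above between estimating from a sample and computing on the full data can be sketched in plain Python. This is not Spark's sample function; `sample_fraction`, the seed, the 1% fraction and the synthetic `posts` list are assumptions made for illustration.

```python
import random


def sample_fraction(records, fraction, seed=42):
    """Keep each record independently with probability `fraction` (Bernoulli sampling)."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


# Hypothetical data set: 10,000 posts, exactly 30% of them labeled "long".
posts = ["long" if i % 10 < 3 else "short" for i in range(10_000)]

# Exact answer computed on the full data.
full_share = posts.count("long") / len(posts)

# Estimate computed on roughly 1% of the data.
sampled = sample_fraction(posts, fraction=0.01)
sample_share = sampled.count("long") / len(sampled)
```

The full-data share is exact, while the sampled share only approximates it and varies with the seed and the sample size, which is the inaccuracy the paragraph above attributes to sampling-based analysis.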

Apache Spark persists (caches) RDDs (resilient distributed datasets) in cluster memory, which greatly improves interactive speed. The storage level can be specified conveniently through the persist() operation of an RDD, and cache() corresponds to persist() with the MEMORY_ONLY option; Spark uses the Least Recently Used (LRU) algorithm to evict old, rarely used RDDs from the cache in order to free more memory; an unpersist() operation is also provided to forcibly remove cached or persisted RDDs; the RDD (complemented since Spark 1.3 by the higher-level DataFrame API) is the core concept of Apache Spark; it is an immutable distributed collection of data on which two main kinds of operation are performed: transformations and actions; transformations are operations such as filter(), map() or union() that produce another RDD from an RDD, while actions are operations such as count(), first(), take(n) and collect() that trigger a computation and return a value to the Master or to a stable storage system; transformations are generally lazy and are executed only when an action is executed; the Spark Master/Driver keeps the transformations applied to each RDD, so if an RDD is lost (that is, a slave fails), it can be rebuilt quickly and easily on a surviving host in the cluster, which is where the resilience of the RDD lies.

For example, to find the 5 most common words in a text, the following solution can be used: a command reads the text and creates an RDD of strings in which each entry represents one line of the text, and transformations and actions are then chained through the simple Scala API; some words may be separated by more than one space, which produces empty strings, so filter(!_.isEmpty) is used to filter them out; each word is mapped to a key-value pair with map(word => (word, 1)); to add up all the counts, a reduce step is called, reduceByKey(_ + _) (where _ + _ conveniently sums the values of each key); once the words and their respective counts have been obtained, they must be sorted by count (sortByKey sorts by key, not by value), so map { case (word, count) => (count, word) } is used to turn (word, count) into (count, word); to find the 5 most common words, sortByKey(false) sorts the counts in descending order; the command ends with .take(5) (an action, which triggers the computation) and outputs the 5 most common words in the text /Users/akuntamukkala/temp/gutenburg.txt; the same function can be implemented in the Python shell, and the RDD lineage can be traced with toDebugString.
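The word-count steps described above can be mirrored one by one in plain Python. This is a sketch that runs without Spark, not the Scala command the text refers to; the function name `top_words` and the two sample lines are assumptions, and each comment names the Spark operation the step stands in for.

```python
def top_words(lines, n=5):
    """Return the n most frequent words as (count, word) pairs, most frequent first."""
    # flatMap: split every line on single spaces (double spaces yield empty strings).
    words = [w for line in lines for w in line.split(" ")]
    # filter(!_.isEmpty): drop the empty strings.
    words = [w for w in words if w]
    # map(word => (word, 1)): one key-value pair per word.
    pairs = [(w, 1) for w in words]
    # reduceByKey(_ + _): sum the values of each key.
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    # map { case (word, count) => (count, word) }: make the count the key.
    swapped = [(count, word) for word, count in counts.items()]
    # sortByKey(false): sort by count, descending.
    swapped.sort(key=lambda pair: pair[0], reverse=True)
    # take(n): keep the first n results.
    return swapped[:n]


# Hypothetical input; note the double space in the first line.
lines = ["the quick  brown fox", "the lazy dog and the fox"]
print(top_words(lines, 2))  # prints [(3, 'the'), (2, 'fox')]
```

Unlike Spark, this sketch evaluates every step eagerly; in Spark the transformations would only be recorded, and nothing would run until the final take action.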

Shark essentially provides, on the basis of the Spark framework, the same HiveQL command interface as Hive; to remain as compatible with Hive as possible, Shark uses the Hive API for query parsing and logical plan generation, and only in the final physical plan execution stage replaces Hadoop MapReduce with Spark; by configuring Shark parameters, Shark can automatically cache specific RDDs in memory, enabling data reuse and thereby accelerating retrieval of specific datasets; at the same time, Shark implements specific data analysis and learning algorithms through user-defined functions (UDFs), so that SQL data queries and analytical computation can be combined and the reuse of RDDs is maximized; Spark has since ended Shark and started Spark SQL, which clearly does not stop at real-time computation but aims at general big data processing; the SQL interface that ships with Apache Spark lets users interact with data directly through SQL queries, and these queries are processed entirely by Spark's execution engine; through the Spark engine, Spark SQL provides a convenient way to perform interactive analysis, using a type of RDD called SchemaRDD; a SchemaRDD can be created from existing RDDs or from other external data formats, such as Parquet files or JSON data, or by running HQL on Hive; a SchemaRDD is very similar to a table in an RDBMS, and once data has been imported into a SchemaRDD, the Spark engine can process it in batch or stream mode; Spark SQL provides two kinds of context, SQLContext and HiveContext, which extend the functionality of SparkContext; SQLContext provides access to a simple SQL parser, while HiveContext provides access to the HiveQL parser and allows enterprises to make use of their existing Hive infrastructure. A simple SQLContext example is as follows: a SQLContext is created from the SparkContext, the input file is read, every line is converted into a record of the SchemaRDD, and male users under 30 years old are queried with a simple SQL statement.
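The flow of that example (read lines, convert each line into a record, query with a simple SQL statement) can be sketched without Spark. Here Python's built-in sqlite3 module stands in for SQLContext purely so that the sketch runs anywhere; the table name `users`, the `name,gender,age` line format and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical input lines in "name,gender,age" form.
lines = ["alice,F,28", "bob,M,25", "carol,F,41", "dave,M,35", "erin,M,19"]

# An in-memory SQLite table stands in for the SchemaRDD registered as a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, gender TEXT, age INTEGER)")

# Convert every line into a typed record.
records = [(n, g, int(a)) for n, g, a in (line.split(",") for line in lines)]
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)

# Query male users under 30 with a simple SQL statement.
young_men = [row[0] for row in conn.execute(
    "SELECT name FROM users WHERE gender = 'M' AND age < 30 ORDER BY name")]
conn.close()

print(young_men)  # prints ['bob', 'erin']
```

The SQL statement itself is the part that carries over: in Spark SQL the same kind of query would be submitted through the SQL context and executed by the Spark engine across the cluster.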

Spark provides Broadcast Variables, which are broadcast to the slave nodes so that RDD operations on those nodes can access the broadcast values quickly. In actual production, joining data in RDDs on a specified key is a very common scenario, and it may then be necessary to send a large dataset to the slave nodes (which host the data to be joined); because network I/O is hundreds of times slower than memory access, this is likely to become a huge performance bottleneck. For example, suppose the transportation cost of all route entries in a file is to be calculated, and a look-up table specifies the cost of each transportation type; this look-up table can serve as a Broadcast Variable, and the costs of all transports are then summed by a command that uses an accumulator. Spark also provides a very convenient way to avoid mutable counters and counter synchronization problems: Accumulators. An Accumulator is initialized with a default value in a Spark context; these counters can be used on the slave nodes, but the slave nodes cannot read them; their role is to collect atomic updates and forward them to the Master, which is the only node that can read them and compute the aggregate of all updates.
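The transport-cost example above can be sketched in plain Python to show the two roles: a read-only look-up table shared with every worker, and an add-only counter that only the driver reads. The class `Accumulator`, the rates in `COST_PER_KM` and the two partitions of routes are hypothetical and are not Spark's API.

```python
class Accumulator:
    """Add-only counter: workers call add(); only the driver reads .value."""

    def __init__(self, initial=0.0):
        self._value = initial

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


# Read-only look-up table, shipped once to every worker (the "broadcast variable").
COST_PER_KM = {"truck": 1.5, "rail": 0.5, "air": 4.0}  # hypothetical rates


def process_partition(routes, lookup, acc):
    """Worker-side logic: look up the rate of each route and add its cost."""
    for mode, km in routes:
        acc.add(lookup[mode] * km)


# Two partitions of (transport type, distance in km) route entries.
partitions = [[("truck", 100), ("rail", 400)], [("air", 50), ("truck", 20)]]

total_cost = Accumulator()
for part in partitions:
    process_partition(part, COST_PER_KM, total_cost)

print(total_cost.value)  # prints 580.0
```

In a real cluster the look-up table would be sent to each node once rather than with every task, and the partial sums would be merged on the driver; this single-process sketch only shows the division of responsibilities.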

Spark Streaming is a framework built on Spark for processing stream data; it provides a scalable, fault-tolerant and efficient way to process streaming data while using Spark's simple programming model; its basic principle is to divide stream data into small time slices (of a few seconds) and to process each small portion of data in a manner similar to batch processing (the stream is converted into micro-batches, so that Spark's batch programming model can be applied to streaming use cases); the Spark Streaming module provides a set of APIs for writing applications that operate on real-time data streams: the incoming data stream is first divided into micro-batches, and operations are then performed on the data; Spark Streaming is built on Spark for two reasons: on the one hand, Spark's low-latency execution engine (100 ms and above), although not as fast as dedicated stream processing software, can still be used for real-time computation; on the other hand, compared with record-based processing frameworks (such as Storm), RDD datasets with narrow dependencies can be recomputed from the source data to achieve fault tolerance; moreover, the micro-batch approach makes the logic and algorithms of batch processing and real-time processing compatible, which suits applications that need to analyze historical data together with real-time data; with this unified programming model, Spark integrates batch processing and interactive stream analysis well. The core abstraction in Spark Streaming is the Discretized Stream (DStream); a DStream consists of a series of RDDs, each of which contains the data that arrived within a specified (configurable) time interval; Spark Streaming converts the incoming data into a series of RDDs and then into a DStream, each RDD containing, for example, two seconds of data (the configured interval length); in Spark Streaming the minimum interval length can be set to 0.5 seconds, so the processing delay can be kept below 1 second; Spark Streaming also provides window operators, which make it efficient to compute over a group of RDDs (a rolling window of time); in addition, DStream provides an API whose operators (transformations and output operators) help the user operate on the RDDs directly.
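The micro-batch principle and the rolling window described above can be sketched in plain Python. The functions `to_micro_batches` and `windowed_counts`, the two-second interval and the sample events are assumptions for illustration and are not the DStream API.

```python
from collections import defaultdict


def to_micro_batches(events, interval):
    """Group (timestamp_seconds, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    if not batches:
        return []
    # Intervals in which nothing arrived become empty batches.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def windowed_counts(batches, window):
    """Count the elements in each rolling window of `window` consecutive batches."""
    return [sum(len(b) for b in batches[max(0, i - window + 1): i + 1])
            for i in range(len(batches))]


# Hypothetical event stream: (arrival time in seconds, payload).
events = [(0.2, "a"), (1.1, "b"), (1.9, "c"), (2.5, "d"), (6.3, "e")]

batches = to_micro_batches(events, interval=2.0)
print(batches)                      # prints [['a', 'b', 'c'], ['d'], [], ['e']]
print(windowed_counts(batches, 2))  # prints [3, 4, 1, 1]
```

Each inner list plays the role of one RDD in the DStream, and the window function shows why a computation over several intervals only needs the batches that fall inside the window.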

MLlib (Machine Learning library) is the machine learning library for Spark and Hadoop environments under Apache; it is designed to run most of the common machine learning algorithms it contains at scale and at high speed; it is a JVM-based project that can also be conveniently interfaced with languages such as Python, and users can design and write their own MLlib code, so it is very flexible; MLlib makes full use of Spark's fast in-memory computation and efficient iteration, which greatly shortens the model computation time of machine learning and raises model computation performance to another level. MLlib provides a set of APIs mainly for running machine learning algorithms on large datasets, and occupies a key position in the Spark ecosystem as a whole; it is a library of implementations of common machine learning algorithms that also includes the related tests and data generators; it supports common machine learning problems such as classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as the underlying optimization primitives, and its algorithms can be extended; MLlib is based on RDDs and integrates seamlessly with Spark SQL, GraphX and Spark Streaming, so that with the RDD as the cornerstone these sub-frameworks can jointly form a big data computing center; MLlib is one part of MLBase, which is divided into four parts: MLlib, MLI, ML Optimizer and MLRuntime; ML Optimizer selects the internally implemented machine learning algorithm and related parameters that it considers most suitable for processing the data input by the user, and returns a model or other results that help the analysis; MLI is an API or platform for implementing algorithms with feature extraction and high-level ML programming abstractions; MLRuntime is based on the Spark computing framework and applies Spark's distributed computing to the field of machine learning.

The MLlib architecture mainly comprises three parts: the underlying foundation, which includes Spark's runtime library, the matrix library and the vector library; the algorithm library, which includes generalized linear models, recommender systems, clustering, decision trees and evaluation algorithms; and the utilities, which include functions such as generating test data and reading external data.

The underlying foundation mainly comprises the vector interface and the matrix interface, both of which use Breeze, a linear algebra library developed in Scala on the basis of Netlib and BLAS/LAPACK; MLlib supports local dense vectors and sparse vectors as well as labeled points, and supports both local matrices and distributed matrices; the supported distributed matrices are RowMatrix, IndexedRowMatrix, CoordinateMatrix and others; regarding dense and sparse vectors, a sparse vector with a large number of zero elements saves a great deal of space and substantially increases computation speed; the labeled point (LabeledPoint) is also widely used in practice: for example, when judging whether a mail is spam, the label 1.0 can denote normal mail and the label 0.0 can denote spam; among the matrices, a RowMatrix is defined directly through an RDD[Vector] and can compute statistics such as the mean, variance and covariance; an IndexedRowMatrix is a matrix with row indices, but it can be converted to a RowMatrix with the toRowMatrix method in order to use its statistical functions; a CoordinateMatrix is typically used for computations with high sparsity and is constructed from an RDD[MatrixEntry], where a MatrixEntry is a tuple-type element containing the row, the column and the element value.

The core of the MLlib algorithm library is a set of algorithms commonly used in Spark. Classification algorithms belong to supervised learning: a classification function or classification model is built from samples whose class labels are known, and the model is then applied to classify the data in the database whose class labels are unknown; classification is an important task in data mining and currently has the most commercial applications, with typical scenarios including churn prediction, precision marketing, customer acquisition and personal preference; the classification algorithms supported by MLlib include logistic regression, support vector machines, naive Bayes and decision trees; in a typical case, a training dataset is imported, the training algorithm is executed on the training set, and finally predictions are made with the resulting model and the training error is computed.

Regression algorithms also belong to supervised learning: each individual has a real-valued label associated with it, and the aim is that, given numerical features representing these entities, the predicted label values are as close as possible to the actual values; the regression algorithms supported by MLlib include linear regression, ridge regression, Lasso and decision trees; in a typical case, a training dataset is imported and parsed into an RDD of labeled points, a simple linear model is built with the LinearRegressionWithSGD algorithm to predict the label values, and finally the mean squared error is computed to assess how well the predicted values agree with the actual values.

Clustering algorithms belong to unsupervised learning and are normally used for exploratory analysis; following the principle that "birds of a feather flock together", samples that have no category of their own are gathered into different groups; such a set of data objects is called a cluster, and each such cluster is then described; the aim is that samples belonging to the same cluster are similar to one another while samples of different clusters are sufficiently dissimilar; typical scenarios include customer segmentation, customer research, market segmentation and value assessment; MLlib currently supports the widely used KMeans clustering algorithm; in a typical case, a training dataset is imported and a KMeans object is used to cluster the data into two clusters, the required number of clusters being passed to the algorithm, and the Within Set Sum of Squared Errors (WSSSE) is then computed; the error can be reduced by increasing the number of clusters k, but the optimal k is usually the one at the "elbow" point of the WSSSE graph.

Collaborative filtering is often applied in recommender systems; these techniques aim to fill in the missing entries of the user-item association matrix; MLlib currently supports model-based collaborative filtering, in which users and items are described by a small set of latent factors, and these factors are also used to predict the missing entries; in a typical case, a training dataset is imported in which every row consists of a user, an item and the corresponding rating. Assuming the ratings are explicit, the default ALS.train() method is used, and the recommendation model is evaluated by computing the mean squared error of the predicted ratings.

The utilities part includes data validators, binary and multiclass label analyzers, a variety of data generators, and data loaders.
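The relationship between the number of clusters k and the WSSSE described above can be tried out with a small pure-Python k-means. This is a sketch and not MLlib's KMeans: the functions `kmeans` and `wssse`, the naive initialization from the first k points and the six sample points are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive start: the first k points (an assumption)."""
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    return centers


def wssse(points, centers):
    """Within Set Sum of Squared Errors: squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

errors = {k: wssse(points, kmeans(points, k)) for k in (1, 2, 3)}
```

On this data the error drops sharply from k = 1 to k = 2 and does not improve further at k = 3, so k = 2 is the "elbow"; with other data or another initialization the curve will look different.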
Apache Spark 2.0 builds on the past experience of the Spark community and has many new characteristics.

Standard SQL support and a more streamlined API: the APIs created by Spark have always been simple, intuitive and easy to use, and Spark 2.0 continues this tradition and stands out in two respects: support for standard SQL, and the unification of the DataFrame and Dataset APIs; the SQL functionality has been greatly expanded, with a new ANSI SQL parser and support for subqueries; Spark 2.0 can run all 99 TPC-DS queries; since SQL is one of the primary interfaces used by Spark applications, the expanded SQL functionality substantially reduces the work required to port legacy applications to Spark; the programming API has been streamlined: DataFrames and Datasets are unified in Scala and Java, and since Spark 2.0 a DataFrame is a type alias for a Dataset of Row; both typed methods such as map, filter and groupByKey and untyped methods such as select and groupBy are available on the Dataset class; this newly added Dataset interface serves as the abstraction for Structured Streaming; because compile-time type safety is not a language feature of Python and R, the Dataset concept does not apply to the APIs of these languages; there the DataFrame remains the main programming abstraction, similar to the single-node data frames of these languages; SparkSession, as a new entry point, replaces the original SQLContext and HiveContext; for users of the DataFrame API, a common source of confusion in Spark was which "context" to use; from now on SparkSession can be used as a single entry point that covers both; the original SQLContext and HiveContext are retained for backward compatibility. A new Accumulator API has been designed, with a simpler type hierarchy and specialized support for primitive types; the original Accumulator API is deprecated but is retained for backward compatibility; the DataFrame-based machine learning API becomes the main ML API: in Spark 2.0 the spark.ml package and its "pipeline" API are the primary machine learning API, and although the original spark.mllib package is retained, future development will concentrate on the DataFrame-based API; machine learning pipeline persistence: users can now save and load machine learning pipelines and models in all languages supported by Spark; distributed algorithms in the R language: support has been added for generalized linear models (GLM), Naive Bayes, survival regression and the K-Means clustering algorithm.

Spark as a compiler: Spark 2.0 is equipped with the second-generation Tungsten engine, which is built on ideas from modern compilers and MPP databases; the main idea in applying them to data processing is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and using CPU registers for intermediate data; this technique is called "whole-stage code generation", and with it execution is about an order of magnitude faster than before; the Catalyst optimizer has also been improved for common query optimizations such as nullability propagation; in addition, the vectorized Parquet decoder has been improved, increasing its throughput threefold.

As the first tool that attempted to unify batch processing and stream processing, Spark Streaming has long been a leader in big data processing; the first streaming API, called DStream, was introduced in Spark 0.7 and gave developers several powerful properties: exactly-once semantics, fault tolerance at scale, and high throughput; Spark 2.0 addresses these use cases with a new API, the Structured Streaming module, which offers three main improvements over existing streaming systems. An API integrated with batch jobs: to run a computation over streaming data, the developer writes a batch computation against the DataFrame/Dataset API, which is very simple, and Spark automatically executes the computation in a streaming fashion, that is, it updates the result in real time as data arrives; this design means the developer does not have to manage state and failures by hand or keep the application in sync with batch jobs, all of which the system handles automatically; furthermore, a batch job over the same data gives the same result. Transactional interaction with storage systems: Structured Streaming handles fault tolerance and persistence across the whole engine and the storage systems, which makes it easy for programmers to write applications that reliably update a database serving real-time data, or that join in static or moving data. Deep integration with the other components of Spark: Structured Streaming supports interactive queries on streaming data through Spark SQL, joins against static data, and many libraries that already use DataFrames, and it lets developers build complete applications rather than only streaming pipelines; further integrations with MLlib and other libraries are expected in the future. Spark 2.0 ships an alpha version of the Structured Streaming API, which is a (very small) extension to the DataFrame/Dataset API; thanks to this unification it is very easy for existing Spark users to adopt, since they can apply their knowledge of the Spark batch API to new real-time problems; the key features include support for processing based on event time, out-of-order and delayed data, sessionization, and tight integration with non-streaming data sources and sinks; the Databricks workspace has also been updated to support Structured Streaming: for example, when a streaming query is started, the notebook UI automatically displays its status. Users initially adopted Spark for its ease of use and high performance; Spark 2.0 doubles down on these aspects and adds support for a wider range of workloads.

Apache Spark 2.2.0 is an important milestone for Structured Streaming, because Structured Streaming can finally be used formally in production environments: the experimental tag has been removed, arbitrary stateful operations are supported in the streaming system, and the streaming and batch APIs for Apache Kafka 0.10 support both read and write operations; beyond new features, much of the work in this release went into the usability, stability and polish of SparkR, MLlib and GraphX, and more than 1100 tickets were resolved; the new features include: Structured Streaming being production ready, expanded SQL functionality, new distributed machine learning algorithms in R, and new algorithms in MLlib and GraphX.
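The property stated above, that a computation written once gives the same result whether it runs over the full data as a batch job or incrementally as data arrives, can be sketched in plain Python. The function `batch_count`, the class `StreamingCount` and the sample stream are hypothetical and do not use the Structured Streaming API.

```python
from collections import Counter


def batch_count(records):
    """Batch job: count every key over the complete, static data set."""
    return dict(Counter(records))


class StreamingCount:
    """Streaming job: keep running counts as state and update them per micro-batch."""

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        self.state.update(micro_batch)
        return dict(self.state)  # result table after this trigger


# Hypothetical stream of four micro-batches, one of them empty.
stream = [["a", "b"], ["a"], [], ["c", "a"]]

query = StreamingCount()
snapshots = [query.process(batch) for batch in stream]

# The same records processed in one go as a batch.
all_records = [r for batch in stream for r in batch]
batch_result = batch_count(all_records)
```

After the last micro-batch the streaming result equals the batch result, while the earlier snapshots show the result table being updated as data arrives.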

Structured Streaming was introduced in Spark 2.0 and provides a high-level API for building streaming applications; its purpose is to offer a simple way to build end-to-end streaming applications with consistency guarantees and fault tolerance; since Spark 2.2.0, Structured Streaming has been ready for production use; besides the removal of the experimental tag, this release includes several high-level changes, such as: Kafka Source and Sink: the streaming and batch APIs for Apache Kafka 0.10 support both read and write operations; Kafka Improvements: for Kafka-to-Kafka stream operations, the producer in Kafka is cached to achieve low latency; Additional Stateful APIs: the [flat]MapGroupsWithState operations support complex stateful processing and timeout handling; Run Once Triggers. Since the release of Spark 2.0, Spark has become one of the most feature-rich and standards-compliant SQL query engines in the big data field; it can connect to a variety of data sources and execute SQL-2003 standard statements on them, including analytic functions and subqueries; Spark 2.2 also adds many new SQL features, including: API updates: the CREATE TABLE syntax for data source tables and Hive serde tables has been unified; SQL queries support broadcast hints such as BROADCAST, BROADCASTJOIN and MAPJOIN; overall performance and stability: cost-based optimizer cardinality estimation is supported for the filter, join, aggregate, project and limit/sample operations; star-schema heuristics are used to improve TPC-DS performance; the file listing and IO performance of CSV and JSON has been improved; HiveUDAFFunction supports partial aggregation; an aggregation operator based on JVM objects has been introduced; other notable changes: parsing of multi-line JSON and CSV files is supported, as are analyze commands on partitioned tables. The last major group of changes in Spark 2.2.0 concerns advanced analytics; MLlib and GraphX add the following new algorithms: Locality Sensitive Hashing, Multiclass Logistic Regression, and Personalized PageRank; Spark 2.2.0 also adds the following distributed algorithms to SparkR: Alternating Least Squares (ALS), Isotonic Regression, Multilayer Perceptron Classifier, Random Forest, Gaussian Mixture Model, Latent Dirichlet Allocation (LDA), Multiclass Logistic Regression, and Gradient Boosted Trees; the Structured Streaming API supports the R language, to_json and from_json are supported in R, and multi-column approxQuantile is supported; with the addition of these algorithms, SparkR has become the most comprehensive distributed machine learning library in R.
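The kind of stateful processing with timeouts mentioned above for the [flat]MapGroupsWithState operations can be illustrated with a small sessionization sketch in plain Python. It is not that API: the function `sessionize`, the 30-second timeout and the sample events are assumptions, and events are assumed to arrive sorted by timestamp.

```python
def sessionize(events, timeout):
    """Group (user, timestamp) events into sessions closed after `timeout` idle seconds.

    Returns (user, start, end, event_count) tuples in the order sessions are closed.
    """
    open_sessions = {}  # per-key state: [start, last_seen, count]
    closed = []
    for user, ts in events:
        state = open_sessions.get(user)
        if state is not None and ts - state[1] > timeout:
            # The key was idle for too long: emit the old session and reset its state.
            closed.append((user, state[0], state[1], state[2]))
            state = None
        if state is None:
            open_sessions[user] = [ts, ts, 1]
        else:
            state[1] = ts
            state[2] += 1
    # End of input: flush the sessions that are still open.
    for user, (start, end, count) in open_sessions.items():
        closed.append((user, start, end, count))
    return closed


# Hypothetical click events: (user id, time in seconds).
events = [("u1", 0), ("u2", 1), ("u1", 5), ("u1", 40), ("u2", 50)]
sessions = sessionize(events, timeout=30)
print(sessions)
# prints [('u1', 0, 5, 2), ('u2', 1, 1, 1), ('u1', 40, 40, 1), ('u2', 50, 50, 1)]
```

The dictionary of open sessions plays the role of the per-key state that a streaming engine would keep between micro-batches; in this sketch a timeout is only noticed when the next event for the same key arrives.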

Although Apache Spark gained great popularity within a short period of time, it is not flawless. Apache Spark mainly supports three distributed deployment modes, namely standalone, Spark on Mesos and Spark on YARN; standalone deployment is the simplest and most direct, while the latter two are more complex and difficult for inexperienced newcomers, who may run into problems when installing dependencies; if the configuration is incorrect, a Spark application may work in standalone mode yet encounter problems such as classpath exceptions when run in cluster mode; Apache Spark exists to process massive data, and its higher computational efficiency comes at the cost of memory consumption (the memory of a Spark cluster can reach several TB or even hundreds of TB), so monitoring and measuring memory usage is vital; many settings in Spark can be adjusted to the use case, and the default configuration is not necessarily the best, so users are advised to read the Spark memory configuration documentation carefully and make timely adjustments according to their own needs to avoid memory problems; both the 1.x.x and the 2.x.x versions of Apache Spark have always followed a release cycle of three to four months; although this rapid iteration shows the vitality of Spark and the development capability of its contributors, it also means changes to the API; for users who do not want the API to change, frequent releases become a serious problem, and additional expense may even have to be incurred to ensure that Spark applications are not affected by API changes; Apache Spark supports Scala, Java and Python, which is very friendly to developers, but the three do not have equal status: especially where new features are concerned, Java and Scala are always updated first, while the Python library needs some time to catch up with the newest APIs and features, so when choosing the latest version of Spark, users should first consider implementing in Java or Scala, and if they choose Python they must check whether the required feature or API is supported; documentation, tutorials and code walkthroughs are extremely important for helping newcomers build their skills quickly or deploy a system; although samples are distributed together with the Apache Spark documentation, most of the examples are very basic, in-depth samples of good quality are rare, and their reference value is limited; in addition, Apache Spark is a good big data processing framework, but if the data does not reach a certain order of magnitude, a simpler solution can be chosen instead.

Artificial intelligence has risen and fallen over the past fifty to sixty years, but its truly large-scale application began with the Internet, that is, from the year 2000 (the years 2000 to 2009 being regarded as the Internet era). Search, advertising and e-commerce all use a great deal of artificial intelligence technology, but these technologies run mainly in the background and are not easily noticed. Since 2010, the emergence of big data and the development of computing power, bandwidth and deep learning have moved artificial intelligence from the background to the foreground. Speech recognition is now within easy reach and developing very quickly, and the same is true of image processing, natural language understanding and robotics. There are still many different views on artificial intelligence, such as the distinction between strong and weak artificial intelligence, and many films and novels have depicted it, but to this day there is no generally accepted unified definition. In terms of convergence, the development of artificial intelligence technology promotes the deep integration of many sciences with network technology. In the United States, Europe and Japan, for example, artificial intelligence technology has developed rapidly and driven the development of many fields of information science, and technical breakthroughs in informatics, control science, bionics, computer science and other fields have been applied in artificial intelligence applications. In terms of technological development, many artificial intelligence technologies have remained at the forefront of innovation and can largely influence the direction of the information industry. The most important reason artificial intelligence has received unprecedented attention from all sectors of society is big data; the second reason is computing power; the third is deep learning.

Deep learning has brought a new wave of machine learning and ushered in the era of "big data + deep models + data discovery and mining". As a form of machine learning it in effect provides a framework, a kind of language system, with two conditions: first, one must be able to control the model and the computation; second, one must understand the problem well enough. One of the most successful examples is the convolutional neural network, which owes a great deal to our understanding of the visual system, especially of the cells of the early visual cortex. From the viewpoint of statistics and computation, the principle of deep learning is also very basic: a machine learning system should be able to decompose, understand and control each source of its error, and thereby control the overall prediction error. Machine learning generally needs to make some assumptions, but no assumption is perfect (the model is imperfect, the data is imperfect and the computation is imperfect; statistics is usually concerned with the first two). In practice problems must be handled with limited computing resources, so imperfect computation has to be considered, and with unbiased big data an extremely complex model is needed to reduce bias. Compared with traditional artificial intelligence, deep learning can therefore absorb the benefit of growing data, whereas a traditional model may not be complex enough (for a linear model, for example, the bias becomes larger as the data volume grows), or the model may be good but the computation intractable. At present, large companies keep exploring the field of deep learning. Research focuses on building "neural networks" that simulate the way the human brain analyzes and learns, which has become a focal point of machine learning research. The giant companies that possess powerful computing capability and data resources have, because of deep learning, entered a new stage of all-round competition for the lead. Google's Silicon Valley laboratory has upgraded its deep learning algorithms to parse the relationships among web content, and has studied machine learning that imitates the intelligent activity of the human brain in depth, so that machines can recognize images and understand natural language as the human brain does.

Ordinary data is redundant and messy and lacks sufficient dimensions and diversity, so it cannot cover all the boundary cases that may occur in the real world; using data effectively and realizing more of its value is not easy. Data keeps growing in scale, and the value of big data and large-scale data is increasingly prominent. Organizations (enterprises, government departments and so on) need to use data and big data analysis (at low cost, being the first to hold big data, revitalizing these data assets, finding their inherent laws, and developing big data application models and business models for different fields). However, the amount of information far exceeds the capacity of most existing enterprise IT architectures and infrastructure, and real-time requirements also greatly exceed existing computing capability. The velocity, volume and variety of big data magnify the problems of storage, computation, security and privacy, analysis and application, for example the need for cloud computing infrastructure that scales massively, the diversity of data sources and formats, the streaming nature of data acquisition, and the huge capacity required for migration within the cloud. Machine learning models that combine big data with deep neural networks are huge, with very large data volume and computation and long training times. Training needs to be accelerated by parallel computation on distributed systems, and the design of distributed systems raises a series of problems of its own. The model's large amount of computation parallelism and data parallelism is well suited to acceleration on GPU hardware, but big data problems also run into bottlenecks caused by the way the algorithm itself is partitioned for parallel execution. Hardware systems also need a certain degree of artificial intelligence capability.

Summary of the invention

The problem to be solved is how to meet the dual challenge that data (especially big data) poses in technology and in application, so that the collection, management, analysis, mining and application of data can be carried out better, more easily and more efficiently, and the value of data can be realized more effectively.

"Digitization" is a manifestation or characteristic of the current stage of Internet development. Whether the data is traditional data, mass data or big data, the core is the data itself, with its growing scale, increasingly complex structure and diverse value. The differences lie mainly in source, characteristics, applicable tools and value.

This scheme provides a method for effectively realizing the value of data. Following a specified method or model, different functional units (also called services) are connected through well-defined interfaces or contracts to form a system. The different functional units (services) can be the individual processes of data processing and application.

The specified method or model has several key points. Application software is designed from the perspective of service integration, with consideration given to reusing existing services and preference given to open-source, replaceable technologies and methods (such as message mechanisms). A complete set of development patterns is described to help applications connect to services; these patterns define a series of mechanisms for describing services, advertising and discovering services, and communicating with services, such as service APIs. A service can be defined as a function and at the same time exposed externally as an object, an application and so on, which makes it adaptable to any existing system. With services at the core, applications are brought together to provide richer and more purposeful business processes that reflect, and combine with, the business model more faithfully. Combining services with business and processes represents the business model more precisely and supports business processes better (technology viewed from the perspective of the business process). Processes give life to the components of the business model, define the relationships among them more clearly, and define the specific ways of interacting with the business model, ensuring that modules and applications can call one another easily. Services in the various systems built this way can interact in a unified and general manner. The method or model can be SOA, or SOA can be integrated with the system architecture so that the two merge and reinforce each other.

Interfaces or contracts can be defined in a neutral manner, independent of the hardware platform, operating system and programming language used to implement the service. For example, a service can be packaged as a set of APIs, which can also be opened to third parties as an open API.
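The idea of connecting functional units through a defined contract can be sketched in a few lines. The following is only an illustration under stated assumptions: the `DataService` contract, the `handle` method, the two example services and the sample records are all hypothetical names invented for this sketch and do not come from the scheme itself.

```python
from typing import Iterable, List, Protocol


class DataService(Protocol):
    """Hypothetical contract: every service exposes the same neutral call."""

    name: str

    def handle(self, records: List[dict]) -> List[dict]:
        ...


class CollectService:
    """Stand-in for a data collection step."""

    name = "collect"

    def handle(self, records: List[dict]) -> List[dict]:
        # Keep only records that actually carry a value.
        return [r for r in records if r.get("value") is not None]


class AnalyzeService:
    """Stand-in for a data analysis step."""

    name = "analyze"

    def handle(self, records: List[dict]) -> List[dict]:
        # Attach each record's share of the total as a derived field.
        total = sum(r["value"] for r in records)
        if not total:
            return records
        return [{**r, "share": r["value"] / total} for r in records]


def run_pipeline(services: Iterable[DataService], records: List[dict]) -> List[dict]:
    # The caller depends only on the contract, not on any concrete service.
    for service in services:
        records = service.handle(records)
    return records


result = run_pipeline(
    [CollectService(), AnalyzeService()],
    [{"id": 1, "value": 30}, {"id": 2, "value": None}, {"id": 3, "value": 10}],
)
```

Because `run_pipeline` relies only on the contract, a service could be replaced by another implementation, or by a remote API call, without changing the caller.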

Once each process of data processing and application has been turned into a data service, it can be used more widely, so the value of data is realized more effectively. Implementing a data service first requires making the data domain explicit, that is, classifying the data. This is the basis for implementing any data technology: it helps match the various processing modes and infrastructure (tools or platforms) to the requirements that each type of data places on them. The data type also determines which architecture (tool) is needed for storage, processing, analysis and application. The architecture can be an open-source or commercial system, framework, library, toolkit and so on.

By the time span required to process it, data can be classified as real-time (financial streams, complex event processing, intrusion detection, fraud detection), near real-time (advertisement delivery), batch (retail, forensics, bioinformatics, geodata, historical data of many types) and so on. When a data technology platform (especially a big data platform) is selected for an application, one principal factor is the latency requirement. Latency is the time needed to complete the computation required to make a decision, and two cases need particular consideration: low latency and high latency. A low-latency application can be taken to be one whose response time must be within a few tens of milliseconds, that is, one that needs a "real-time" response; anything else can be regarded as high latency. If low latency is not required, the traditional approach of collecting the data on disk or in memory and then computing over it is sufficient. A low-latency requirement generally means that data must be processed as soon as it arrives.
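The latency-based choice described above can be written as a small decision rule. This is a minimal sketch, not part of the scheme: the function name `choose_mode` is hypothetical, and the 100 ms cut-off is an assumed stand-in for "a few tens of milliseconds".

```python
from enum import Enum


class ProcessingMode(Enum):
    STREAMING = "streaming"  # process each event as it arrives
    BATCH = "batch"          # collect first, compute later


# Assumption: an upper bound standing in for "a few tens of milliseconds".
LOW_LATENCY_THRESHOLD_MS = 100


def choose_mode(max_latency_ms: float) -> ProcessingMode:
    """Pick a computation style from the latency the application tolerates."""
    if max_latency_ms <= LOW_LATENCY_THRESHOLD_MS:
        return ProcessingMode.STREAMING
    return ProcessingMode.BATCH


fraud_detection = choose_mode(50)          # needs a "real-time" response
monthly_report = choose_mode(10 * 60_000)  # can wait ten minutes
```

A near real-time category could be added as a third branch with its own assumed threshold.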

By the degree of structure or organization of the data, big data can be mapped to three domains: structured (retail, finance, bioinformatics, geodata), semi-structured (Web logs, e-mail, documents) and unstructured (images, video, sensor data, web pages).

Data can also be classified by the type of industry that needs to examine the data and extract information from it. Most industry application scenarios must handle all types of data structure and cope with all response-time requirements, so in this case the industry domain is an orthogonal axis for describing big data. Some of the most common application scenarios can be used to show these domains intuitively, together with their mapping to industry, time and structure.

If an application needs to respond "immediately" as each event occurs, that is, needs low-latency processing, it generally requires some form of stream computing together with the related computing and storage infrastructure.

If the application can tolerate high latency (for example, it does not need to produce results within a few seconds or minutes), batch-oriented computation can be considered.

The great change brought by "digitization" in recent years is the explosive growth of data volume and the diversification and heterogeneity of data sources. "Big data" has become a defining term of this era. Compared with traditional data, big data can be understood simply as an increase in data volume: once the data scale grows beyond a certain point, traditional or manual methods can no longer acquire, manage, process and organize it into information valuable to people within a reasonable time.

To cope with the challenges of big data, apply data effectively and realize more of its value, specific methods, theories, tools or models are needed. Whether data is processed in batch mode or in real time or near real time as stream data (data that keeps arriving and must be handled immediately) determines the computing mode for big data. Two dedicated infrastructures can be considered: Hadoop for batch processing and Spark for real-time processing. The work involved includes building and benchmarking distributed clusters, implementing a parallel algorithm based on MapReduce, implementing and analyzing collaborative filtering, running and studying classification algorithms, deploying Hive and implementing data operations, all of which improve the ability to use data effectively and realize its value. When big data is introduced into an SOA application, it is important to define the architecture model, and there are two options: horizontal and vertical.

When data poses challenges of volume, velocity and variety, it is necessary to scale horizontally across multiple servers and store the data in a distributed architecture at different locations. That is, distributed processing, distributed databases, cloud storage, virtualization and other cloud computing and hardware-related technologies are relied on to overcome the problem that mass data (especially big data) is difficult to store and cannot be calculated or estimated by the human brain or processed by a single computer. Technologies suitable for big data include massively parallel processing (MPP) databases, distributed file systems, distributed databases, cloud computing platforms, the Internet and scalable storage systems.

A horizontal SOA data strategy can be applied to abstract data suited to big data. A common method is MapReduce, which can be applied to cloud architectures in the form of Hadoop. Hadoop and similar methods distribute, manage and access the data, and then integrate the results of queries over this distributed information. SOA is an abstraction, but when the abstraction hides underlying complexity that affects performance and response time, the risk of the abstraction rises, and the same is true of data access.
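The MapReduce pattern mentioned above can be illustrated with a single-process word count. This sketch only mimics the map, shuffle and reduce phases in plain Python; it does not use Hadoop, and the function names and input documents are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    # Emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all emitted values by key, as the framework would between phases.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    # Combine the values of each key into a single result.
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big value", "data service"])
```

In a real cluster each call to `map_phase` would run on the node that holds the data split, which is what lets the approach scale horizontally.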

Such a design brings with it the challenge faced by all distributed computing systems: the CAP theorem. Common database models each have their own characteristics and each makes a different trade-off between consistency and availability in the CAP theorem. The standard criterion for databases is the set of ACID properties; ACID focuses on consistency and is the traditional approach taken for relational data. To meet the needs of the Internet and of storage models based on cloud computing, another design concept has emerged, called BASE: basically available, soft state and eventual consistency. Most NoSQL systems are based on the BASE principle, that is, they make a choice between C and A in the CAP theorem. Architects need to consider the balance between abstraction and performance carefully and optimize for their specific business requirements.

Beyond the different technologies for storing data and achieving fault tolerance, performance benchmarks must also cope with the fact that in the real world data is accessed and stored in all kinds of ways in all kinds of scenarios. In a fast-developing industry it is generally hard to characterize performance explicitly, because performance depends heavily on the query engine used, the storage architecture and the way the data is stored.

A data service is implemented by using an architecture (tool or platform) to assist or replace humans in the storage, management, analysis and application of data at different levels, forming multi-level data services (especially big data technology services) that process data efficiently, use it effectively and realize its value. These mainly include: basic services based on data processing and its applications; intermediate services based on big data analysis and data mining; advanced services for the in-depth application of data; and still more advanced services.

The basic services based on data processing and its applications mainly comprise data preprocessing, data processing, data management and their applications, and they lay the groundwork for more advanced data services.

Some applications need several data preprocessing (data processing) steps on the relevant data, such as general arithmetic operations and processing, including data acquisition, data conversion, data grouping, data organization, data calculation, data storage, data retrieval and data sorting.

Data processing is the process of converting data into information. It mainly processes and organizes data in its various forms, and covers the whole course of evolution and derivation: collection, storage, processing, classification, merging, calculation, sorting, conversion, retrieval and dissemination of data. The calculations involved are generally straightforward, but the processing and calculation differ from business to business (for example data from WeChat, microblogs and app clients, data from Internet of Things terminals, or data of the preprocessing system itself), so applications have to be written according to business needs.

Data management is the process of effectively collecting, organizing, storing, maintaining, retrieving, transmitting, processing and securing data on the basis of data processing. Its purpose is to guarantee that data is used fully and effectively, and the key to effective data management is data organization.

The best way to organize and manage data effectively is to use management software or a system architecture that is general, comprehensive, easy to extend, convenient to use and efficient, so that the data and its processing are managed effectively.

The Hadoop ecosystem (or ecosystems of this kind in general) consists essentially of mature, reliable tools created to handle data beyond the processing capacity of a single machine. Its tools for efficient storage, fast computation, fast query, random query, join query and so on, used alone or in combination, can solve the problems that arise in the preprocessing, processing and management of mass data (especially big data), provide the basic data services and lay the groundwork for more advanced data services.

The intermediate services based on big data analysis and data mining mainly cover several levels: data analysis, big data analysis, statistical techniques, data mining and big data visualization. Spark and Hadoop are essentially two computing frameworks created to improve the efficiency of distributed computing. Both can support the data services of this stage and lay the groundwork for more advanced data services; they differ only slightly in the scenarios they suit.

Apache Spark was originally a fast, general-purpose computing engine designed for large-scale data processing. It has since grown into a rapidly developing and widely used ecosystem that provides a complete set of tools for big data processing. Users working on large data sets do not need to consider the underlying infrastructure at all. Spark can help users with data acquisition, querying, processing and machine learning, and can even be used to build abstract distributed systems, so it can serve the integrated processing and application of big data.

Data analysis has several levels. Descriptive analysis includes the mean, standard deviation, year-on-year and period-on-period (link-relative) growth rates, quantiles, the mode and so on. Mathematical statistical analysis includes sampling estimation, hypothesis testing, analysis of variance and so on. Big data analysis requires machines to have analytical ability. Its core aim is to uncover hidden regularities in data that is huge in volume and varied in structure, so that the data delivers the greatest possible value. It can also find templates or rules worth referring to in mass data and convert them into valuable information, insight or knowledge, creating new value. It emphasizes the application of algorithms: machine learning algorithms can analyze large, multidimensional collected data for insight in an automatic and scalable way. Machine learning broadly refers to the ability of a computer to learn patterns from data automatically and draw inferences. Algorithms can be classified along many dimensions, for example into supervised learning, semi-supervised learning, reinforcement learning and unsupervised learning.
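The descriptive measures listed above can be computed with the Python standard library. The sketch below uses made-up monthly figures, and `growth_rate` is a hypothetical helper covering both the year-on-year and the period-on-period comparison.

```python
import statistics

# Illustrative, made-up monthly sales figures.
monthly_sales = [100, 120, 120, 150, 180, 230]

mean = statistics.mean(monthly_sales)
std_dev = statistics.stdev(monthly_sales)             # sample standard deviation
quartiles = statistics.quantiles(monthly_sales, n=4)  # three cut points
mode = statistics.mode(monthly_sales)


def growth_rate(current: float, previous: float) -> float:
    """Relative change from a previous period to the current one."""
    return (current - previous) / previous


# Period-on-period (link-relative): each month against the month before it.
link_relative = [
    growth_rate(cur, prev) for prev, cur in zip(monthly_sales, monthly_sales[1:])
]

# Year-on-year: the same period one year earlier (assumed figure of 100).
year_on_year = growth_rate(monthly_sales[1], 100)
```

The same helper serves both comparisons; only the choice of the earlier period differs.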

The data involved in big data analysis is often not only huge in volume and multidimensional; computation also needs particular attention. Machine learning algorithms must pay attention to computational issues that conventional statistical methods ignore, and statistical techniques are in many respects regarded as a subset of machine learning techniques.

Algorithms classified in the manner of machine learning suit simple, structured data sets. Complex, unstructured data sets need further definition before value can be captured from them; that is, a full set of techniques for inference and prediction from data is needed, which can be called data mining. Data mining analyzes each item of data and finds regularities in mass data. Its main steps are data preparation, rule finding and rule representation, and it may even include visualization techniques for inference and prediction. It draws heavily on machine learning, data analysis and data management. In the data mining process, machine learning algorithms are used as tools for extracting potentially valuable patterns from data sets. The main methods are cluster analysis, classification analysis (decision trees, neural networks, support vector machines, random forests), association rules, collaborative filtering, anomaly analysis, special group analysis and evolution analysis.

Diverse categories of data such as time series data, stream data, sequence data, graph data, spatial data and multimedia data map easily to specific application scenarios in different vertical industry segments; the financial industry, for example, uses time series data, and social networks use graph data. Some categories may cover several kinds of data; large-scale scientific computing, for example, can involve all data types and mining algorithms. The diversity of data can be divided into several broad categories, and all kinds of machine learning algorithms can be adapted and/or strengthened in various ways to apply to many unstructured data sets. Data mining therefore also has the characteristic of diversity.

Data mining and data analysis are closely connected and stand in a cyclic, recursive relationship. The results of data analysis need further data mining before they can guide decisions, and the value assessment carried out in data mining may require prior constraints to be adjusted and data analysis to be performed again. Data analysis is the tool that turns data into information, and data mining is the tool that turns information into knowledge. Extracting a regularity (that is, knowledge) from data generally requires data analysis and data mining to be used together.

Classification is a very important method of data mining. On the basis of existing data, a classification function is learned or a classification model (a classifier) is constructed. The function or model can map a record in the database to one of the given classes, and can therefore be applied to prediction. Classification can also be understood simply as the learning process that maps a data sample to a predefined class: given a set of input attribute vectors and their corresponding classes, the classification is obtained with an induction-based learning algorithm. There are several steps: building the classifier from the training set, assessing the prediction accuracy of the classifier, and then predicting the class label of new data.
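The three steps just described (build a classifier from a training set, assess its accuracy, predict the label of new data) can be sketched with a nearest-centroid classifier. This technique is chosen here only because it fits in a few lines; the source does not prescribe it, and the data points and labels are invented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def train(samples: List[Sample]) -> Dict[str, List[float]]:
    """Step 1: build the classifier, here one mean vector per class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, x in enumerate(features):
            sums[label][i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Step 3: assign the class whose centroid is closest."""
    def squared_distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids: Dict[str, List[float]], samples: List[Sample]) -> float:
    """Step 2: share of held-out samples that are labeled correctly."""
    hits = sum(predict(centroids, f) == label for f, label in samples)
    return hits / len(samples)


training_set = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
                ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]
test_set = [((2.0, 1.0), "low"), ((7.5, 8.0), "high")]

model = train(training_set)
test_accuracy = accuracy(model, test_set)
new_label = predict(model, (8.5, 9.5))
```

Keeping the test set separate from the training set is what makes the accuracy figure an estimate of how the classifier behaves on new data.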

"Classifier" is the general term for methods of classifying samples in data mining, and includes algorithms such as decision trees, logistic regression, naive Bayes and neural networks. Prediction can cover both the prediction of data values and the prediction of class labels, but usually refers to value prediction. Classification predicts the class label of a data object, whereas prediction estimates missing or unknown values. Classification and prediction can be assessed by five main criteria: accuracy, speed, robustness, scalability and interpretability.

The process of a recommendation application can be subdivided into several parts: understanding the business, analyzing the data, preprocessing the data, feature extraction, model construction, model evaluation, model deployment, application, and assessing the effect of the application. The whole process has to be repeated continually and the model adjusted continually until the desired effect is reached.

In enterprise data applications, scientific data analysis of the key links of the business value chain can help improve insight and build a differentiated competitive advantage. The first choice of big data application for enterprises is big data marketing based on customer behavior analysis, followed by product innovation, risk prediction, supply chain management, customer service and so on. Big data marketing includes: cross-selling (product recommendation) based on analysis of customer transaction behavior, community marketing based on analysis of customers' social behavior, advertisement placement based on data analysis, trend prediction and trend marketing based on community hot spots, pricing based on data analysis, and customer churn prediction based on abnormal customer behavior.

Big data visualization takes several forms: spatial layout visualization, abstract/summary visualization and interactive/real-time visualization. Spatial layout visualization is a way of mapping a data object to a specific point in a coordinate space; its starting point is that human cognition readily understands information organized in space. Line charts, bar charts, scatter plots and so on can be used. When complex relationships in the data need to be shown and the data objects have a hierarchical structure, they are generally visualized as tree diagrams. There is also graph or network topology visualization, which draws interesting insights from the data using nodes and edges, for example with force-directed graph drawing algorithms. Abstract/summary visualization processes and abstracts or summarizes mass data before the visualization routine renders it, for example by binning the data into histograms or presenting it as a data cube. Many clustering algorithms, for instance, extend this binning-based view of data with new concepts, and the new advantage is that data is shown in a more compact, dimension-reduced way. Interactive/real-time visualization meets users' interaction needs in real time and supports their real-time exploration of the data; even complex visualization mechanisms must complete within one second. It lets users quickly discover important insights in the data, which is critical for industries driven by data insight analysis.
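Abstract/summary visualization can be illustrated by binning values into a histogram before anything is drawn. This is a minimal text-only sketch; the helper names `histogram` and `render` and the readings are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def histogram(values: Iterable[int], bin_width: int) -> Dict[int, int]:
    # Summarize first: map each value to the lower edge of its bin and count.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def render(bins: Dict[int, int], bin_width: int) -> List[str]:
    # Draw only the summary, one text bar per non-empty bin.
    return [f"{start}-{start + bin_width - 1}: {'#' * n}" for start, n in bins.items()]


readings = [3, 7, 12, 15, 18, 21, 22, 29, 41]
bins = histogram(readings, bin_width=10)
chart = render(bins, bin_width=10)
```

However large the input, the renderer only ever sees one number per bin, which is the compact, reduced form the paragraph above refers to.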

Data security is both a focus and a difficulty of data management. It concerns not only personal privacy and enterprise business secrets; data security also affects national security. Data security has many aspects, such as system security, technical security, operational security, storage security, transmission security, and product and service security. System security treats the symptoms while technical security addresses the root cause, and the other aspects are also indispensable links. When the system is decentralized and the data has security requirements such as irreversibility, tamper resistance and anonymity, blockchain technology can be used.

The security and privacy challenges of big data can be broken down into infrastructure security, data privacy, data management, and integrity and reactive security within the big data ecosystem. Designing security protection for the infrastructure of a big data system involves protecting both distributed computation and data storage. Protecting the data itself is also important: privacy must be protected while information is disseminated, and sensitive data must be protected through encryption and fine-grained access control. Managing mass data with scalable, distributed solutions requires not only protecting the data storage but also effective auditing and investigation of data provenance. Finally, to ensure the security of the infrastructure, integrity checks must be run on the data streams coming from many endpoints so that security incidents can be analyzed in real time. Security and privacy therefore run through every level of data service, and the security problems that data brings in different situations need to be handled specifically.

The advanced services for in-depth data application include application services in which neural network technology, especially deep learning, is applied to big data, and they lay the groundwork for still more advanced data services. Big data provides enough data (empirical data) to give computer systems the chance to become more and more intelligent and to understand people better; this can be called big data thinking. A neural network is trained with data: its vast number of parameters are all randomly initialized when training starts, and at that point it cannot complete the recognition task. Only by being fed data continuously and trained pass after pass does it reach the final recognition model. Data is the key to training: the model can only be good if the data volume is large, yet too much data makes training take very long (depending on the hardware system). In supervised training the training data is labeled. Unsupervised training has only data and no labels, which suits big data and massive real-time data very well. Being able to recognize without labels would bring the machine to the ideal state of understanding and cognition, and could be used to recognize objects at very large scale in the real world. Training directly on unlabeled data learns the rules, patterns and features in large-scale data without manual intervention in the model, captures what the human brain usually cannot extract and abstract directly, fully mines hidden value, and solves practical problems in the service of people. Many deep learning algorithms are semi-supervised learning algorithms, used to handle large data sets in which part of the data is unlabeled.
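The training loop described above (random initialization, then repeated passes over the data) can be shown on the smallest possible case. The sketch below trains a single sigmoid neuron with supervised, labeled data on the logical OR function; the learning rate, epoch count and seed are arbitrary assumptions, and a real deep network would have many layers and far more parameters.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples: List[Sample], epochs: int = 2000,
                 learning_rate: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    # Parameters start out random, so the untrained neuron is not yet useful.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(len(samples[0][0]))]
    bias = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):                  # one pass over the data per epoch
        for features, label in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = output - label           # gradient of the log loss w.r.t. z
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def classify(weights: List[float], bias: float, features: Sequence[float]) -> int:
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Labeled (supervised) training data for logical OR.
or_samples = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
weights, bias = train_neuron(or_samples)
predictions = [classify(weights, bias, features) for features, _ in or_samples]
```

Each additional pass nudges the parameters a little further, which is why more data and more passes lengthen training time.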

A deep neural network extracts features layer by layer, and the computer extracts them from the data automatically; no human intervention in the extraction process is needed. Deep learning training is relatively complex: a neural network with many layers is built on a computer, only the number of layers needs to be specified and no concrete parameters need to be given, and the computer learns the final network parameters automatically by computing over big data. Different network parameters recognize different objects, and the trained network can recognize objects automatically. Once a basic ontology model has been established, it can be extended by machine learning, including extension of ontology concepts and of the relationships between them; it is the foundation of semantic analysis and one of the foundations of decision support, planning applications and enterprise resource scheduling, which is of great practical significance for extending the application of big data. The recognition, information conversion and effective use of data such as sound, symbols, text, images and video are likewise of obvious practical significance, and the corresponding data services can be realized through neural network technology, especially deep neural network technology, so that value is delivered better. SpeeDO can be used to implement advanced data services for in-depth applications, providing APIs for advanced data services based on neural networks, especially artificial intelligence technologies such as deep neural networks (deep learning), and serving as a reserve for more advanced data services.

A deep neural network requires large-scale data training on computers, and unsupervised learning is the more desirable approach. However, the data volume and computation involved in a machine learning model that combines big data with deep neural networks are very large, so model training must be accelerated by parallel computation on a distributed system to reduce the time. Designing the distributed system requires DNN algorithm experts and systems experts to work together: the algorithm may need to be modified to match the underlying hardware architecture, and systems experts need to design both single machines with powerful computing capability and servers with high-density integration and efficient communication. The resulting challenges exist at several levels (algorithm, application and hardware system design) and must be solved together across acceleration, hardware system construction and large-scale application. To implement a DNN-based solution, the hardware system and upper-level systems on which the data (especially big data) depends can be built first, then the DNN model or algorithm is implemented at the software level, and finally applications of different scales and scenarios are realized, such as speech and image recognition.
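One common way to accelerate training by parallel computation is data parallelism: the data is split into shards, each worker computes a gradient on its own shard, and the gradients are averaged before the shared parameters are updated. The sketch below is a hedged illustration of that general technique only; the source does not say which parallel scheme is used, threads on one machine stand in for the nodes of a distributed system, a one-parameter linear model stands in for a DNN, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(w, shard):
    """Gradient of the mean squared error of y ~ w * x, computed on one shard only."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def parallel_step(w, shards, lr, pool):
    """One synchronous data-parallel step: per-shard gradients, then a weighted average."""
    grads = list(pool.map(lambda shard: shard_gradient(w, shard), shards))
    total = sum(len(shard) for shard in shards)
    avg_grad = sum(g * len(shard) for g, shard in zip(grads, shards)) / total
    return w - lr * avg_grad


def train(shards, steps=100, lr=0.05):
    """Run the parallel steps; the workers here are threads standing in for cluster nodes."""
    w = 0.0  # fixed start so the example is reproducible
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(steps):
            w = parallel_step(w, shards, lr, pool)
    return w


# Two shards of samples drawn from y = 3 * x.
SHARDS = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]

if __name__ == "__main__":
    print(train(SHARDS))
```

Weighting each shard's gradient by its size makes the averaged gradient equal to the gradient over the full data set, so the parallel run follows the same trajectory as a single-machine run. In a real cluster, the cost of exchanging gradients between nodes is the communication overhead the paragraph above refers to.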

Hardware system design can take several directions: an implementation based on CPU servers (clusters) or a system based on CPU+GPU can be used, a solution based on APUs can be adopted, and exploration and research can also start from lower-level hardware, for example implementation on FPGAs or the design of application-specific integrated circuits. An APU can provide an ideal high-density, low-power hardware environment and can resolve the bottlenecks of insufficient memory and excessive inter-node communication cost that CPU+GPU clusters face with big data. Machine thinking methods need the support of machine hardware evolution; for example, based on the way a deep neural network learns machine image recognition, a machine evolution method is proposed, and this is regarded as the capability by which the hardware system possesses artificial intelligence.

When there is enough data, a computer can reach conclusions and give more reliable results without understanding the logic of the problem or the specific causal relationships, which offers an entirely new way of understanding the world. This is a mode of thinking that sets aside causality in favor of correlation, and its core is prediction in the broad sense. As data keeps growing, its accumulation passes from quantitative change to qualitative change and becomes data-driven: the data itself (rather than the algorithms and models applied to it) can guarantee the validity of analysis results and yield conclusions closer to the facts. Data can therefore become a new productive force, and the key no longer lies in the tools.
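Working from correlation rather than causality can be made concrete with a standard statistic. The sketch below computes the Pearson correlation coefficient between two series; it is offered only as one familiar measure of correlation, since the source does not name a specific one, and the sample figures are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical figures: daily page views and daily orders for one product.
    views = [120, 150, 170, 210, 260]
    orders = [11, 14, 15, 20, 25]
    print(round(pearson(views, orders), 3))
```

A coefficient close to 1 or -1 supports prediction of one series from the other without saying anything about why the two move together, which is exactly the trade the paragraph above describes.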

With the development of various technologies, more and more big data technologies place much lower requirements on data structure. The multi-dimensional information that people leave behind, such as social information, geographic location, behavioral habits and preferences, can be processed in real time; data recording human behavior is used for analysis to the greatest possible extent, and the characteristics of each individual are outlined fully and in three dimensions. Most of this human-related data can be used to solve people's problems, for example by establishing people-oriented personal data centers so that the corresponding services for people are improved. In some respects the point of data, especially big data, is not that it is "big" but that it is "useful": its core value lies in creation, in filling the many gaps that have never been filled, and its value content and application cost are far more important than its quantity. The key to profit in the data industry, especially the big data industry, is the "ability to process" data; "processing" adds value to data, for example by crossing and integrating big data from different industries, which may produce new insight. As data keeps growing, and when there is enough of it, its structure is more diverse, and data from different industries can be crossed and used together, the hidden creative value of big data becomes greater and needs to be brought into better play in order to provide more advanced data services: beyond commercial marketing and people-oriented big data services, big data can be better applied to scientific and technological problems and to solving social problems of all kinds, so that data delivers broader and greater value.

When the Internet of Things reaches a certain scale, products can be uniquely identified by barcodes, QR codes, RFID and the like, and technologies such as sensors, wearable devices, intelligent sensing, video capture and augmented reality enable real-time information collection and analysis. These data can support the needs of smart cities, smart transportation, smart energy, smart healthcare, smart environmental protection, intelligent manufacturing, robotics, smart commerce, smart homes and similar concepts, and these "smart" applications will become the main source and service scope of big data.

The data services at the various levels can each serve as the foundation for higher-level data services, can be combined according to application requirements, and can also reduce the boundaries between levels through layer-by-layer reserves.

The method of this scheme for effectively realizing data value takes services as its core. Applications can be assembled to provide richer, more purposeful business processes that more faithfully reflect integration with the business model. Services are wrapped into a series of APIs and opened up for third parties to use. Once open APIs are provided, third-party developers can be attracted to develop business applications on the platform; the platform provider gains more traffic and market share, and third-party developers can start a business quickly and easily without large investments in hardware and technology, so that both sides win. Open APIs, large-platform development and sharing let a developer build a valuable application at lower cost and with a greater chance of success. Applications under this mechanism can be changed merely by adjusting the service pattern rather than being forced into large-scale development of new application code. Applications can use data services more easily, and data can be used more effectively and becomes an important factor of production in every industry and business function. By influencing economic life, political games, social management, scientific and technological progress, culture and education, scientific research, healthcare and leisure, data will come into closer contact with everyone (for example in commercial marketing and people-oriented big data services; big data will also be better applied to scientific and technological problems and to solving social problems of all kinds). Data will merge and develop together with people's ways of life (clothing, food, housing, transport, health, consumption, leisure) and with enterprises' business models, IT architecture, culture and organizational structure, so that data truly becomes a valuable asset and its value is delivered more effectively in many respects. With this method, enterprises that provide products or services to large numbers of consumers can achieve precision marketing, quickly identify premium customers among a large customer base, and avoid fraud using click-stream analysis and data mining; small-and-beautiful mid- and long-tail enterprises can easily make the transition to services; traditional enterprises forced to transform under pressure from the Internet can keep pace with the times and make full use of the value of big data; equipment suppliers can analyze the root causes of failures, problems and defects in time, potentially saving enterprises billions of dollars every year; logistics enterprises can plan real-time routes for thousands of delivery vehicles and avoid congestion; warehousing enterprises can analyze all SKUs (stock keeping units) and price and clear inventory with the goal of maximizing profit; manufacturing and other industrial enterprises can use industrial big data to save energy and reduce consumption, further improve product quality and raise efficiency, providing valuable services for industrial development and for the transformation and upgrading of industrial enterprises. Machines can rely on outstanding computing power and deep learning methods, trained on massive data, to raise their level of intelligence markedly and hasten the era of intelligent robots, making it possible to convert the information in the human brain into computer information, to unlock the secret of the origin of the universe, and to let human understanding of all things return to its origin.
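The service-centered idea described above, in which functional units are published behind a defined contract and applications are changed by recombining services rather than writing new code, can be sketched as follows. This is a minimal in-process illustration under stated assumptions: the contract is simply "a callable that takes a list of records and returns a list of records", and `ServiceRegistry`, `publish`, `call` and `pipeline` are hypothetical names, not an API defined by this method.

```python
from typing import Callable, Dict, List

# Assumed contract: every data service maps a list of records to a list of records.
Service = Callable[[List], List]


class ServiceRegistry:
    """Hypothetical registry through which services are published and invoked by name."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def publish(self, name: str, service: Service) -> None:
        """Expose a functional unit under a name (the 'open API' of this sketch)."""
        self._services[name] = service

    def call(self, name: str, data: List) -> List:
        """Invoke one published service through the common contract."""
        return self._services[name](data)

    def pipeline(self, names: List[str], data: List) -> List:
        """Compose services in order; changing the list changes the application."""
        for name in names:
            data = self.call(name, data)
        return data


def clean(records: List) -> List:
    """Basic-level service: drop missing values."""
    return [r for r in records if r is not None]


def sort_records(records: List) -> List:
    """Basic-level service: order the records."""
    return sorted(records)


registry = ServiceRegistry()
registry.publish("clean", clean)
registry.publish("sort", sort_records)

if __name__ == "__main__":
    print(registry.pipeline(["clean", "sort"], [3, None, 1]))  # prints [1, 3]
```

A third party that only knows the service names and the contract can build a new application by choosing a different sequence, which is the sense in which the service pattern is adjusted instead of new application code being developed.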

Claims

1. A method for effectively realizing data value, in which, by means of a standardized method or model, different functional units (also called services) are connected through well-defined interfaces or contracts to form a system, characterized in that the different functional units (or services) are the individual processes of data processing and application.

2. The method for effectively realizing data value according to claim 1, characterized in that the standardized method or model has several key points: application software is designed from the perspective of service integration, reuse of existing services is considered, and open-source, replaceable technologies and methods are preferred; a complete set of development patterns is described to help applications connect to services, these patterns defining a series of mechanisms for describing services, advertising and discovering services, and communicating with services; a service can be defined as a function and at the same time be defined externally as an object, an application and so on, so as to fit any existing system; with services as the core, applications are assembled to provide richer, more purposeful business processes that more faithfully reflect integration with the business model; services are combined with business and processes to represent the business model more accurately and to support business processes better; processes give life to the components of the business model, define the relationships between them more clearly, and define dedicated methods for interacting with the business model, ensuring that calls between modules, and between applications and modules, are easy; services built on a variety of such systems can interact in a unified and general way; the method or model may be SOA, or SOA may be integrated with the system architecture so that the two merge better and reinforce each other.

3. The method for effectively realizing data value according to claim 1, characterized in that the interfaces or contracts can be defined in a neutral manner, by methods or mechanisms independent of the hardware platform, operating system and programming language that implement the service; for example, services are wrapped into a series of APIs, which can also be opened up for third parties to use (that is, open APIs are provided).

4. The method for effectively realizing data value according to claim 1, characterized in that, after each process of the data processing and application forms a data service, it can be used by more applications, so that data value is realized more effectively; implementing a data service first requires the data domain to be made explicit, that is, the data to be classified, which is the basis of every concrete data technology implementation and helps in selecting and matching processing methods and infrastructure (tools or platforms) and in identifying the requirements each class of data places on them; the data type also determines which architecture (tool) is needed to store, process, analyze and apply the data; the architecture can be an open-source or commercial system, framework, library or toolkit; the data domain can be defined by the time span required to process the data, big data can be mapped to domains by the degree of structure or organization of the data, and the domain can also be delimited by the type of industry that creates the data and needs to extract information from it; when a data technology (especially big data technology) platform is selected, a main factor is the latency requirement: if the application needs to respond "immediately" when each event occurs, some form of stream computing and its associated computing and storage infrastructure may be chosen, otherwise batch-oriented computing is considered; big data has arisen with the explosive growth of data volume and the diversification and heterogeneity of data sources and, compared with traditional data, cannot simply be understood as an increase in volume: beyond a certain size it can no longer be acquired, managed, processed and organized into information valuable to people within a reasonable time by traditional or manual methods, and specific methods, theories, tools or models are required; whether processing is in batch mode or is real-time/near-real-time processing of streaming data (data that arrives continuously and must be handled immediately) determines the computing model for big data, and two dedicated infrastructures can be considered: Hadoop for batch processing and Spark for real-time processing; when the data poses challenges of volume, velocity and variety, it needs to be stored in distributed architectures at different locations, relying on hardware-related technologies such as the distributed processing, distributed databases, cloud storage and virtualization of cloud computing to overcome the problem that massive data (especially big data) is hard to store and cannot be calculated or estimated by the human brain or processed by a single computer; the balance between abstraction and performance must be considered carefully and optimized for the specific business requirements; storage performance also depends on the query engine used, the storage architecture and the way the data is saved.

5. The method for effectively realizing data value according to claim 4, characterized in that the data services are implemented as multi-level data services (especially big data technology services) that use architectures (tools or platforms) to assist or replace people in the efficient processing, effective use and value realization of data at different levels, such as storage, management, analysis and application; the levels can each serve as the foundation for higher-level data services, can be combined according to application requirements, and can reduce the boundaries between levels through layer-by-layer reserves; they mainly comprise: basic services based on data processing and its application, intermediate services based on big data analysis and data mining, high-level services for in-depth data application, and more advanced services; the basic services mainly comprise data preprocessing, data processing, data management and their application, and serve as a reserve for more advanced data services; data preprocessing refers to the general arithmetic operations and processing that some applications need to perform on the relevant data, such as data acquisition, data conversion, data grouping, data organization, data computation, data storage, data retrieval or data sorting; data processing is mainly the processing and organizing of data in its various forms, covering the whole process of evolution and derivation including collection, storage, processing, classification, merging, computation, sorting, conversion, retrieval and dissemination; the computation in the processing is generally simple, while the processing and computation differ from business to business and are solved by writing application programs according to business needs; data management is the process of effectively collecting, organizing, storing, maintaining, retrieving, transmitting and processing data and safeguarding data security on the basis of data processing, its purpose being to provide a comprehensive guarantee for the full and effective use of data; the key to effective data management is data organization, and the best approach is to use a general, comprehensive, easily extended, easy-to-use and efficient management software or system architecture; all problems of the basic services based on data processing and its application can be solved by using, combining or matching the tools of the Hadoop ecosystem (or a generalized ecosystem) for effective storage, fast computation, fast query, random query, associated query and so on, and the result serves as a reserve for more advanced data services.

6. The method for effectively realizing data value according to claim 5, characterized in that the intermediate data services mainly comprise data analysis, big data analysis, statistical techniques, data mining, big data visualization and the like; both Spark and Hadoop can meet the needs of intermediate data services based on big data analysis and data mining and serve as a reserve for more advanced data services, differing only slightly in their applicable scenarios; data analysis has several levels: descriptive analysis includes the mean, standard deviation, year-on-year and period-on-period growth rates, quantiles, the mode and the like; mathematical statistical analysis includes sampling estimation, hypothesis testing, analysis of variance and the like; big data analysis requires the machine to have analytical capability, mainly using machine learning algorithms to mine hidden rules from data that is huge in volume and varied in structure so that the data delivers maximum value, and it can also find templates or rules worth referencing in massive data and convert them into valuable information, insight or knowledge, creating more new value; machine learning algorithms can analyze collected large-scale, multi-dimensional data for insight in an automatic and scalable way, the term broadly referring to the ability of a computer to learn patterns from data automatically and draw inferences, and the algorithms can be classified along many different dimensions; the data involved in big data analysis is often not only huge and multi-dimensional, and machine learning algorithms must also attend to computational problems that conventional statistical formulas ignore; data mining infers, predicts and extracts value from complex, unstructured data sets, finding rules in massive data by analyzing each datum; it mainly comprises the steps of data preparation, rule finding and rule representation and, owing to the inference and prediction involved, may even include visualization techniques; it draws on a large number of techniques from machine learning, data analysis and data management; machine learning algorithms are used as tools for extracting potentially valuable patterns from data sets; the main methods are cluster analysis, classification analysis (decision trees, neural networks, support vector machines, random forests), association rules, collaborative filtering, anomaly analysis, special-group analysis, evolution analysis and the like; data mining is characterized by diversity: the classification of diverse data such as time-series data, streaming data, sequence data, graph data, spatial data and multimedia data maps readily onto specific application scenarios in different vertical industry segments, and the various machine learning algorithms can be adapted and/or strengthened in many ways for application to a variety of unstructured data sets; data mining and data analysis are closely linked in a cyclic, recursive relationship: the results of data analysis need further data mining before they can guide decisions, and the value assessment carried out in data mining requires the prior constraints to be adjusted and data analysis to be performed again; data analysis is the tool that turns data into information, and data mining is the tool that turns information into cognition, and extracting a rule (a cognition) from data generally requires the two to be used together; a very important data mining method is classification, which learns a classification function or builds a classification model (a classifier) on the basis of existing data, maps data records in a database to one of the given classes, and is thereby applied to data prediction; it has several steps: building the classifier from a training set, evaluating the classifier's prediction accuracy, and then predicting the class labels of new data; classifiers include algorithms such as decision trees, logistic regression, naive Bayes and neural networks; prediction can involve predicting data values and predicting class labels, but usually refers to value prediction; classification predicts the class label of a data object, whereas prediction estimates missing or unknown values; big data visualization takes several forms: spatial-layout visualization, abstract/summary visualization, and interactive/real-time visualization; the process of a recommendation application can be subdivided into several parts: understanding the business, analyzing the data, data preprocessing, feature extraction, model building, model evaluation, model deployment, application, evaluation of application results and so on, the whole process being repeated and the model adjusted continually until the desired result is achieved; in enterprise data applications, scientific data analysis of the key links in the business value chain can help improve insight and establish a differentiated competitive advantage; the big data application that enterprises choose first is big data marketing based on customer behavior analysis, followed by product innovation, risk prediction, supply chain management, customer service and the like.

7. The method for effectively realizing data value according to claim 5, characterized in that the data security has many aspects, such as system security, technical security, operational security, storage security, transmission security, and product and service security; when the system is decentralized and the data has security requirements such as irreversibility, tamper resistance and anonymity, blockchain technology can be used; the security and privacy challenges of big data can be decomposed into infrastructure security, data privacy, data management, and integrity and reactive security of the big data ecosystem; security and privacy therefore run through every level of data service, and the security problems raised by data in different situations must be handled case by case.

8. The method for effectively realizing data value according to claim 5, characterized in that the high-level services include the application of neural network technology, especially deep learning, to big data, and serve as a reserve for more advanced data services; big data supplies enough data (empirical data) to give computer systems the opportunity to become more and more intelligent and to let machines understand people better, which can be called big data thinking: a neural network is trained from data, and data is the key to training; the trained model can be extended by machine learning, including extension of ontology concepts and of the relationships between them, and is the foundation of semantic analysis and one of the foundations of decision support, planning applications and enterprise resource scheduling; for the recognition, information conversion and effective use of data such as sound, symbols, text, images and video, neural network technology, especially deep neural network (DNN) technology, can realize the corresponding data services and deliver value better; SpeeDO can be used to implement advanced data services for in-depth applications, providing advanced data services based on neural networks, especially artificial intelligence technologies such as deep neural networks (deep learning), and serving as a reserve for more advanced data services; a deep neural network requires large-scale data training on computers, and unsupervised learning is the more desirable approach, but the data volume and computation involved in a machine learning model that combines big data with deep neural networks are very large, so model training must be accelerated by parallel computation on a distributed system to reduce the time; designing the distributed system requires DNN algorithm experts and systems experts to work together, the algorithm may need to be modified to match the underlying hardware architecture, and systems experts need to design both single machines with powerful computing capability and servers with high-density integration and efficient communication; the resulting challenges exist at several levels (algorithm, application and hardware system design) and must be solved together across acceleration, hardware system construction and large-scale application; in a DNN-based embodiment, the hardware system and upper-level systems on which the data (especially big data) depends can be built first, then the DNN model or algorithm is implemented at the software level, and finally applications of different scales and scenarios are realized, such as speech and image recognition.

9. The method for effectively realizing data value according to claim 4 or 8, characterized in that the hardware system design can take several directions: an implementation based on CPU servers (clusters) or a system based on CPU+GPU can be used, a solution based on APUs can be adopted, and exploration and research can also start from lower-level hardware; the APU can provide an ideal high-density, low-power hardware environment and can resolve the bottlenecks of insufficient memory and excessive inter-node communication cost that CPU+GPU clusters face with big data; based on the way the deep neural network learns machine image recognition, a machine evolution method is proposed, machine thinking methods needing the support of machine hardware evolution, and this is regarded as the capability by which the hardware system possesses artificial intelligence.

10. The method for effectively realizing data value according to any one of claims 5, 6 or 8, characterized in that, when the data is sufficient, a computer can reach conclusions and give more reliable results without understanding the logic of the problem or the specific causal relationships, which is a mode of thinking that sets aside causality in favor of correlation and whose core is prediction in the broad sense; as the data keeps growing, its accumulation passes from quantitative change to qualitative change and becomes data-driven, the data itself can guarantee the validity of analysis results and yield conclusions closer to the facts, and data therefore becomes a new productive force; most of the data related to people can be used to solve people's problems, for example by establishing people-oriented personal data centers so that the corresponding services for people are improved; the core value of the data, especially big data, lies in creation, in filling the many gaps that have never been filled, and its value content and application cost are far more important than its quantity; the key to profit in the data industry, especially the big data industry, is the "ability to process" data, and "processing" adds value to data, for example by crossing and integrating big data from different industries, which may produce new insight; as the data keeps growing, and when there is enough of it, its structure is more diverse and data from different industries can be crossed and used together, the hidden creative value of big data becomes greater and needs to be brought into better play in order to provide more advanced data services: beyond commercial marketing and people-oriented big data services, big data can be better applied to scientific and technological problems and to solving social problems of all kinds, so that data delivers broader and greater value; the data can support the needs of smart cities, smart transportation, smart energy, smart healthcare, smart environmental protection, intelligent manufacturing, robotics, smart commerce, smart homes and similar concepts, and these "smart" applications will become the main source and service scope of big data.