WO2000079415A9 - Segmentation and processing of continuous data streams using transactional semantics - Google Patents
Segmentation and processing of continuous data streams using transactional semantics
- Publication number
- WO2000079415A9 (PCT/US2000/016839)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- processing
- continuous stream
- segment
- transactional semantics
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/149—Adaptation of the text data for streaming purposes, e.g. Efficient XML Interchange [EXI] format
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
Definitions
- Computer-based transaction systems generate data relating to transactions performed using those systems. These data relating to transactions are analyzed to identify characteristics of the transactions. From these characteristics, modifications to the transactions and/or associated marketing may be suggested, or other business decisions may be made.
- Computer systems for analyzing data relating to transactions generally access the data stored in a database. After the data has been collected for some period of time, the collected data is added to the database in a single transaction. As discussed, data stored in the database is analyzed and results are produced. The results obtained from the analysis typically represent an aggregation of the data stored in the database. These results are then used, for example, as the basis for various business decisions and are also often stored in a database.
- the raw data relating to transactions are not retained in the database after they are processed.
- processing of data relating to transactions generally is a form of batch processing.
- results are not output until all the data is processed. If, for example, each record associated with a batch were stored in the database in a separate transaction, a significant amount of overhead would be incurred by a database management system associated with the database.
- a large volume of data is read from the database in a single transaction to permit analysis on the data.
- the time between a transaction occurring and the generation of results using data about the transaction may be days or even weeks.
- the transaction data may be segmented and processed in a data flow arrangement, optionally in parallel, and the data may be processed without storing the data in an intermediate database. Because data is segmented and operated on separately, data from multiple sources may be processed in parallel.
- the segmentation may also define points at which aggregate outputs may be provided, and where checkpoints may be established. By partitioning data into segments and by defining checkpoints based upon the segmentation, a process may be restarted at each defined checkpoint. In this manner, processing of data may fail for a particular segment without affecting processing of another segment. Thus, if processing of data of the particular segment fails, work corresponding to that segment is lost, but not work performed on other segments.
- This checkpointing may be implemented in, for example, a relational database system.
- Checkpointing would enable the relational database system to implement restartable queries, thus increasing database performance. This is beneficial for database vendors and users whose success relies on their systems' performance.
- checkpoint processing and recovery can be performed.
- the method comprises steps of receiving an indication of transactional semantics, applying the transactional semantics to the continuous stream of data to identify segments of the continuous stream of data, processing the data in each segment of the continuous stream of data to produce results for the segment, and after the data of each segment of the continuous stream of data is processed, providing the results produced for that segment.
- the data includes a plurality of records, each record includes a plurality of fields, and the transactional semantics are defined by a function of one or more fields of one or more records of the data.
- the method further comprises a step of partitioning the continuous stream of data according to the identified segments.
- the step of partitioning includes a step of inserting a record in the continuous stream of data indicating a boundary between two segments.
- the record is a marker record indicating only a boundary.
- the record is a semantic record including information related to the transactional semantics.
- the continuous stream of data is a log of information about requests issued to a server, and the step of applying comprises steps of reading information relating to a request from the log; and applying the transactional semantics to the read information.
- the information relating to each request includes a plurality of fields, and wherein the transactional semantics are defined by a function of one or more fields of information relating to one or more requests.
- the information includes a time at which the request was issued to the server and wherein the transactional semantics define a period of time.
- the method further comprises a step of filtering the log to eliminate information relating to one or more requests.
- the step of filtering is performed prior to the step of applying the transactional semantics.
- the step of filtering includes a step of eliminating information relating to requests associated with spiders.
- the method further comprises a step of filtering the continuous stream of data to eliminate data from the continuous stream of data.
- the method further comprises an additional step of processing the data in each segment of the continuous stream of data to produce the results for the segment, and after the data of each segment of the continuous stream of data is processed during the additional step of processing, providing the results produced for that segment.
- the step of processing comprises steps of partitioning data in each segment as a plurality of parallel partitions; and processing each of the partitions in parallel to provide intermediate results for each partition.
- the method further comprises a step of combining intermediate results of each partition to produce the results for the segment.
- the data in the continuous stream of data has a sequence, and there are multiple sources of the continuous stream of data, and the method further comprises determining whether data in the continuous stream of data is in sequence; and if the data is determined to be out of sequence, interrupting the step of processing, inserting the data in a segment according to the transactional semantics, and reprocessing the segment and continuing the step of processing.
- the method further comprises saving a persistent indication of the segment for which data is being processed; when a failure in the step of processing is detected, discarding any results produced by the step of processing for the selected segment and reprocessing the selected segment corresponding to the saved persistent indication; and when the step of processing completes without failure, providing the outputs produced as an output and selecting the next segment.
- a process for checkpointing operations on a continuous stream of data by a processing element in a computer system.
- the process comprises steps of receiving an indication of transactional semantics; applying the transactional semantics to the data to partition the continuous stream of data into segments for processing by the processing element; selecting one of the segments; saving a persistent indication of the selected segment; processing the selected segment by the processing element to produce results; when a failure of the processing element is detected, discarding any results generated by the processing element for the selected segment and reprocessing the selected segment corresponding to the saved persistent indication; and when processing by the processing element completes without failure, providing the outputs produced by the processing element as an output and selecting the next segment to be processed by the processing element.
- the step of applying includes inserting data in the continuous stream of data indicating boundaries between segments of the data.
- a computer system for checkpointing operations on a continuous stream of data in a computer system.
- the computer system comprises means for receiving an indication of transactional semantics; means for applying the transactional semantics to the continuous stream of data to partition the data into segments; means for selecting one of the segments; means for saving a persistent indication of the selected segment; a processing element for processing the selected segment to produce results; means, operative after a failure of the processing element is detected, for discarding any outputs generated by the processing element for the selected segment, and means for directing the processing element to reprocess the selected segment corresponding to the saved persistent indication; and means, operative after processing by the processing element completes without failure, for providing the results produced by the processing element and selecting the next segment to be processed by the processing element.
- the means for applying includes inserting data in the continuous stream of data indicating boundaries between segments of the data.
- a method for processing a continuous stream of data.
- the method comprises receiving an indication of transactional semantics; applying the transactional semantics to the continuous stream of data to identify segments of the continuous stream of data; and inserting data in the continuous stream of data indicating boundaries between the identified segments of the continuous stream of data.
- Figure 1 is a dataflow diagram showing a system which processes continuous data according to one embodiment of the invention
- Figure 2 is a flowchart describing operation of how data may be imported from a continuous source of data into a parallel application framework
- Figure 3 is an alternative dataflow diagram showing a system which processes multiple data streams
- Figure 4 is a flowchart describing how data may be processed by a multiple pipeline system
- Figure 5 is a block diagram of a client-server system suitable for implementing various embodiments of the invention.
- Figure 6 is a block diagram of a processing architecture used to process data
- Figure 7 is a block diagram of a two-node system having operators that communicate in a parallel manner.
- a continuous data source 101 provides a continuous stream 102 of data that is processed by the data processing application 107 to provide results 108 according to some transactional semantics 103.
- These transactional semantics 103 may be information that determines how stream 102 should be segmented. Semantics 103, for example, may depend upon some requirement of the system in operating on stream 102 or depend on a business requirement for analyzing data.
- the data is segmented by a segmenter 104 according to the transactional semantics 103 to provide segmented data 105.
- a data processing operator 106 processes data within each segment of the segmented data 105 to provide results 108 for each segment. These processes may be, for example, reading or updating of one or more portions of data in the continuous data stream 102.
- the continuous data source 101 generally provides data relating to transactions from a transaction system.
- the source is continuous because the transaction system generally is in operation for a period of time to permit users to make transactions.
- the continuous data source may be a Web server that outputs a log of information about requests issued to the Web server. These requests may be stored by the Web server as log records into a server log.
- Other example sources of continuous streams of data include sources of data about transactions from a reservation system, point-of-sale system, automatic teller machine, banking system, credit card system, search engine, video or audio distribution systems, or other types of systems that generate continuous streams of data.
- the data relating to a transaction generally includes a record for each transaction, including one or more fields of information describing the transaction.
- the record may be in any of several different formats.
- Data relating to a transaction for example, may have a variable or fixed length, may be tagged or untagged, or may be delimited or not delimited.
- Data relating to a transaction may be, for example, in a markup language format such as SGML, HTML, XML, or other markup language.
- Example constructs for communicating the data from the continuous data source 101 to the data processing application 107 include a character string, array or other structure stored in a file, a database record, a named pipe, a network packet, frame, cell, or other format.
- the continuous stream of data 102 is a server log
- example data relating to a transaction may include a user identifier, a client program and/or system identifier, a time stamp, a page or advertisement identifier, an indicator of how the page or advertisement was accessed, a record type, and/or other information regarding the transaction.
- Transactional semantics 103 define a function of one or more fields of one or more records of the continuous stream of data 102.
- transactional semantics 103 may define a period of time, e.g., an hour, so that all data within a one hour period are placed in one segment.
- Transactional semantics 103 may also define an aggregate function of several records, e.g., a total volume of sales, rather than a function of a single record, such as a time.
- Such transactional semantics 103 may also be derived from business rules indicating information to be obtained from analysis of the data.
- Transactional semantics 103 may also depend on some system requirement. This analysis may be performed, for example, on a per-segment basis to enable a business decision.
- Transactional semantics 103 are applied by the segmenter 104 to the continuous stream of data 102 to identify segments in the continuous stream of data 102.
- the continuous stream of data 102 may be partitioned according to these identified segments in many ways.
- a record may be inserted into continuous stream of data 102 indicating a boundary between two segments in the stream of data.
- the record may be a marker record that indicates only a boundary.
- a tag may be placed in all records such that a marker record has one value for the tag and data records have another value for the tag.
- the record may be a semantic record that includes information related to the transactional semantics, such as the transactional semantics themselves or some information derived by applying the transactional semantics to the data, such as a specification of a period of time.
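- A minimal sketch of such a segmenter is shown below, assuming the transactional semantics are simply a one-hour window over a timestamp field; the `Record` type, its fields, and the marker-record flag are hypothetical and are not the actual Orchestrate interfaces.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical record: a timestamp plus a flag that marks segment boundaries.
struct Record {
    std::int64_t timestamp = 0;        // seconds since the epoch
    bool is_boundary_marker = false;   // true only for inserted marker records
};

// Insert a marker record whenever the one-hour window (the transactional
// semantics in this sketch) changes between consecutive records.
std::vector<Record> segment_by_hour(const std::vector<Record>& stream) {
    std::vector<Record> out;
    std::int64_t current_window = -1;
    for (const Record& r : stream) {
        std::int64_t window = r.timestamp / 3600;   // one-hour buckets
        if (current_window != -1 && window != current_window) {
            out.push_back(Record{r.timestamp, true});   // boundary between segments
        }
        current_window = window;
        out.push_back(r);
    }
    return out;
}

int main() {
    std::vector<Record> stream = {{100}, {200}, {3700}, {3800}, {7300}};
    for (const Record& r : segment_by_hour(stream)) {
        if (r.is_boundary_marker)
            std::cout << "-- segment boundary --\n";
        else
            std::cout << r.timestamp << "\n";
    }
}
```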
- application 107 may permit multiple data processing operators 106 to access the data segments according to the transactional semantics stored in the data. Any type of information may be used to indicate partitions in stream of data 102.
- Multiple segmenters 104 also may be used to produce different segmented continuous streams of data 105 for which different processing may be performed.
- multiple data processing operators 106 may be used in parallel to perform different analyses on the segmented continuous stream of data 105.
- There are many kinds of operations that may be performed by data processing operator 106. For example, data aggregations such as counts of records, sums of variables within the records, and statistical values such as the mean, maximum and minimum of various data fields may be computed for each data segment. In an application where the continuous stream of data is a server log, it is possible to compute a unique number of users, for example, to whom each item of information has been provided by the server in each segment, or in a combination of segments. Various data processing operators 106 may be added or deleted from the data processing application 107 to provide a variety of different results 108.
- Data processing application 107 may be implemented using the Orchestrate parallel framework from Torrent Systems, Inc. such as described in U.S. Patent Application Serial No. 08/627,801, filed March 25, 1996 and entitled "Apparatuses and Methods for Programmable Parallel Computers," by Michael J. Beckerle et al.; U.S. Patent Application Serial No. 08/807,040, filed February 24, 1997 and entitled "Apparatuses and Methods for Monitoring Performance of Parallel Computing," by Allen M. Razdow, et al.; and U.S. Patent Application Serial No. 09/104,288, filed June 24, 1998 and entitled "Computer System and Process for Checkpointing Operations on Data in a Computer System by Partitioning the Data," by Michael J. Beckerle, and as described in U.S. Patent Number 5,909,681, issued June 1, 1999 and entitled "A Computer System and Computerized Method for Partitioning Data for Parallel Processing," by Anthony Passera, et al.
- parallel data sources are processed in a data flow arrangement on multiple processors.
- each operation to be performed in Figure 1 such as segmentation or data analysis, may be implemented as an operator in the Orchestrate parallel processing framework.
- data processed by the data processing operator is partitioned into a plurality of parallel partitions.
- Each of these parallel partitions is processed in parallel by a different instance of the data processing operator, each of which provides intermediate results for its respective partition.
- These intermediate results may be combined to provide aggregate results for the segment by an operator that performs an aggregation function.
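- A minimal sketch of this pattern, assuming a hypothetical counting operator: the records of one segment are hash-partitioned, one thread per partition produces an intermediate count, and a combining step aggregates the counts for the segment. This is illustrative only, not the Orchestrate operator API.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in for one instance of a data processing operator: it produces an
// intermediate result (here, a record count) for its own partition.
void count_partition(const std::vector<long>& partition, std::size_t& result) {
    result = partition.size();
}

int main() {
    // One segment of records (reduced here to user ids), hash-partitioned.
    std::vector<long> segment = {7, 3, 9, 3, 12, 7, 5, 8, 1, 4};
    const std::size_t kPartitions = 4;
    std::vector<std::vector<long>> partitions(kPartitions);
    for (long id : segment)
        partitions[static_cast<std::size_t>(id) % kPartitions].push_back(id);

    // One thread (operator instance) per partition yields an intermediate result.
    std::vector<std::size_t> intermediate(kPartitions, 0);
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < kPartitions; ++p)
        workers.emplace_back(count_partition, std::cref(partitions[p]),
                             std::ref(intermediate[p]));
    for (std::thread& t : workers) t.join();

    // An aggregation operator combines the intermediate results for the segment.
    std::size_t total = std::accumulate(intermediate.begin(), intermediate.end(),
                                        std::size_t{0});
    std::cout << "records in segment: " << total << "\n";
}
```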
- Data processing operator 106 may be implemented in several ways. In particular, data processing operator 106 generally may process data either in a batch mode or a continuous mode. If the data processing operator 106 performs batch processing, it does not output data until all of the data associated with the batch has been processed. Operator 106 may be controlled by a program executing a continuous loop that provides the data to the operator on a per segment basis.
- Segmented continuous stream of data 105 in any of its various forms, also may be stored as a parallel data set in the Orchestrate parallel framework.
- a parallel data set generally includes a name, pointers to where data actually is stored in a persistent form, a schema, and metadata (data regarding data) defining information such as configuration information of hardware, disks, central processing units, etc., indicating where the data is stored.
- One data set may be used to represent multiple segments, or separate data sets may be used for each segment.
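- As a rough illustration, a descriptor for such a data set might look like the following; the field names are assumptions for this sketch and do not reflect the actual Orchestrate data structures.

```cpp
#include <string>
#include <vector>

// Hypothetical schema entry: a field name and its type.
struct FieldSchema {
    std::string name;
    std::string type;   // e.g. "int64", "string", "timestamp"
};

// Hypothetical location of one partition's data on some node.
struct PartitionLocation {
    std::string node;   // host (or CPU/disk) that owns the partition
    std::string path;   // file or named pipe where the data persists
};

// Hypothetical parallel data set descriptor: a name, a schema, and, per
// segment, the locations of the parallel partitions holding the data.
struct ParallelDataSet {
    std::string name;
    std::vector<FieldSchema> schema;
    std::vector<std::vector<PartitionLocation>> segments;
};
```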
- the continuous stream of data 102 may be imported into a data set in the application framework from the form of storage in which the continuous data source 101 produces the continuous stream of data 102.
- the continuous data source 101 may be an HTTPD server that produces data regarding requests received by the HTTPD server, and the server saves this data into a log.
- a separate application, commonly called a log manager, periodically creates a new log file into which the HTTPD server writes data.
- a new log file may be created for each day.
- Information regarding how the log manager creates log files is provided to a data processing operator 106 such as an import operator which reads the set of log files as a continuous stream of data into a data set in the Orchestrate application framework.
- multiple HTTPD servers may write to the same log file in parallel. That is, the multiple HTTPD processes generate parallel streams of data which are processed by one or more input operators.
- a multiple input operator may be used to combine these data streams into a single data stream, upon which additional operators may operate.
- the importation process 200 relies on source identification information received in step 201.
- This identification information identifies a naming convention for data files, named pipes or other constructs used by the continuous source of data 102.
- a named construct is then selected in step 202 according to the received source identification information.
- Any next data record is read from the named construct in step 203.
- a verification step also may be performed to verify that the correct named construct was accessed if the construct contains identification information. If the read operation performed in step 203 returns data, as determined in step 204, the data is provided to the next operator in step 208.
- the next operator may be a filtering operation, an operation that transforms the data record into another format more suitable for segmentation and processing, or may be the segmenter.
- Processing continues by reading more data in step 203.
- the importer continuously reads data from the specified continuous data source, providing some buffering between the continuous data source and the data processing application. If data is not available when the read operation is performed, as determined in step 204, it is first determined whether the server is operating in step 205. If the server is not operating in step 205, the system waits in step 209 and attempts reading data again in step 203 after waiting.
- the waiting period may be, for example, random, predetermined, or a combination thereof.
- if the end of the file has not been reached, as determined in step 206, the transaction system may be presumed to be operating normally and merely has not been used to produce data relating to a transaction.
- the importer process 200 may wait for some period of time and/or may issue a dummy record to the next operator, as indicated in step 210, before attempting to read data again in step 203. If the end of file is reached, as determined in step 206, the next file (or other named construct) is selected in step 207 according to the source identification information, after which, processing returns to step 203.
- This process 200 may be designed to operate without interruption to provide data continuously to data processing application 107.
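- A simplified sketch of this import loop is given below; the file-naming convention, the server-status check, and the hand-off to the next operator are stubs assumed for illustration, and the dummy-record path of step 210 is reduced to a plain wait.

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Assumed stubs standing in for the real source identification and checks.
bool server_is_running() { return true; }                         // step 205
std::string log_file_name(int index) {                            // steps 201/207
    return "httpd-" + std::to_string(index) + ".log";              // hypothetical naming
}
void emit_to_next_operator(const std::string& record) {           // step 208
    std::cout << record << "\n";
}

// Simplified version of process 200: read records from a sequence of log
// files and hand each one to the next operator, waiting when no data is
// available and rolling over to the next file at end-of-file.
void import_loop() {
    int file_index = 0;
    std::ifstream log(log_file_name(file_index));                  // step 202
    std::string line;
    while (true) {
        if (std::getline(log, line)) {                             // steps 203/204
            emit_to_next_operator(line);                           // step 208
            continue;
        }
        if (!server_is_running()) {                                // step 205
            std::this_thread::sleep_for(std::chrono::seconds(5));  // step 209
            log.clear();
            continue;
        }
        if (!log.eof()) {                                          // step 206: no data yet
            std::this_thread::sleep_for(std::chrono::seconds(1));  // step 210 (wait only)
            log.clear();
            continue;
        }
        // Step 207: roll to the next file (a real importer would first confirm
        // that the log manager has created it).
        log.close();
        log.open(log_file_name(++file_index));
    }
}
```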
- Segmentation of the continuous stream of data 102 also provides a facility through which checkpointing of operations generally may be performed.
- a persistent indication of a segment being processed may be saved by an operator 106.
- any results produced by the operator 106 for the selected segment may be discarded.
- the segment then may be reprocessed using the saved persistent indication of the segment being processed. If operator 106 completes processing without failure, the outputs produced by operator 106 may be output before the next segment is processed.
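- A minimal sketch of this checkpointing loop, assuming results can be staged in memory and that a small file serves as the persistent indication of the segment in flight; the names and the failure signal are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical per-segment processing; returning false signals a failure
// that should discard the staged results and reprocess the segment.
bool process_segment(const std::vector<std::string>& segment,
                     std::vector<std::string>& staged_results) {
    for (const std::string& record : segment)
        staged_results.push_back("processed:" + record);
    return true;
}

void publish(const std::vector<std::string>& results) {
    for (const std::string& r : results) std::cout << r << "\n";
}

int main() {
    std::vector<std::vector<std::string>> segments = {
        {"a1", "a2"}, {"b1"}, {"c1", "c2", "c3"}};

    for (std::size_t i = 0; i < segments.size(); ++i) {
        // Persistent indication of the segment being processed, so a restart
        // can resume here and redo only this segment.
        std::ofstream("checkpoint.txt") << i << "\n";

        std::vector<std::string> staged;                 // held back until success
        while (!process_segment(segments[i], staged))    // on failure: discard, retry
            staged.clear();

        publish(staged);                                 // segment done: emit results
    }
}
```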
- the segmentation can be used to define partitions for checkpointing, which may be performed in the manner described in "Loading Databases Using Dataflow Parallelism," Sigmod Record, Vol. 23, No. 4, pages 72-83, December 1994, and in U.S. Patent Application Serial No. 09/104,288, filed June 24, 1998 and entitled "Computer System and Process for Checkpointing Operations on Data in a Computer System by Partitioning the Data," by Michael J. Beckerle. Checkpointing also may be performed using a different partitioning than the segmentation based on transactional semantics.
- the importation operation described above in connection with Figure 2 and the segmenter may be implemented as a composite operator to enable checkpointing of the entire data processing application from the importation of the continuous stream of data to the output of results.
- Checkpointing of the import process also may be performed according to the transactional semantics. For example, if a time field is used, the entire step can be checkpointed on a periodic basis, such as one hour, thirty minutes, etc.
- the continuous source of the data may be interrupted, for example due to failures or for other reasons, and may provide data out of an expected sequence.
- the out-of-sequence data may be discarded.
- the out-of-sequence data may be useful.
- the out-of-sequence data are identified and inserted into the appropriate segment and that segment is reprocessed.
- Out-of-sequence data may be detected, for example, by monitoring the status of the continuous source of data 101. When a source of data 101 becomes available, after having been previously unavailable, processing of other segments is interrupted, and the out-of-sequence data from the newly available source is processed.
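- One possible sketch of handling late data, again assuming hourly segments keyed by a timestamp: a late record is routed back into the window it belongs to and that window is flagged so its segment can be reprocessed. The interruption and resumption of the running pipeline are elided.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct Record { std::int64_t timestamp; };   // seconds since the epoch

// Segments keyed by their one-hour window; 'dirty' windows need reprocessing.
struct SegmentStore {
    std::map<std::int64_t, std::vector<Record>> segments;
    std::set<std::int64_t> dirty;

    void add(const Record& r, std::int64_t newest_window_seen) {
        std::int64_t window = r.timestamp / 3600;
        segments[window].push_back(r);
        // A record older than the segment currently being built is out of
        // sequence: its segment must be reprocessed with the new data included.
        if (window < newest_window_seen) dirty.insert(window);
    }
};
```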
- data processing application 107 may be configured to process multiple continuous data streams 102 in a parallel manner.
- Figure 3 shows a data processing application 308, similar in function to data processing application 107, that receives parallel continuous data streams 305-307 from a number of different data sources 302-304.
- Data processing application 308 is configured to operate on these individual streams 305-307 and to provide one or more results 310.
- results 310 may be, for example, a consolidated stream of data as a function of input streams 305-307.
- results 310 may be a real-time stream of records that may be stored in a database.
- the database is a relational database, and the relational database may be capable of performing parallel accesses to records in the database.
- System 301 as shown in Figure 3 is an example system that processes multiple parallel data sources.
- these sources may be HTTPD servers that generate streams of log file data.
- log file information must be consolidated from the multiple sources and then processed in a serial manner, or multiple processes must individually process the separate streams of data.
- throughput decreases because a sequential bottleneck has been introduced.
- a programmer explicitly manages the separate parallel processes that process the individual streams and consolidates individual stream data.
- System 301 may support multiple dimensions of parallelism.
- system 301 may operate on partitions of a data stream in parallel.
- system 301 may operate on one or more streams of data using a parallel pipeline.
- segmenter 104 may accept one or more continuous data streams 102 and operate upon those in parallel, and there may be a number of data processing operators 106 that operate on the discrete streams of data.
- Figure 4 shows a dataflow wherein multiple continuous data sources produce multiple continuous data streams, respectively.
- process 400 begins.
- system 301 may import a plurality of log files. These import processes may occur in parallel, and the results of these import processes may be passed off to one or more data processing operators 106 which perform processing upon the log files at steps 405-407. Although three data streams are shown, system 301 may process any number of parallel data streams, and include any number of parallel pipelines. These import processes may also repartition the data stream and reallocate different portions of the data stream to different data processing operators 106.
- log files are processed in a parallel manner, typically by different threads of execution of a processor of system 301. Processing that may be performed includes sort or merge operations on elements of the input data stream. These sort and merge processes may be capable of associating like data, or otherwise reorganizing data according to semantics 103 or predefined rules.
- each stream is processed, for instance, by a respective data processing operator 106. These data operators may perform functions including data detection, cleansing, and augmentation. Because input data streams can contain bad data, system 301 may be capable of detecting and rejecting this data. This detection may be based upon specific byte patterns that indicate the beginning of a valid record within the data stream, or other error detection and correction facilities as are known in the art.
- one or more portions of an incoming data stream may be "cleansed.”
- items in the data stream may be augmented with other information.
- Web site activity may be merged with data from other transactional sources in real-time, such as from sales departments, merchandise, and customer support to build one-to-one marketing applications. Therefore, system 301 may be capable of augmenting data streams based on, for instance, in-memory table lookups and database lookups. For example, augmenting a data stream with all the advertisers associated with a given ad would allow a user to perform a detailed analysis of advertising revenue per ad. Other types of data augmentation could be performed.
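- A small sketch of the in-memory lookup case: each record carrying an ad identifier is augmented with its advertiser from an in-memory table. The record layout and table contents are invented for illustration.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical log record; 'advertiser' is filled in by the augmentation step.
struct LogRecord {
    std::string user;
    std::string ad_id;
    std::string advertiser;
};

int main() {
    // In-memory lookup table (a database lookup could serve the same purpose).
    std::unordered_map<std::string, std::string> advertiser_by_ad = {
        {"ad-17", "Acme"}, {"ad-42", "Globex"}};

    std::vector<LogRecord> stream = {{"u1", "ad-17", ""}, {"u2", "ad-42", ""}};

    // Augment each record in place; unknown ads pass through unannotated.
    for (LogRecord& r : stream) {
        auto it = advertiser_by_ad.find(r.ad_id);
        if (it != advertiser_by_ad.end()) r.advertiser = it->second;
        std::cout << r.user << " " << r.ad_id << " " << r.advertiser << "\n";
    }
}
```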
- data for multiple streams may be aggregated.
- system 301 may provide several grouping operators that analyze and consolidate data from multiple streams. This provides, for example, the ability to analyze Web activity by efficiently grouping and analyzing data across several independent dimensions. More particularly, information needed to provide an accurate assessment of data may require an analysis of data from multiple sources.
- aggregated stream data is stored in one or more locations.
- data may be aggregated and stored in a relational database.
- system 301 may store information in a parallel manner in the relational database.
- System 301 may be implemented, for example, as a program that executes on one or more computer systems.
- These computer systems may be, for example, general purpose computer systems as is known in the art. More particularly, a general purpose computer includes a processor, memory, storage devices, and input/output devices as is well-known in the art.
- the general purpose computer system may execute an operating system upon which one or more systems may be designed using a computer programming language.
- Example operating systems include the Windows 95, 98, or Windows NT operating systems available from the Microsoft Corporation; Solaris, HPUX, Linux, or other Unix-based operating systems available from Sun Microsystems, Hewlett-Packard, Red Hat Computing, and a variety of providers, respectively; or any other operating system that is known now or in the future.
- Figure 5 shows a number of general purpose computers functioning as a client 501 and server 503.
- data processing application 107 may function as one or more processes executing on server 503.
- server 503 includes a server program 510 which performs one or more operations on a continuous data stream 102.
- server 503 includes an object framework 509 which serves as an application programming interface that may be used by a programmer to control processing of server program 510.
- Client 501 may include a management application 505, through which a user performs input and output 502 to perform management functions of the server program 510.
- Management application 505 may include a graphical user interface 506 that is configured to display and accept configuration data which determines how server program 510 operates.
- Management application 505 may also include an underlying client program 507 which manages user information and provides the user information to server program 510. Communication between client 501 and server 503 is performed through client communication 508 and server communication 511 over network 504.
- Client and server communication 508, 511 may include, for example, a networking protocol such as TCP/IP, and network 504 may be an Ethernet, ISDN, ADSL, or any other type of network used to communicate information between systems.
- Client-server and network communication is well-known in the art of computers and networking.
- Server 503 may, for example, store results 108 into one or more databases 512 associated with server 503. In one embodiment, database 512 is a parallel relational database. Server 503 may also store a number of user configuration files 513 which describe how server program 510 is to operate.
- data processing application 107 may be a client-server based architecture.
- This architecture may be designed in one or more programming languages, including JAVA, C++, and other programming languages.
- data processing application 107 is programmed in C++, and a C++ framework is defined that includes components or objects for processing data of data streams. These objects may be part of object framework 509. For example, there may be components for splitting, merging, joining, filtering, and copying data.
- Server program 510 manages execution of the data processing application 107 according to the user configuration file 513.
- This configuration file 513 describes underlying computer system resources, such as network names of processing nodes and computer system resources such as disk space and memory.
- the database 512 may be used to store related application information, such as metadata including schemas that describe data layouts, and any user-defined components and programs.
- Figure 6 shows an architecture framework 601 through which data processing application 107 may be implemented.
- architecture 601 may include a conductor process 602 which is responsible for creating single program behavior.
- process 602 establishes an instance of the data processing application 107.
- Conductor process 602 may also spawn section leader processes 603 and 604.
- the conductor process 602 spawns section leader processes 603 and 604 on the same or different systems, using the well-known Unix command "rsh" which executes remote commands.
- section leader processes are spawned one per physical computer system.
- Each section leader process 603-604 spawns player processes, one for each data processing operator 106 in the data flow, via the well-known fork() command.
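- The sketch below illustrates only the local fork() step of this layering on a POSIX system: a section leader forks one player process per operator and waits for them; the remote rsh spawning and the real control/status messaging are omitted, and all names are illustrative.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Stand-in for one operator's work inside a player process.
void run_player(int operator_id) {
    std::printf("player %d running operator %d\n",
                static_cast<int>(getpid()), operator_id);
}

int main() {
    // The section leader forks one player process per operator in the data flow.
    const int kOperators = 3;
    std::vector<pid_t> players;
    for (int op = 0; op < kOperators; ++op) {
        pid_t pid = fork();
        if (pid == 0) {                 // child: becomes a player process
            run_player(op);
            _exit(0);
        }
        if (pid > 0) players.push_back(pid);   // section leader tracks its players
    }
    // The section leader waits for its players; in the full system it would
    // relay status and errors to the conductor and terminate players on failure.
    for (pid_t pid : players) waitpid(pid, nullptr, 0);
    return 0;
}
```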
- the conductor may be, for example, executed on the same or separate computers as the section leader and/or player processes 605-610.
- Conductor process 602 communicates with section leader processes 603-604 by sending control information and receiving status messages along connections 611, 612, respectively.
- the section leader processes 603-604 likewise communicate to player processes 605-610 by issuing control information and receiving status and error messages.
- the conductor process 602 consolidates message traffic and ensures smooth program operation.
- on failure, section leader processes 603-604 halt program operation, terminating their controlled player processes and notifying other section leaders to do the same.
- Data processing application 107 may have associated with it an I/O manager for managing data throughout the framework. I/O manager may, for instance, communicate with the conductor process (or operator) to handle data flow throughout the architecture, and may communicate information to a data manager that is responsible for storing result data. I/O manager may provide one or more of the following functions:
- the I/O manager may provide a port interface to the data manager.
- a port may represent a logical connection.
- Ports may be, for example, an input port (an "inport") or an output port (an “outport”), and may be virtual or physical entities.
- An outport represents a single outbound stream and is created for each output partition of a persistent dataset.
- the process manager (conductor) creates connections between player processes.
- any virtual output port of a particular player process can have a single connection to a downstream player process.
- inports represent a single inbound stream, and one input port may be created for each inbound data stream.
- Inbound data streams for input virtual ports may be merged non-deterministically into a single stream of data blocks. Ordering of data blocks may be preserved in a given partition, but there may be no implied ordering among partitions. Because there is no implied ordering among partitions, deadlock situations may be avoided.
- Figure 7 shows a series of logical connections that may be established between two nodes 1 and 2, each having separate instances of operators A and B.
- node 1 includes a player operator (or process) A 701 and a player operator B 702, wherein operator A provides data in a serial manner to operator B for processing.
- operator A 703 of node 2 may also provide information in a serial manner to player operator B 702 of node 1.
- player operator A 701 may provide data for processing by player operator B 704 of node 2.
- One or more logical connections set up between operators 701-704 may facilitate this data transfer. In this manner, communication between parallel pipelined processes may occur.
- the data may be filtered to eliminate records that do not assist in or that may bias or otherwise impact the analysis of the data.
- the continuous stream of data is a log of information about requests issued to a server
- the log may be filtered to eliminate information about one or more requests.
- the kinds of information that may be eliminated include information about requests associated with various entities including computer programs called “spiders,” “crawlers” or “robots.” Such a program is executed by a search engine to access file servers on a computer network and gather files from them for indexing. These requests issued by spiders, crawlers and robots also are logged in the same manner as other requests to a server. These programs have host names and agent names which may be known. The filtering operation may filter any requests from users having names of known spiders, crawlers or robots. A server also may have a file with a predetermined name that specifies what files may be accessed on the server by spiders, crawlers and robots.
- Accesses to these files may be used to identify the host or agent name of spiders, crawlers or robots, which then may be used to filter out other accesses from these entities.
- Programs are readily available to detect such spiders, crawlers and robots. Further, elimination of duplicate data records or other data cleaning operations may be appropriate. Such filtering generally would be performed prior to applying the transactional semantics to segment the continuous stream of data, but may be performed after the data is segmented.
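- A sketch of such a filter is shown below; the known-robot agent list is a placeholder and the robots exclusion file path is assumed to be "/robots.txt". Requests matching a known agent, or coming from a host previously seen fetching the exclusion file, are dropped before segmentation.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Minimal view of one logged request.
struct Request {
    std::string host;
    std::string agent;
    std::string path;
};

// Drop requests from known spiders/crawlers/robots, and use accesses to the
// robots exclusion file to learn additional robot hosts to filter.
std::vector<Request> filter_robots(const std::vector<Request>& log) {
    const std::unordered_set<std::string> robot_agents = {"Googlebot", "Slurp"};
    std::unordered_set<std::string> robot_hosts;
    std::vector<Request> kept;
    for (const Request& r : log) {
        if (r.path == "/robots.txt") {           // robots reveal themselves here
            robot_hosts.insert(r.host);
            continue;
        }
        if (robot_agents.count(r.agent) || robot_hosts.count(r.host)) continue;
        kept.push_back(r);
    }
    return kept;
}

int main() {
    std::vector<Request> log = {
        {"10.0.0.1", "Mozilla/4.0", "/index.html"},
        {"10.0.0.2", "Googlebot", "/robots.txt"},
        {"10.0.0.2", "Googlebot", "/page.html"}};
    for (const Request& r : filter_robots(log))
        std::cout << r.host << " " << r.path << "\n";
}
```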
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Apparatus For Radiation Diagnosis (AREA)
Abstract
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2001505311A JP4600847B2 (ja) | 1999-06-18 | 2000-06-19 | トランザクションの意味規則を用いた連続データストリームのセグメント化および処理 |
AU56247/00A AU5624700A (en) | 1999-06-18 | 2000-06-19 | Segmentation and processing of continuous data streams using transactional semantics |
EP00941551A EP1314100A2 (fr) | 1999-06-18 | 2000-06-19 | Segmentation et traitement de flux de donnees continues au moyen de la semantique transactionnelle |
KR1020017016276A KR20020041337A (ko) | 1999-06-18 | 2000-06-19 | 트랜잭션 시맨틱스를 이용한 연속 데이터 스트림의세그먼테이션 및 처리 |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14000599P | 1999-06-18 | 1999-06-18 | |
US60/140,005 | 1999-06-18 | ||
US18566500P | 2000-02-29 | 2000-02-29 | |
US60/185,665 | 2000-02-29 |
Publications (4)
Publication Number | Publication Date |
---|---|
WO2000079415A2 WO2000079415A2 (fr) | 2000-12-28 |
WO2000079415A8 WO2000079415A8 (fr) | 2001-04-05 |
WO2000079415A9 true WO2000079415A9 (fr) | 2002-06-13 |
WO2000079415A3 WO2000079415A3 (fr) | 2003-02-27 |
Family
ID=26837781
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2000/016839 WO2000079415A2 (fr) | 1999-06-18 | 2000-06-19 | Segmentation et traitement de flux de donnees continues au moyen de la semantique transactionnelle |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP1314100A2 (fr) |
JP (1) | JP4600847B2 (fr) |
KR (1) | KR20020041337A (fr) |
CN (1) | CN100375088C (fr) |
AU (1) | AU5624700A (fr) |
WO (1) | WO2000079415A2 (fr) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7136912B2 (en) | 2001-02-08 | 2006-11-14 | Solid Information Technology Oy | Method and system for data management |
ATE515746T1 (de) | 2003-09-15 | 2011-07-15 | Ab Initio Technology Llc | Datenprofilierung |
US20050097565A1 (en) * | 2003-10-31 | 2005-05-05 | Udo Klein | Gathering message information |
US7571153B2 (en) * | 2005-03-28 | 2009-08-04 | Microsoft Corporation | Systems and methods for performing streaming checks on data format for UDTs |
US7400271B2 (en) * | 2005-06-21 | 2008-07-15 | International Characters, Inc. | Method and apparatus for processing character streams |
US7937344B2 (en) | 2005-07-25 | 2011-05-03 | Splunk Inc. | Machine data web |
CN102831214B (zh) | 2006-10-05 | 2017-05-10 | 斯普兰克公司 | 时间序列搜索引擎 |
US8688622B2 (en) * | 2008-06-02 | 2014-04-01 | The Boeing Company | Methods and systems for loading data into a temporal data warehouse |
CN102004631A (zh) * | 2010-10-19 | 2011-04-06 | 北京红旗中文贰仟软件技术有限公司 | 信息文档的处理方法及装置 |
CN103348598B (zh) | 2011-01-28 | 2017-07-14 | 起元科技有限公司 | 生成数据模式信息 |
CN102306200B (zh) * | 2011-09-22 | 2013-03-27 | 用友软件股份有限公司 | 增量数据操作语句的并发应用装置和方法 |
CN102388385B (zh) * | 2011-09-28 | 2013-08-28 | 华为技术有限公司 | 数据处理的方法和装置 |
US9323748B2 (en) | 2012-10-22 | 2016-04-26 | Ab Initio Technology Llc | Profiling data with location information |
US9892026B2 (en) | 2013-02-01 | 2018-02-13 | Ab Initio Technology Llc | Data records selection |
US11487732B2 (en) | 2014-01-16 | 2022-11-01 | Ab Initio Technology Llc | Database key identification |
EP3114578A1 (fr) | 2014-03-07 | 2017-01-11 | AB Initio Technology LLC | Opérations de profilage de données de gestion relatives à un type de données |
US9660930B2 (en) | 2014-03-17 | 2017-05-23 | Splunk Inc. | Dynamic data server nodes |
US9838346B2 (en) | 2014-03-17 | 2017-12-05 | Splunk Inc. | Alerting on dual-queue systems |
US9753818B2 (en) | 2014-09-19 | 2017-09-05 | Splunk Inc. | Data forwarding using multiple data pipelines |
US9922037B2 (en) | 2015-01-30 | 2018-03-20 | Splunk Inc. | Index time, delimiter based extractions and previewing for use in indexing |
WO2017118474A1 (fr) * | 2016-01-05 | 2017-07-13 | Huawei Technologies Co., Ltd. | Appareil et procédé de traitement de données et structure contenant des données |
CN106126658B (zh) * | 2016-06-28 | 2019-03-19 | 电子科技大学 | 一种基于虚拟存储器快照的数据库检查点构建方法 |
US11947978B2 (en) | 2017-02-23 | 2024-04-02 | Ab Initio Technology Llc | Dynamic execution of parameterized applications for the processing of keyed network data streams |
US10831509B2 (en) | 2017-02-23 | 2020-11-10 | Ab Initio Technology Llc | Dynamic execution of parameterized applications for the processing of keyed network data streams |
US11068540B2 (en) | 2018-01-25 | 2021-07-20 | Ab Initio Technology Llc | Techniques for integrating validation results in data profiling and related systems and methods |
CN109918391B (zh) * | 2019-03-12 | 2020-09-22 | 威讯柏睿数据科技(北京)有限公司 | 一种流式事务处理方法及系统 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3426428B2 (ja) * | 1995-10-27 | 2003-07-14 | 富士通株式会社 | トランザクションのトレース装置 |
US5721918A (en) * | 1996-02-06 | 1998-02-24 | Telefonaktiebolaget Lm Ericsson | Method and system for fast recovery of a primary store database using selective recovery by data type |
US5909681A (en) * | 1996-03-25 | 1999-06-01 | Torrent Systems, Inc. | Computer system and computerized method for partitioning data for parallel processing |
EP0840517A3 (fr) * | 1996-10-31 | 2003-09-10 | Matsushita Electric Industrial Co., Ltd. | Méthode et appareil de décodage d'un flux de données vidéo |
KR100198805B1 (ko) * | 1996-11-22 | 1999-06-15 | 정선종 | 분석 단계에서 트랜잭션 테이블 초기화 기법을 이용한 댕글링 트랜잭션 발생 방지 방법 |
-
2000
- 2000-06-19 JP JP2001505311A patent/JP4600847B2/ja not_active Expired - Fee Related
- 2000-06-19 CN CNB008105707A patent/CN100375088C/zh not_active Expired - Lifetime
- 2000-06-19 AU AU56247/00A patent/AU5624700A/en not_active Abandoned
- 2000-06-19 WO PCT/US2000/016839 patent/WO2000079415A2/fr not_active Application Discontinuation
- 2000-06-19 EP EP00941551A patent/EP1314100A2/fr not_active Withdrawn
- 2000-06-19 KR KR1020017016276A patent/KR20020041337A/ko not_active Application Discontinuation
Also Published As
Publication number | Publication date |
---|---|
CN100375088C (zh) | 2008-03-12 |
WO2000079415A3 (fr) | 2003-02-27 |
KR20020041337A (ko) | 2002-06-01 |
CN1575464A (zh) | 2005-02-02 |
JP2004500620A (ja) | 2004-01-08 |
JP4600847B2 (ja) | 2010-12-22 |
WO2000079415A2 (fr) | 2000-12-28 |
WO2000079415A8 (fr) | 2001-04-05 |
EP1314100A2 (fr) | 2003-05-28 |
AU5624700A (en) | 2001-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7752299B2 (en) | Segmentation and processing of continuous data streams using transactional semantics | |
US7392320B2 (en) | Segmentation and processing of continuous data streams using transactional semantics | |
WO2000079415A9 (fr) | Segmentation et traitement de flux de donnees continues au moyen de la semantique transactionnelle | |
US20200228392A1 (en) | Method and system for clustering event messages and manage event-message clusters | |
US10810103B2 (en) | Method and system for identifying event-message transactions | |
US7313575B2 (en) | Data services handler | |
US7010538B1 (en) | Method for distributed RDSMS | |
US10339465B2 (en) | Optimized decision tree based models | |
US7941524B2 (en) | System and method for collecting and storing event data from distributed transactional applications | |
US6029174A (en) | Apparatus and system for an adaptive data management architecture | |
US6502133B1 (en) | Real-time event processing system with analysis engine using recovery information | |
US6681230B1 (en) | Real-time event processing system with service authoring environment | |
US20090313157A1 (en) | Systems and methods for general aggregation of characteristics and key figures | |
US8830831B1 (en) | Architecture for balancing workload | |
US20040194090A1 (en) | Tracking customer defined workloads of a computing environment | |
US7103872B2 (en) | System and method for collecting and transferring sets of related data from a mainframe to a workstation | |
US8027996B2 (en) | Commitment control for less than an entire record in an in-memory database in a parallel computer system | |
US8386732B1 (en) | Methods and apparatus for storing collected network management data | |
Kvet et al. | Enhancing Analytical Select Statements Using Reference Aliases | |
US20240111608A1 (en) | Event-message collection, processing, and storage systems that are configurable to facilitate scaling, load-balancing, and selection of a centralization/decentralization level | |
EP4244731B1 (fr) | Le stockage et la recherche de donnees dans des memoires de donnees | |
US20240264977A1 (en) | Storing and searching for data in data stores | |
WO2023073414A1 (fr) | Stockage et recherche de données dans des magasins de données | |
Bünzli | Design and Implementation of Algorithms and Heuristics to Optimize a Data Generation and Preparation System for Credit Card Fraud Detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AU CA DE GB JP KR |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
AK | Designated states |
Kind code of ref document: C1 Designated state(s): AU CA CN DE DE GB JP KR NZ |
|
AL | Designated countries for regional patents |
Kind code of ref document: C1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
CFP | Corrected version of a pamphlet front page | ||
CR1 | Correction of entry in section i |
Free format text: PAT. BUL. 52/2000 UNDER (81) ADD "CN, DE (UTILITY MODEL), NZ"; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
|
ENP | Entry into the national phase |
Ref country code: JP Ref document number: 2001 505311 Kind code of ref document: A Format of ref document f/p: F |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1020017016276 Country of ref document: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 56247/00 Country of ref document: AU |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2000941551 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 008105707 Country of ref document: CN |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWP | Wipo information: published in national office |
Ref document number: 1020017016276 Country of ref document: KR |
|
AK | Designated states |
Kind code of ref document: C2 Designated state(s): AU CA CN DE DE GB JP KR NZ |
|
AL | Designated countries for regional patents |
Kind code of ref document: C2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
COP | Corrected version of pamphlet |
Free format text: PAGES 1/7-7/7, DRAWINGS, REPLACED BY NEW PAGES 1/7-7/7; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE |
|
WWP | Wipo information: published in national office |
Ref document number: 2000941551 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1020017016276 Country of ref document: KR |