CN114860684A - Stream data access method and device in stream data storage system - Google Patents

Stream data access method and device in stream data storage system

Info

Publication number
CN114860684A
Authority
CN
China
Prior art keywords
version
schema
stream data
data
storage system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110474997.4A
Other languages
Chinese (zh)
Inventor
范振勇
张学
查伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to PCT/CN2021/100270 (WO2022166071A1)
Publication of CN114860684A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/219 Managing data history or versioning
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for accessing stream data in a stream data storage system, relates to the technical field of data processing, and can quickly apply changes to a stream data schema and improve data analysis efficiency. The stream data storage system stores stream data whose stream data schema is a first version and stream data whose schema is a second version; the first-version schema and the second-version schema each correspond to a plurality of fields, and the fields of the first-version schema and the fields of the second-version schema are at least partially different. The method comprises the following steps: the stream data storage system receives a stream data read request sent by a consumer, where the version of the schema used by the consumer is the second version and the read request requests stream data whose schema is the first version; the stream data whose schema is the first version is stored in the stream data storage system by columns according to the fields of the first-version schema; and the stream data storage system acquires data, according to the fields of the second-version schema, from the corresponding columns of the column-stored stream data whose schema is the first version.

Description

Stream data access method and device in stream data storage system
The present application claims priority to the Chinese patent application entitled "method and apparatus for managing streaming data mode", filed with the national intellectual property office on 04/02/2021 under application number 202110154643.1, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for accessing streaming data in a streaming data storage system.
Background
Stream data is a sequence of data that arrives in order, in large volume, rapidly, and continuously; it can generally be viewed as a dynamic data collection that grows without bound over time. For example, stream data may be the data processed in a network monitoring system, a sensor network, an aerospace system, a weather measurement and control system, a financial service system, and the like. The stream data schema, also known as the metadata of stream data, describes the composition of binary stream data. For example, the stream data schema may include the fields of the stream data, each field having a corresponding name, type, default value, and so on.
With the rapid development of the Internet, the Internet of things (IoT), and the 5th Generation Mobile Communication Technology (5G), the volume of stream data keeps increasing, and the demand for real-time processing of stream data, such as data mining and information extraction, increases with it. An application that uses stream data (a data consumer) needs to decode binary stream data into structured data according to the stream data schema definition before it can perform data cleaning, data analysis, and the like. As stream data continues to be collected and stored, the schema may change, for example, by adding or deleting a field or by modifying the default value of a field, so that different versions of the schema are formed. When a consumer reads stream data, a new version of an application may need to read old-version data, or an old version of an application may need to read new-version data. This requires the encoding format of the stream data to be both backward and forward compatible. Therefore, the storage management of stream data needs to provide forward and backward compatibility for accessing stream data of different schema versions, so that a data consumer can rapidly parse the stream data according to the schema definition.
The current Schema Registry service supports the management of stream data schemas in the Protobuf format. Stream data in the Protobuf format is stored in a row-store format. When the version of the schema used by the consumer differs from the schema version of the stream data being read, the stream data storage system first needs to load the data corresponding to all fields of the schema of the stream data to be read, then extract from the loaded stream data the data corresponding to the fields of the schema used by the consumer, and finally return that data to the consumer. Therefore, for stream data with a large data volume and frequently changing schema versions, this method reads stream data inefficiently.
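For ease of understanding only, the row-store read path described above can be sketched as follows. The function name, field names, and sample records are hypothetical and are not taken from any Schema Registry or Protobuf API; the sketch merely shows that every stored field is loaded before the consumer's fields are selected.

```python
# Hypothetical row-store read path; all names and data are illustrative.
def read_row_store(rows, stored_fields, consumer_fields):
    """Load every field of every record, then keep only the consumer's fields."""
    projected = []
    loaded_values = 0
    for row in rows:
        record = dict(zip(stored_fields, row))  # all stored fields are loaded
        loaded_values += len(record)
        # Only now is the record narrowed to the consumer's schema.
        projected.append({name: record.get(name) for name in consumer_fields})
    return projected, loaded_values


stored_fields = ["field1", "field2", "field3", "field4"]
rows = [(1, "a", 0.5, True), (2, "b", 0.7, False)]
projected, loaded_values = read_row_store(rows, stored_fields, ["field1", "field2"])
print(projected)
print(loaded_values)
```

In this sketch the consumer needs two fields per record, yet all four stored fields of each record are loaded, which is the overhead the background attributes to row storage.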
Disclosure of Invention
The application provides a method and a device for accessing stream data in a stream data storage system, which can quickly apply schema changes, so that a data consumer can quickly parse the stream data according to the schema definition and the efficiency of reading stream data is improved.
To achieve the foregoing objective, the following technical solutions are used:
In a first aspect, a method for accessing stream data in a stream data storage system is provided. The stream data storage system stores stream data whose stream data schema is a first version and stream data whose schema is a second version, where the stream data whose schema is the first version and the stream data whose schema is the second version have the same stream data Topic; the version identifier of the first version and the version identifier of the second version both correspond to the identifier of the Topic; the first-version schema and the second-version schema each correspond to a plurality of fields; and the fields of the first-version schema and the fields of the second-version schema are at least partially different. The method comprises: receiving a stream data read request sent by a consumer, where the version of the schema used by the consumer is the second version, the stream data read request requests stream data whose schema is the first version, and the stream data whose schema is the first version is stored in the stream data storage system by columns according to the fields of the first-version schema; and acquiring data, according to the fields of the second-version schema, from the corresponding columns of the column-stored stream data whose schema is the first version.
In the above technical solution, when the stream data storage system processes a stream data read request, it may, according to the schema version currently used by the consumer, look up the column identifiers that correspond to the field identifiers of the schema version of the requested stream data, and read the corresponding data from the column storage spaces indicated by those column identifiers. Using a column storage format improves the data analysis efficiency of the stream data storage system. Because the data corresponding to all fields of the schema does not need to be read when stream data is read, data reading is more efficient and occupies less bandwidth.
In a possible implementation, the column identifiers used for column storage in the stream data storage system correspond to the field identifiers of the first-version schema. That the stream data storage system stores the stream data whose schema is the first version by columns according to the fields of the first-version schema specifically includes: storing the stream data whose schema is the first version into the column storage spaces indicated by the column identifiers corresponding to the field identifiers of the first-version schema.
In this possible implementation, the field identifiers of a schema version correspond to the column identifiers of the column storage spaces, so the stream data storage system can associate each schema version one-to-one with the stream data of that version. The schema is managed in multiple versions, and the stream data is stored separately for each schema version. Therefore, when reading data, the stream data storage system can read the column storage spaces indicated by the column identifiers that correspond to the field identifiers of the consumer's schema version, so that the stream data is read quickly and data analysis efficiency is improved.
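The correspondence between field identifiers and column identifiers may be illustrated by the following minimal sketch. It assumes an in-memory store; the class name `ColumnStore`, the column identifier format, and the sample records are hypothetical, and records are assumed to carry a value for every field of their schema version.

```python
from collections import defaultdict


class ColumnStore:
    """Toy column store (illustrative): one value list per column identifier."""

    def __init__(self):
        self.field_to_column = {}         # (version, field_id) -> column_id
        self.columns = defaultdict(list)  # column_id -> stored values

    def register_version(self, version, field_ids):
        # Each field of a schema version gets its own column storage space.
        for field_id in field_ids:
            self.field_to_column[(version, field_id)] = f"v{version}/{field_id}"

    def append(self, version, record):
        # Assumes the record has a value for every field of this version.
        for field_id, value in record.items():
            self.columns[self.field_to_column[(version, field_id)]].append(value)

    def read(self, version, field_ids):
        # Only the columns of the requested fields are touched.
        result = {}
        for field_id in field_ids:
            column_id = self.field_to_column.get((version, field_id))
            if column_id is not None:
                result[field_id] = list(self.columns[column_id])
        return result


store = ColumnStore()
store.register_version(1, ["field1", "field2", "field3"])
store.append(1, {"field1": 1, "field2": "a", "field3": 0.5})
store.append(1, {"field1": 2, "field2": "b", "field3": 0.7})
result = store.read(1, ["field1", "field3"])
print(result)
```

Here the read of `field1` and `field3` never accesses the column that holds `field2`, which is the property the implementation above relies on.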
In one embodiment, the stream data read request includes a data read interval, and the data read interval indicates the storage location from which the stream data whose schema is the first version is to be read.
In this possible implementation, the stream data storage system may determine the storage location of the stream data to be read according to the data read interval included in the stream data read request, which improves the flexibility of reading stream data.
In an embodiment, acquiring data, according to the fields of the second-version schema, from the corresponding columns of the column-stored stream data whose schema is the first version specifically includes: searching the field identifiers of the first-version schema for the target column identifiers corresponding to the field identifiers of the second-version schema; and reading, according to the data read interval, the data stored in the column storage spaces indicated by the target column identifiers.
In this possible implementation, the stream data storage system may, according to the fields of the second-version schema currently used by the consumer, find the corresponding column storage spaces among the fields of the first-version schema and read data from them, which improves the flexibility and efficiency of data reading.
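As a non-limiting illustration, the lookup of target column identifiers and the interval-based read can be sketched as below. The data read interval is assumed here to be a half-open range of record offsets `[start, end)`; the source does not specify its form, and all identifiers and values are hypothetical.

```python
def find_target_columns(first_version_columns, second_version_fields):
    """Map each second-version field that also exists in the first version
    to the column identifier that stores it (illustrative sketch)."""
    return {
        field_id: first_version_columns[field_id]
        for field_id in second_version_fields
        if field_id in first_version_columns
    }


def read_interval(column_space, target_columns, start, end):
    """Read only the target columns, restricted to offsets [start, end)."""
    return {
        field_id: column_space[column_id][start:end]
        for field_id, column_id in target_columns.items()
    }


# First-version schema: field identifier -> column identifier (assumed layout).
first_version_columns = {"field1": "col_a", "field2": "col_b", "field3": "col_c"}
column_space = {
    "col_a": [10, 11, 12, 13],
    "col_b": ["w", "x", "y", "z"],
    "col_c": [0.1, 0.2, 0.3, 0.4],
}
# The consumer's second-version schema drops field2 and adds field4.
second_version_fields = ["field1", "field3", "field4"]

targets = find_target_columns(first_version_columns, second_version_fields)
data = read_interval(column_space, targets, 1, 3)
print(targets)
print(data)
```

In this sketch `field4` has no column in the first-version data and is simply absent from the result; how such a field is filled is a separate matter for the conversion between schema versions.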
In an embodiment, acquiring data, according to the fields of the second-version schema, from the corresponding columns of the column-stored stream data whose schema is the first version specifically includes: obtaining, according to a conversion operator between the second-version schema and the first-version schema, a conversion algorithm for the fields required to read stream data of the first-version schema by using the second-version schema; and reading, according to the conversion algorithm, the data of the corresponding columns of the stream data whose schema is the first version.
In this possible implementation, when the stream data storage system processes a stream data read request, it can obtain the conversion algorithm for the stream data of the different versions involved in the request according to the conversion operators between the schema versions, which improves the flexibility and efficiency of data reading.
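The source does not define the form of the conversion operator. Purely as an assumption for illustration, the sketch below treats it as a per-field plan: a field present in both versions is read from its column, and a field present only in the consumer's version is filled with that field's default value. The function names and sample schemas are hypothetical.

```python
def build_conversion(reader_schema, writer_fields):
    """Derive a per-field plan for reading writer-version data with the
    reader's schema. reader_schema maps field -> default value (assumed)."""
    plan = []
    for field, default in reader_schema.items():
        if field in writer_fields:
            plan.append((field, "column", field))     # read the stored column
        else:
            plan.append((field, "default", default))  # field added later
    return plan


def apply_conversion(plan, columns, row_count):
    """Execute the plan against column-stored data of the writer version."""
    converted = {}
    for field, kind, arg in plan:
        if kind == "column":
            converted[field] = list(columns[arg])
        else:
            converted[field] = [arg] * row_count
    return converted


# Second version (consumer): field3 deleted, field4 added with a default.
reader_schema = {"field1": 0, "field2": "", "field4": "n/a"}
writer_fields = {"field1", "field2", "field3"}
columns = {"field1": [1, 2], "field2": ["a", "b"], "field3": [0.5, 0.7]}

plan = build_conversion(reader_schema, writer_fields)
converted = apply_conversion(plan, columns, row_count=2)
print(plan)
print(converted)
```

Under this assumption the column of the deleted `field3` is never read, and the added `field4` is produced without touching stored data.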
In one embodiment, before receiving the stream data read request sent by the consumer, the method further comprises: receiving stream data whose schema is the first version; and storing the stream data whose schema is the first version by columns according to the fields of the first-version schema.
In this possible implementation, the stream data storage system can convert historically written stream data into the column storage format, which improves both the compression ratio of data storage and the efficiency of data reading.
In one embodiment, the method further comprises: receiving stream data whose schema is the second version.
In this possible implementation, the write time of the stream data may correspond strictly to the change of the schema version. When the schema version changes, the stream data storage system may add a new schema version, and the stream data of the new schema version corresponds to a new storage location, such as a new directory. Therefore, the stream data storage system can quickly adapt to the change of the schema version without modifying the existing historical stream data.
In one embodiment, the stream data storage system stores the stream data whose schema is the second version by rows.
In this possible implementation, the stream data storage system may store newly written stream data by rows, which improves the write efficiency of the stream data and avoids affecting other applications in the stream data storage system.
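A minimal sketch of this write path, under the assumption of an in-memory store, is given below: newly written records are appended by rows, and the records of an older schema version are later converted to columns. The class `VersionedStore` and its methods are hypothetical names introduced only for illustration.

```python
class VersionedStore:
    """Illustrative store: rows for new writes, columns for converted history."""

    def __init__(self):
        self.row_log = {}      # version -> list of records (row format)
        self.column_data = {}  # version -> {field_id: [values]}

    def write(self, version, record):
        # Appending a whole record is cheap, which suits the write path.
        self.row_log.setdefault(version, []).append(dict(record))

    def convert_to_columns(self, version, field_ids):
        # Move the rows of one schema version into per-field columns.
        # Assumes every record has a value for every listed field.
        records = self.row_log.pop(version, [])
        columns = {f: [r[f] for r in records] for f in field_ids}
        self.column_data[version] = columns
        return columns


store = VersionedStore()
store.write(1, {"field1": 1, "field2": "a"})
store.write(1, {"field1": 2, "field2": "b"})
store.write(2, {"field1": 3, "field2": "c", "field4": True})  # new version, by rows

store.convert_to_columns(1, ["field1", "field2"])  # history becomes column-stored
print(store.column_data)
print(store.row_log)
```

In the sketch the conversion of version 1 does not modify the rows of version 2, consistent with storing the stream data of each schema version separately.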
In an embodiment, that the fields of the first-version schema and the fields of the second-version schema are at least partially different specifically includes: the second-version schema adds or deletes fields relative to the first-version schema, or the second-version schema and the first-version schema have different default values for the same field.
In this possible implementation, the stream data storage system manages the schema of stored stream data in multiple versions and creates schemas of different versions. When the schema version changes, a new schema version can be added, and the stream data of the new schema version corresponds to a new storage location, so that stream data of different schema versions is stored and managed separately. Therefore, the stream data storage system does not need to modify existing historical stream data, the stored schema versions of the stream data are compatible with each other, and the system can quickly adapt to schema version changes and improve the efficiency of reading stream data.
In a second aspect, a stream data storage apparatus is provided, which is applied to a stream data storage system that stores stream data whose schema is a first version and stream data whose schema is a second version, where the stream data whose schema is the first version and the stream data whose schema is the second version have the same stream data Topic; the version identifier of the first version and the version identifier of the second version both correspond to the identifier of the Topic; the first-version schema and the second-version schema each correspond to a plurality of fields; and the fields of the first-version schema and the fields of the second-version schema are at least partially different. The apparatus includes: a transmission module, configured to receive a stream data read request sent by a consumer, where the version of the schema used by the consumer is the second version, the stream data read request requests stream data whose schema is the first version, and the stream data whose schema is the first version is stored in the stream data storage apparatus by columns according to the fields of the first-version schema; and a processing module, configured to acquire data, according to the fields of the second-version schema, from the corresponding columns of the column-stored stream data whose schema is the first version.
In one embodiment, the processing module is further configured to: store the stream data whose schema is the first version into the column storage spaces indicated by the column identifiers corresponding to the field identifiers of the first-version schema.
In one embodiment, the stream data read request includes a data read interval, and the data read interval indicates the storage location from which the stream data whose schema is the first version is to be read.
In one embodiment, the processing module is specifically configured to: search the field identifiers of the first-version schema in the stream data storage apparatus for the target column identifiers corresponding to the field identifiers of the second-version schema; and read, according to the data read interval, the stream data stored in the column storage spaces indicated by the target column identifiers.
In an embodiment, the processing module is further specifically configured to: obtain, according to a conversion operator between the second-version schema and the first-version schema, a conversion algorithm for the fields required to read stream data of the first-version schema by using the second-version schema; and read, according to the conversion algorithm, the data of the corresponding columns of the stream data whose schema is the first version.
In one embodiment, the transmission module is further configured to receive stream data whose schema is the first version; and the processing module is further specifically configured to store the stream data whose schema is the first version by columns according to the fields of the first-version schema.
In one embodiment, the transmission module is further configured to: receive stream data whose schema is the second version.
In an embodiment, the processing module is further specifically configured to: store the stream data whose schema is the second version by rows.
In an embodiment, that the fields of the first-version schema and the fields of the second-version schema are at least partially different specifically includes: the second-version schema adds or deletes fields relative to the first-version schema, or the second-version schema and the first-version schema have different default values for the same field.
In a third aspect, a stream data storage system is provided. The stream data storage system stores stream data whose stream data schema is a first version and stream data whose schema is a second version, where the stream data whose schema is the first version and the stream data whose schema is the second version have the same stream data Topic; the version identifier of the first version and the version identifier of the second version both correspond to the identifier of the Topic; the first-version schema and the second-version schema each correspond to a plurality of fields; and the fields of the first-version schema and the fields of the second-version schema are at least partially different. The stream data storage system comprises an interface and a processor, and the interface communicates with the processor. The processor is configured to: receive a stream data read request sent by a consumer, where the version of the schema used by the consumer is the second version, the stream data read request requests stream data whose schema is the first version, and the stream data whose schema is the first version is stored in the stream data storage system by columns according to the fields of the first-version schema; and acquire data, according to the fields of the second-version schema, from the corresponding columns of the column-stored stream data whose schema is the first version.
In one embodiment, the processor is further configured to: store the stream data whose schema is the first version into the column storage spaces indicated by the column identifiers corresponding to the field identifiers of the first-version schema.
In one embodiment, the stream data read request includes a data read interval, and the data read interval indicates the storage location from which the stream data whose schema is the first version is to be read.
In one embodiment, the processor is specifically configured to: search the field identifiers of the first-version schema for the target column identifiers corresponding to the field identifiers of the second-version schema; and read, according to the data read interval, the data stored in the column storage spaces indicated by the target column identifiers.
In one embodiment, the processor is specifically configured to: obtain, according to a conversion operator between the second-version schema and the first-version schema, a conversion algorithm for the fields required to read stream data of the first-version schema by using the second-version schema; and read, according to the conversion algorithm, the data of the corresponding columns of the stream data whose schema is the first version.
In one embodiment, the processor is further configured to: receive stream data whose schema is the first version; and store the stream data whose schema is the first version by columns according to the fields of the first-version schema.
In one embodiment, the processor is further configured to: receive stream data whose schema is the second version.
In one embodiment, the stream data storage system stores the stream data whose schema is the second version by rows.
In an embodiment, that the fields of the first-version schema and the fields of the second-version schema are at least partially different specifically includes: the second-version schema adds or deletes fields relative to the first-version schema, or the second-version schema and the first-version schema have different default values for the same field.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium contains computer instructions that are applied to a stream data storage system, and the stream data storage system stores stream data whose schema is a first version and stream data whose schema is a second version, where the stream data whose schema is the first version and the stream data whose schema is the second version have the same stream data Topic; the version identifier of the first version and the version identifier of the second version both correspond to the identifier of the Topic; the first-version schema and the second-version schema each correspond to a plurality of fields; and the fields of the first-version schema and the fields of the second-version schema are at least partially different. The stream data storage system comprises an interface and a processor, and the interface communicates with the processor. When the processor executes the computer instructions, the stream data storage system performs the method for accessing stream data in a stream data storage system according to any implementation of the first aspect.
In a fifth aspect, a computer program product is provided. The computer program product comprises computer instructions that are applied to a stream data storage system, and the stream data storage system stores stream data whose schema is a first version and stream data whose schema is a second version, where the stream data whose schema is the first version and the stream data whose schema is the second version have the same stream data Topic; the version identifier of the first version and the version identifier of the second version both correspond to the identifier of the Topic; the first-version schema and the second-version schema each correspond to a plurality of fields; and the fields of the first-version schema and the fields of the second-version schema are at least partially different. The stream data storage system comprises an interface and a processor, and the interface communicates with the processor. When the processor executes the computer instructions, the stream data storage system performs the method for accessing stream data in a stream data storage system according to any implementation of the first aspect.
Drawings
Fig. 1 is a system architecture diagram of a stream data storage service provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a method for accessing stream data in a stream data storage system according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of another method for accessing stream data in a stream data storage system according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of another stream data storage system according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a stream data schema management apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of another stream data storage system according to an embodiment of the present application.
Detailed Description
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, the meaning of "a plurality" is two or more unless otherwise specified.
It should be noted that, in the present application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or a description. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "for example" is intended to present related concepts in a concrete fashion.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, the implementation environment and application scenario of the embodiments of the present application are briefly introduced.
As shown in Fig. 1, the embodiments of the present application may be applied to a stream data storage service system, which may include a stream data storage system (also referred to as a storage device), a stream data producer (producer for short), and a stream data consumer (consumer for short). As shown in Fig. 1, there may be a plurality of producers, such as producer 1 and producer 2, and a plurality of consumers, such as consumer 1 and consumer 2.
The producer may be configured to write stream data (streaming data) into the stream data storage system, and a stream storage service layer of the stream data storage system may call a communication interface to obtain and store the stream data. For example, the producer may write stream data continuously over a certain period of time. The producer may specifically be a device that collects stream data in the IoT or the Internet, a transit device that forwards stream data, or an Extract-Transform-Load (ETL) module inside the stream data storage system.
For a given stream data topic (Topic), the stream data schemas written by producers of different versions may differ; in addition, the stream data schema may change with the time at which the producer writes the stream data. The stream data Topic is a unit of stream data that can be sent and subscribed to, and consumers and producers read and write stream data by Topic. A Topic is a logical concept for storing stream data and is a collection of stream data. Each piece of stream data belongs to a category. Physically, the stream data of different Topics is stored separately, and each Topic may have multiple producers sending stream data to it and multiple consumers consuming its stream data.
For example, the schema of the stream data written by producer 1 at time 1 includes field 1, field 2, and field 3, and the stream data storage system may define this schema as schema version 1. The schema of the stream data written at time 2 includes field 1, field 2, field 3, and field 4; that is, field 4 has been added compared with the historical data, and the stream data storage system may define this schema as schema version 2. The schema of the stream data written by producer 2 includes field 1 and field 2, and the stream data storage system may define this schema as schema version 3.
The schema version set in the stream data storage system may include a plurality of schema versions and the attributes corresponding to each schema version. The stream data storage system may use multiple versions to store stream data of different schema versions.
The consumer may be configured to request the stream data storage system to read the stream data, and specifically, may be an application or an engine that calls the stream data, such as a Spark or a Flink data engine, and may also be an ETL module of the stream data storage system.
The schema version used by the consumer may differ from the schema version of the stream data being read. For example, consumer 1 needs to read stream data of schema version 2 using schema version 1, consumer 2 needs to read stream data of schema version 3 using schema version 2, and so on.
Based on this, the embodiments of the present application provide a method and an apparatus for accessing stream data in a stream data storage system. By storing stream data of one or more schema versions in columns, the stream data storage system can quickly read stream data of different schema versions, which improves read efficiency. A consumer can quickly parse stream data according to a schema definition; that is, stream data of any schema version can be read through one schema version. This achieves forward and backward compatibility between schema versions, and parsing and converting stream data of different versions is efficient enough that other application services are not affected.
As shown in fig. 2, an embodiment of the present application provides a method for accessing streaming data in a streaming data storage system, which may be applied to the streaming data storage system, and the method may include the following processes.
201: the stream data storage system stores stream data of which the schema is a first version and stream data of which the schema is a second version.
The stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic, and the version identifier of the first version and the version identifier of the second version both correspond to the identifier of the Topic.
Illustratively, the Topic includes at least streaming data with a first version of schema and streaming data with a second version of schema. The fields of the first version of the schema are not identical to the fields of the second version of the schema. For example, the first version of the schema and the second version of the schema correspond to a plurality of fields, respectively, and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different.
The stream data storage system can store the stream data whose schema is the first version in columns according to the fields of the first version of the schema. Storing the stream data in columns may improve the compression rate of the stream data.
In one embodiment, as shown in fig. 3, the specific process in step 201 in which the stream data storage system stores the stream data whose schema is the first version and the stream data whose schema is the second version may include the following steps.
301: the stream data storage system creates a first version of the stream data schema corresponding to a stream data topic (Topic).
The version identifier of the first version corresponds to the identifier of the stream data Topic. Each stream data Topic corresponds to a unique Topic identifier, and the corresponding stream data Topic can be determined from that identifier. Illustratively, one specific implementation of the version identifier may be a version number.
Illustratively, the Topic identifier may be implemented by using a serial number, a Universally Unique Identifier (UUID), or the like. A Topic identifier is generated when the Topic is created; after the Topic is deleted, the corresponding Topic identifier is discarded and is no longer used.
In one embodiment, when the stream data storage system creates a stream data Topic, deletes it, and then creates a stream data Topic with the same name, the new Topic has a different identifier; that is, the unique identifiers of the earlier and later Topics are different.
The stream data storage system may obtain the corresponding schema definition according to the Topic identifier and the schema version identifier. One stream data Topic may correspond to one or more version identifiers of the stream data mode schema, that is, one stream data Topic may have one or more schema versions.
The schema version identifier may be a monotonically increasing serial number, an initial schema version identifier is generated when a schema of stream data Topic is created, and then the corresponding schema version identifier is increased each time a schema version is modified. For example, the first version of the schema corresponding to Topic ID1 is V1, the second version is V2, and so on.
In one embodiment, each schema version corresponds to a plurality of fields, and each field corresponds to a unique field identifier. Each time a new field is added, a new field identifier is assigned.
Illustratively, the field identifier may be a monotonically increasing sequence number with an initial value of 0; when a field is added to the schema version, the sequence number is incremented by 1 and allocated to the field as its field identifier. When the default value of a field is modified, the field identifier does not change. When the data type of a field is modified compatibly, for example from a 16-bit integer to a 32-bit integer, the field identifier may also remain unchanged. If the data type of a field is modified from a character string to a 32-bit integer, this is equivalent to deleting the original field and creating a new field, and the new field receives a new identifier.
In one embodiment, when a field is deleted and a field with the same name is added, the corresponding field identifications of the two fields with the same name are different.
In addition, each field corresponds to a default value. That is, when a piece of stream data written by the data producer does not include the field, the stream data storage system may return the default value corresponding to the field when reading that piece of stream data.
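The field identifier rules above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiments: the class `FieldRegistry`, its method names, the type names, and the `WIDENINGS` table of compatible type changes are all hypothetical names introduced here.

```python
from typing import Any, Dict, Tuple

# Hypothetical table of type changes treated as compatible (identifier kept).
WIDENINGS = {("int16", "int32")}


class FieldRegistry:
    """Allocates field identifiers for the schema of one Topic (illustrative)."""

    def __init__(self) -> None:
        self.last_id = 0  # sequence number starts at 0 and is never reused
        self.fields: Dict[str, Tuple[int, str, Any]] = {}  # name -> (id, type, default)

    def add_field(self, name: str, dtype: str, default: Any) -> int:
        self.last_id += 1  # add 1 and allocate the result as the field identifier
        self.fields[name] = (self.last_id, dtype, default)
        return self.last_id

    def delete_field(self, name: str) -> None:
        del self.fields[name]

    def set_default(self, name: str, default: Any) -> int:
        field_id, dtype, _ = self.fields[name]
        self.fields[name] = (field_id, dtype, default)  # identifier unchanged
        return field_id

    def change_type(self, name: str, new_dtype: str) -> int:
        field_id, dtype, default = self.fields[name]
        if (dtype, new_dtype) in WIDENINGS:
            self.fields[name] = (field_id, new_dtype, default)  # identifier unchanged
            return field_id
        # Incompatible change: equivalent to deleting the field and creating a new one.
        self.delete_field(name)
        return self.add_field(name, new_dtype, None)  # default left unset in this sketch
```

Deleting a field and adding one with the same name goes through `add_field` again, so the two same-named fields end up with different identifiers.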
302: the stream data storage system receives stream data with the schema in a first version.
For example, the stream data storage system may receive a data write request from a data producer and thereby receive the stream data whose schema is the first version.
The stream data storage system stores the written stream data according to the version of the schema; that is, the schema is managed in multiple versions, each schema version corresponds one to one with the stream data written under it, and the stream data is stored separately per schema version, for example, with one directory storing the stream data of one schema version.
The stream data may be stored with cold-hot separation according to the writing time. The writing time of the stream data and the changes of the schema version correspond strictly: when the schema version changes, the stream data storage system can add a new schema version, and the stream data of the new schema version is given a new storage location, such as a new directory. Therefore, the stream data storage system can quickly adapt to a schema version change without modifying the existing historical stream data. The stream data storage system may also store stream data of a new schema version on a high-performance storage medium and stream data of a historical schema version on a relatively low-performance storage medium. Further, the stream data of the new schema version may be stored by rows, and the stream data of the historical schema version may be stored by columns.
In one embodiment, when the schema of the stream data newly written by the producer is different from the schema of the first version, the stream data storage system creates a second version of the schema corresponding to the stream data Topic. That the schema of the written stream data is different from the schema of the first version may specifically mean that, compared with the schema of the first version, the schema of the written stream data adds, modifies, or deletes one or more fields, or modifies the default values of fields included in the schema of the first version.
Illustratively, if the schema of the stream data newly written by the producer has one more field, field 4, than the schema of the first version, the stream data storage system creates a second version V2 of the schema corresponding to the stream data Topic; the second version V2 adds field 4 compared with the first version V1, and a default value is defined for field 4. The subsequently written stream data may be stored in the storage area or directory corresponding to the second version.
In the embodiments of the present application, the stream data storage system may store stream data, in writing order, to the storage location of the corresponding schema version, for example, to the directory of the corresponding schema version. When the schema version of the written stream data changes, the stream data storage system only needs to add a new schema version and store the stream data to the storage location corresponding to that version in order to complete parsing and storage of the stream data; the change is transparent to programs such as producers and consumers.
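The write path described above, in which a schema change only adds a new version and a new storage location, can be sketched as follows. This is a simplified illustration under assumptions: `TopicStore` is a hypothetical class, in-memory lists stand in for storage directories, and a schema is represented as a mapping from field identifier to default value.

```python
from typing import Any, Dict, List


class TopicStore:
    """Keeps one 'directory' (here, a list) per schema version of a Topic."""

    def __init__(self) -> None:
        self.current_version = 0  # 0 means no schema version exists yet
        self.schemas: Dict[int, Dict[int, Any]] = {}
        self.directories: Dict[int, List[Dict[int, Any]]] = {}

    def write(self, schema: Dict[int, Any], record: Dict[int, Any]) -> int:
        """Store a record and return the schema version it was stored under."""
        changed = (
            self.current_version == 0
            or self.schemas[self.current_version] != schema
        )
        if changed:
            # Add a new schema version and a new storage location;
            # directories of earlier versions are left untouched.
            self.current_version += 1
            self.schemas[self.current_version] = dict(schema)
            self.directories[self.current_version] = []
        self.directories[self.current_version].append(record)
        return self.current_version
```

Because earlier directories are never rewritten, historical stream data stays valid under the schema version it was written with.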
The distinction between cold and hot data in the embodiments of the present application refers to the life cycle of stream data: newly written stream data is referred to as hot stream data, and historically written stream data is referred to as cold stream data. The stream data storage system may be configured with a time period for distinguishing cold and hot data based on the time at which the stream data was written; for example, stream data whose writing time is within the time period is hot stream data, and stream data whose writing time is outside the time period is cold stream data.
In one embodiment, the hot stream data may be stored by rows. For example, the stream data storage system may parse the hot stream data by using a Key-Value mapping model such as Protobuf or Thrift; as long as the field ID and the field value of the stream data correspond to each other, the stream data versions are forward and backward compatible.
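A time-window classification of this kind can be sketched as follows. The function names `classify` and `storage_layout` and the idea of passing the window as a `timedelta` are assumptions made for illustration; the mapping of hot data to row storage and cold data to column storage follows the embodiments described here.

```python
from datetime import datetime, timedelta


def classify(write_time: datetime, now: datetime, hot_window: timedelta) -> str:
    """Return "hot" if the data was written within hot_window before now, else "cold"."""
    return "hot" if now - write_time <= hot_window else "cold"


def storage_layout(temperature: str) -> str:
    """Hot stream data is stored by rows, cold stream data by columns."""
    return "row" if temperature == "hot" else "column"
```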
303: the stream data storage system converts the stream data of which the schema is the first version into the data stored in columns corresponding to the fields of the schema of the first version.
That is, the stream data storage system stores the stream data of which the schema is the first version into the column storage space indicated by the column identification corresponding to the field identification of the schema of the first version.
In one embodiment, the streaming data storage system may convert cold streaming data into a columnar format.
Specifically, the stream data storage system may parse the stream data according to the schema version and then store it by columns, where each column identifier corresponds to a field identifier in the schema definition; that is, the column identifiers of the column-format data corresponding to the first version correspond one to one with the field identifiers of the first version.
For example, the storage format may be implemented by using a general-purpose format such as Optimized Row Columnar (ORC), Apache Parquet, or Apache CarbonData, with the stream data of each schema version stored separately, for example, under different subdirectories. Apache Parquet is a column storage format capable of efficiently storing nested data, and Apache CarbonData is a newer big data file storage format.
The column storage format means that data is stored by columns, that is, each column is stored separately: for example, all values of field 1 in the stream data are stored together, all values of field 2 in the stream data are stored together, and so on. If the data type of field 1 is int (integer type), the column storage data set corresponding to field 1 consists entirely of integer data.
Specifically, the column identifier of the column storage data corresponding to the first version of the schema corresponds to the field identifier of the schema of the first version one by one, and the column identifier of the column storage data corresponding to the schema of the second version corresponds to the field identifier of the schema of the second version one by one.
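The conversion from written records to column storage keyed by field identifier can be sketched as follows. This is a minimal in-memory illustration, not a real ORC or Parquet writer; the function name `rows_to_columns` and the representation of a record as a mapping from field identifier to value are assumptions.

```python
from typing import Any, Dict, List


def rows_to_columns(
    rows: List[Dict[int, Any]], schema_defaults: Dict[int, Any]
) -> Dict[int, List[Any]]:
    """Convert key-value records into columns; column id equals field id.

    schema_defaults maps each field id of the write-time schema version to its
    default value, which is used when a record omits the field.
    """
    columns: Dict[int, List[Any]] = {field_id: [] for field_id in schema_defaults}
    for row in rows:
        for field_id, default in schema_defaults.items():
            columns[field_id].append(row.get(field_id, default))
    return columns
```

Every column then holds values of a single field, so values of the same type sit next to each other, which is what makes column storage compress well.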
For example, when the written stream data is converted into the column storage format, the conversion may use optimization techniques such as Just In Time (JIT) compilation, Single Instruction Multiple Data (SIMD), or vector instructions. This is not limited in this application.
304: the stream data storage system receives the stream data with the schema in the second version.
The stream data storage system may store the stream data whose schema is the second version by rows.
It should be noted that the examples of the present application only take the stream data whose schema is the first version as the historical-version stream data, i.e., the cold stream data, and the stream data whose schema is the second version as the new-version stream data, i.e., the hot stream data; this does not limit the embodiments of the present application.
In the above embodiment, row storage has high write efficiency, so the write efficiency of the hot stream data can be improved. In stream data stored by columns, the data under the same column identifier has a consistent data type and similar characteristics, so it compresses more efficiently. Therefore, for massive historical stream data, storing the stream data by columns can greatly improve the compression ratio, save storage space, and reduce the bandwidth used when reading data.
In one embodiment, the stream data storage system may set a corresponding default value for each field. The stream data is converted into column-stored data according to its current schema version, and it can be converted to the schema version used by the consumer when the consumer reads the data. Specifically, when the consumer reads historical stream data using a different schema version, a field that does not exist in the schema of the historical stream data can be read based on the default value of that field.
202: the stream data storage system receives a stream data read request sent by a consumer.
For example, the version of the schema of the stream data used by the consumer is the second version, and the stream data read request is used to request the stream data whose schema is the first version.
Wherein the version of the schema of the streaming data used by the consumer may be a current schema version of the streaming data storage system, e.g., the second version; alternatively, it may be a historical schema version, e.g., the first version. In the embodiments of the present application, only the example that the schema version used by the consumer is inconsistent with the current schema version of the stream data storage system is given as an example, and the embodiments of the present application are not limited thereto.
203: the stream data storage system retrieves data from a corresponding column of the first version of stream data stored in columns according to fields of the second version of the schema.
In one embodiment, the stream data read request in step 202 may include a data read interval, where the data read interval is used to indicate the storage location from which the stream data whose schema is the first version is to be read.
Further, in step 203, the stream data storage system obtaining data, according to the fields of the second version of the schema, from the corresponding columns of the column-stored stream data whose schema is the first version may specifically include:
1. Searching, among the field identifiers of the first version of the schema in the stream data storage system, for the target column identifiers corresponding to the field identifiers of the second version of the schema.
2. Reading, according to the read interval, the data stored in the column storage space indicated by the target column identifiers.
In the foregoing embodiment of the present application, the stream data storage system converts the written stream data to column storage, so that when the schema version used by the consumer does not match the schema version of the stream data requested, the stream data storage system can look up, in the column storage space, the column identifiers corresponding to the field identifiers of the schema version used by the consumer and read the corresponding data. Storing the stream data by columns improves the data parsing efficiency of the stream data storage system: data of fields other than the specified fields does not need to be read, so reads are more efficient and use less bandwidth.
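The two steps above can be sketched as follows. This is an illustrative sketch only: `read_interval` is a hypothetical function, the read interval is assumed to be a half-open range of row positions, and filling a field that has no stored column with its default value follows the default-value behaviour described for historical data.

```python
from typing import Any, Dict, List


def read_interval(
    stored: Dict[int, List[Any]],
    reader_fields: Dict[int, Any],
    start: int,
    stop: int,
) -> Dict[int, List[Any]]:
    """Read rows [start, stop) for the fields of the consumer's schema version.

    stored maps column id (equal to field id) to column values;
    reader_fields maps each field id of the consumer's schema to its default.
    """
    row_count = len(next(iter(stored.values()), []))
    start, stop = max(0, start), min(stop, row_count)
    n = max(0, stop - start)
    result: Dict[int, List[Any]] = {}
    for field_id, default in reader_fields.items():
        if field_id in stored:
            # Step 1: the field id identifies the target column.
            # Step 2: only that column is read, restricted to the interval.
            result[field_id] = stored[field_id][start:stop]
        else:
            # Field unknown to the stored version: use the consumer's default.
            result[field_id] = [default] * n
    return result
```

Columns whose identifiers are not in the consumer's schema are never touched, which is the source of the bandwidth saving mentioned above.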
For example, in connection with the stream data storage system shown in fig. 1, the following describes an embodiment of the present application by taking an example in which the stream data storage system creates a Topic of stream data, writes the stream data, and a subsequent consumer reads the stream data.
1. The stream data storage system creates schema version V1 for Topic1. V1 includes fields a and b: the data type of field a is the integer type int, its field identifier is 1, and its default value is 10; the data type of field b is the variable-length character string type varchar, its field identifier is 2, its default value is empty, and the maximum length of the string is 128.
For example, the streaming data storage system executes the following instructions:
create topic t1 (a int default 10, b varchar(128) default '');
The create topic instruction may be used by the stream data storage system to create a version of the schema of a stream data topic, for example, to create the version of the schema corresponding to the topic identifier t1. The create topic instruction may also include the field types and default values included in that version of the schema.
2. The stream data storage system stores the stream data written by the stream data producer in columns, creates a new storage directory 1, and the corresponding schema of the stream data has a version V1.
3. When the schema of the stream data written by the stream data producer includes a new field c, the stream data storage system creates a new schema version V2 for the stream data currently being written to Topic1; V2 adds field c compared with V1.
For example, the data type of field c is integer type int, the field id is 3, and the default value is 100.
4. The stream data storage system stores the stream data written by the stream data producer in columns, creates a new storage directory 2, and the corresponding schema of the stream data has a version V2.
5. The stream data storage system needs to modify the default value of the field c of the schema corresponding to Topic1, and then the stream data storage system creates a new schema version V3 for Topic1, and V3 modifies the default value of the field c compared with V2.
For example, at this time, the data type of the field c of the schema corresponding to the V3 version is still the integer type int, the field identifier is still 3, and the default value is 50.
6. The stream data storage system stores the stream data written by the stream data producer in columns, creates a new storage directory 3, and the corresponding schema of the stream data has a version V3.
7. The stream data storage system needs to delete field c of the schema corresponding to Topic1, so the stream data storage system creates a new schema version V4 for Topic1; V4 deletes field c compared with V3.
8. The stream data storage system stores the stream data written by the stream data producer in columns, creates a new storage directory 4, and the corresponding schema of the stream data has a version V4.
9. The stream data storage system needs to add a field c to the schema of Topic1, so the stream data storage system creates a new schema version V5 for Topic1; V5 adds field c compared with V4.
For example, at this time, the data type of field c in the V5 schema is still the integer type int, the field identifier changes to 4, and the default value is 200.
10. The stream data storage system stores the stream data written by the stream data producer in columns, creates a new storage directory 5, and the corresponding schema of the stream data has a version V5.
11. The stream data storage system needs to delete field a of the schema of Topic1, and then the stream data storage system creates a new version of the schema for Topic1 as V6, and V6 deletes field a in contrast to V5.
12. The stream data storage system stores the stream data written by the stream data producer in columns, creates a new storage directory 6, and the corresponding schema of the stream data has a version V6.
The data read by the consumer is illustrated below for several scenarios:
Scene 1: the consumer's stream data schema version is V3, and the consumer reads the historical data in storage directory 1, whose stream data schema version is V1. Field c is automatically added to the data read by the consumer, with the V3 default value of 50.
Scene 2: the consumer's stream data schema version is V5, and the consumer reads the historical data in storage directory 1, whose stream data schema version is V1. Field c is automatically added to the data read by the consumer, with the V5 default value of 200.
Scene 3: the consumer's stream data schema version is V6, and the consumer reads the historical data in storage directory 1; the default value of field a automatically added to the data read by the consumer is 10, as defined in version V1.
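Scenes 1 and 2 can be reproduced with a small sketch. The schema table below encodes versions V1 to V5 from the steps above (field identifier, field name, default value); the stored values in `DIRECTORY_1`, the empty-string default for field b, and the function `read_as` are hypothetical and only serve the illustration.

```python
from typing import Any, Dict, List, Tuple

Schema = Dict[int, Tuple[str, Any]]  # field id -> (field name, default value)

SCHEMAS: Dict[str, Schema] = {
    "V1": {1: ("a", 10), 2: ("b", "")},
    "V2": {1: ("a", 10), 2: ("b", ""), 3: ("c", 100)},
    "V3": {1: ("a", 10), 2: ("b", ""), 3: ("c", 50)},
    "V4": {1: ("a", 10), 2: ("b", "")},
    "V5": {1: ("a", 10), 2: ("b", ""), 4: ("c", 200)},
}

# Hypothetical column-stored contents of storage directory 1 (written under V1).
DIRECTORY_1: Dict[int, List[Any]] = {1: [7, 8], 2: ["x", "y"]}


def read_as(reader_version: str, stored: Dict[int, List[Any]]) -> Dict[str, List[Any]]:
    """Read stored columns through the consumer's schema version."""
    row_count = len(next(iter(stored.values())))
    result: Dict[str, List[Any]] = {}
    for field_id, (name, default) in SCHEMAS[reader_version].items():
        # A field id with no stored column is filled with the consumer's default.
        result[name] = stored.get(field_id, [default] * row_count)
    return result


scene_1 = read_as("V3", DIRECTORY_1)  # c filled with the V3 default 50
scene_2 = read_as("V5", DIRECTORY_1)  # c (field id 4) filled with the V5 default 200
```

Because field c in V5 has identifier 4 rather than 3, it can never be confused with the earlier, deleted field c, even though the names match.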
According to the embodiments of the present application, for the storage of massive stream data, multi-version schema management makes the stored stream data forward and backward compatible across schema versions. When the schema version changes, historical data does not need to be modified and existing data is not affected, and data can be parsed and read quickly, so the data parsing efficiency is high.
Furthermore, during data writing, Protobuf, Thrift and the like can be adopted to support the encoding of the KV model, and the writing efficiency is high. The storage device can store the written stream data according to columns, and the column ID of the column storage format data corresponds to the field ID in the schema version one by one, so that efficient data format conversion and identification management are realized.
In an embodiment, the stream data storage system may further generate a conversion operator between any two schema versions corresponding to the stream data Topic, so that the stream data storage system or the consumer can obtain the conversion algorithm between any two different schema versions according to the conversion operator and thereby read stream data of any schema version.
Further, in step 203 in the foregoing embodiment, the stream data storage system obtaining data, according to the fields of the second version of the schema, from the corresponding columns of the column-stored stream data whose schema is the first version may specifically include:
1. Obtaining, according to the conversion operator between the second version of the schema and the first version of the schema, the conversion algorithm for the fields required to read the stream data of the first version using the second version of the schema.
2. Reading, according to the conversion algorithm, the data of the corresponding columns of the stream data whose schema is the first version.
Illustratively, to save the time needed to generate converters, as shown in fig. 4, the stream data storage system may further include a converter cache (Cache) for storing the conversion operators between schema versions. Generating a converter requires computing the conversion algorithm between different schema versions, so the stream data storage system may store the converters corresponding to any two schema versions in the converter Cache, for example, [from version V1, to version V2; converter 1], [from version V3, to version V2; converter 2], and so on. This can further improve data parsing efficiency.
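A converter cache keyed by the pair (from version, to version) can be sketched as follows. `ConverterCache` and its `build` callback are hypothetical names; how a converter is actually computed is left to the callback, and the `build_count` counter exists only to make the caching visible.

```python
from typing import Any, Callable, Dict, Tuple


class ConverterCache:
    """Caches one converter per (from_version, to_version) pair."""

    def __init__(self, build: Callable[[str, str], Any]) -> None:
        self._build = build  # computes a converter when it is not cached yet
        self._converters: Dict[Tuple[str, str], Any] = {}
        self.build_count = 0  # number of converters actually generated

    def get(self, from_version: str, to_version: str) -> Any:
        key = (from_version, to_version)
        if key not in self._converters:
            self._converters[key] = self._build(from_version, to_version)
            self.build_count += 1
        return self._converters[key]
```

Repeated reads with the same version pair then reuse the stored converter instead of recomputing it.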
For any two schema versions, stream data are stored in columns, and the difference between schema versions can be summarized into three types: deleting a field, adding a field, or modifying a field, and the three differences may correspond to 3 conversion operators.
Illustratively, the conversion operator $F(V_k, V_{k+1})$ from the schema of version $V_k$ to the schema of version $V_{k+1}$ may satisfy the following condition:
$F(V_k, V_{k+1}) = f_1(\{\text{deleted column set}\}),\ f_2(\{\text{newly added column set}\}),\ f_3(\{\text{modified column set}\})$.
Specifically, the operation function $f_1(x)$ corresponding to the deleted columns may be executed first, then the operation function $f_2(x)$ for the newly added columns, and finally the operation function $f_3(x)$ for the modified columns, so as to obtain the final conversion operator.
Further, the conversion algorithm between two schema versions that are not adjacent may be obtained from the union of the conversion operators between the adjacent schema versions lying between them.
For example, the conversion operator $F(V_k, V_m)$ from the schema of version $V_k$ to the schema of version $V_m$, where $k < m$, may satisfy the following condition:
$F(V_k, V_m) = F(V_k, V_{k+1}) \cup F(V_{k+1}, V_{k+2}) \cup \cdots \cup F(V_{m-1}, V_m)$
In addition, if $k > m$, the conversion operator from the schema of version $V_k$ to the schema of version $V_m$ may be the inverse operation of the above algorithm, which is not described herein again.
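One possible reading of these operators is sketched below for the case k < m. It is an interpretation, not the patented algorithm: a schema version is assumed to be a mapping from field identifier to default value, "modified" is restricted to default-value changes, and the composed result is represented as a read plan that says, per field identifier, whether to read the stored column or to fill in a default. The names `diff` and `compose` are hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

Schema = Dict[int, Any]          # field id -> default value
Plan = Dict[int, Tuple[str, Any]]  # field id -> ("stored", None) or ("default", value)


def diff(old: Schema, new: Schema) -> Tuple[Set[int], Schema, Schema]:
    """Return (deleted ids, added id -> default, modified id -> new default)."""
    deleted = set(old) - set(new)
    added = {fid: d for fid, d in new.items() if fid not in old}
    modified = {fid: d for fid, d in new.items() if fid in old and old[fid] != d}
    return deleted, added, modified


def compose(versions: List[Schema], k: int, m: int) -> Plan:
    """Combine the adjacent operators from versions[k] up to versions[m], k < m."""
    plan: Plan = {fid: ("stored", None) for fid in versions[k]}
    for i in range(k, m):
        deleted, added, modified = diff(versions[i], versions[i + 1])
        for fid in deleted:                 # f1: drop deleted columns
            plan.pop(fid, None)
        for fid, default in added.items():  # f2: new columns come from defaults
            plan[fid] = ("default", default)
        for fid, default in modified.items():  # f3: update defaults still in use
            if plan[fid][0] == "default":
                plan[fid] = ("default", default)
    return plan
```

With the versions from the Topic1 example, composing V1 to V3 yields a plan that reads fields 1 and 2 from storage and fills field 3 with the default 50, which matches scene 1.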
In the above embodiment, when the schema changes, the multi-version storage does not need to change the existing data, the multi-version schema corresponds to different storage directories, and the column stored data may reflect operations such as adding, deleting, or modifying fields between the multi-version schema depending on the correspondence between the column identifier and the field identifier. And the consumer can rapidly realize the conversion and the analysis between different versions according to the conversion operator between the schema versions, thereby improving the data analysis efficiency.
In addition, an embodiment of the present application further provides a stream data storage device, which is applied to the stream data storage system, where the stream data storage device stores stream data of which a schema is a first version and stream data of which the schema is a second version; wherein the first version of stream data of the schema and the second version of stream data of the schema have the same stream data Topic; the version identification of the first version of the schema and the version identification of the second version of the schema both correspond to the identification of Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the field of the first version of the schema and the field of the second version of the schema are at least partially different.
As shown in fig. 5, the apparatus 500 may include a transmission module 501 and a processing module 502.
The transmission module 501 may be configured to receive a stream data read request sent by a consumer, where the version of the schema of the stream data used by the consumer is the second version, and the stream data read request is used to request the stream data whose schema is the first version. The stream data whose schema is the first version is stored in the device 500 in columns according to the fields of the first version of the schema.
The processing module 502 may be configured to obtain data from a corresponding column of the first version of streaming data stored in a column-wise manner according to a field of the second version of the schema.
In one embodiment, the column identifiers of the data stored by columns in the device 500 correspond to the field identifiers of the first version of the schema. The processing module 502 may be specifically configured to: store the stream data whose schema is the first version into the column storage space indicated by the column identifiers corresponding to the field identifiers of the first version of the schema.
In one embodiment, the stream data read request includes a data read interval, and the data read interval is used to indicate the storage location from which the stream data whose schema is the first version is to be read.
In an embodiment, the processing module 502 may be specifically configured to: search, among the field identifiers of the first version of the schema in the stream data storage device 500, for the target column identifiers corresponding to the field identifiers of the second version of the schema; and read, according to the read interval, the stream data stored in the column storage space indicated by the target column identifiers.
In an embodiment, the processing module 502 may be further specifically configured to: obtain, according to the conversion operator between the second version of the schema and the first version of the schema, the conversion algorithm for the fields required to read the stream data of the first version using the second version of the schema; and read, according to the conversion algorithm, the data of the corresponding columns of the stream data whose schema is the first version.
In one embodiment, the transmission module 501 may further be configured to receive the stream data whose schema is the first version. The processing module 502 may be further specifically configured to store the stream data whose schema is the first version in columns according to the fields of the first version of the schema.
In one embodiment, the transmission module 501 may further be configured to receive the stream data whose schema is the second version.
In an embodiment, the processing module 502 may be further specifically configured to store the stream data whose schema is the second version in rows.
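To contrast the row-wise storage used for second-version data with the column-wise storage used for first-version data, the two layouts can be sketched side by side. The helper names `to_columns` and `to_rows`, the field identifiers, and the in-memory representations are hypothetical; the text only states that one version is stored by column and the other by row.

```python
def to_columns(records, field_ids):
    """Column layout: one list of values per field identifier."""
    return {fid: [rec.get(fid) for rec in records] for fid in field_ids}


def to_rows(records, field_ids):
    """Row layout: one tuple per record, fields in schema order."""
    return [tuple(rec.get(fid) for fid in field_ids) for rec in records]


v1_records = [{"f1": "a", "f2": 1}, {"f1": "b", "f2": 2}]
v2_records = [{"f1": "c", "f3": True}]

v1_layout = to_columns(v1_records, ["f1", "f2"])  # first version, by column
v2_layout = to_rows(v2_records, ["f1", "f3"])     # second version, by row
```

In the row layout a whole record is read in one piece, which suits consumers whose schema version matches the stored data, while the column layout allows individual fields to be selected when the versions differ.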
In an embodiment, that the fields of the first version of the schema and the fields of the second version of the schema are at least partially different specifically includes: compared with the first version of the schema, the second version of the schema has fields added or deleted, or the second version of the schema and the first version of the schema have different default values for the same field.
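The three kinds of difference named above (added fields, deleted fields, changed default values) can be made concrete with a small sketch. It assumes, for illustration only, that a schema version is represented as a mapping from field identifier to default value; the function name `diff_schemas` and the sample fields are hypothetical.

```python
def diff_schemas(v1, v2):
    """Compare two schema versions given as {field_id: default_value}."""
    added = sorted(set(v2) - set(v1))
    deleted = sorted(set(v1) - set(v2))
    default_changed = sorted(
        fid for fid in set(v1) & set(v2) if v1[fid] != v2[fid]
    )
    return {
        "added": added,
        "deleted": deleted,
        "default_changed": default_changed,
    }


v1 = {"f1": "", "f2": 0, "f3": "n/a"}
v2 = {"f1": "", "f3": "unknown", "f4": False}
diff = diff_schemas(v1, v2)
```

In this example `f4` is added, `f2` is deleted, and `f3` keeps its identifier but changes its default value, covering each case listed in the embodiment.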
In addition, the present application also provides a stream data storage system. As shown in fig. 6, the stream data storage system 600 may be used to implement the method described in the above method embodiments; for details, refer to the description in those embodiments. The stream data storage system 600 comprises a processor 601 and an interface 602, and the processor 601 communicates with the interface 602. The processor 601 includes a Central Processing Unit (CPU) and a memory 603 in which computer instructions are stored, and the CPU executes the computer instructions in the memory 603 to perform the method described in the above embodiments. In addition, to save the computing resources of the CPU, a Field Programmable Gate Array (FPGA) or other hardware may also be used to perform all operations of the CPU in the embodiment of the present invention, or the FPGA or other hardware and the CPU may each perform part of the operations of the CPU in the embodiment of the present invention. For convenience of description, the embodiments of the present invention refer collectively to the combination of the CPU and the memory, and to the various implementations described above, as the processor. The interface 602 may be a Network Interface Card (NIC) or a Host Bus Adapter (HBA).
The streaming data storage system provided by the embodiment of the present invention may be implemented specifically by including one or more servers, and a specific structure of the server may be as shown in fig. 6.
The stream data storage apparatus shown in fig. 5 in the embodiment of the present invention may be implemented by software, for example by computer instructions that implement the embodiment of the present invention; it may also be implemented by hardware, or by a combination of software and hardware.
It will also be appreciated that the memory in the embodiments of the present application can be either volatile memory or non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
An embodiment of the present invention provides a computer program product including computer instructions, where the computer instructions are applied to a stream data storage system, where the stream data storage system stores stream data with a first version of a stream data schema and stream data with a second version of the stream data schema; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the streaming data storage system comprises an interface and a processor, the interface being in communication with the processor; execution of the computer instructions by the processor causes the streaming data storage system to perform the methods described in the previous embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, which contains computer instructions, where the computer instructions are applied to a streaming data storage system, and the streaming data storage system stores streaming data with a first version of a schema and streaming data with a second version of the schema; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the streaming data storage system comprises an interface and a processor, the interface being in communication with the processor; when the processor executes the computer instructions, the streaming data storage system executes the method described in the previous embodiment. Wherein the computer-readable storage medium may be a non-volatile computer-readable storage medium.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, communication device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is to be understood that steps or messages that have the same function in different embodiments of the present application may be cross-referenced between those embodiments.
Finally, it should be noted that: the above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (29)

1. A stream data access method in a stream data storage system, characterized in that the stream data storage system stores stream data of which the schema is a first version and stream data of which the schema is a second version; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the method comprises the following steps:
receiving a stream data reading request sent by a consumer; the version of the schema of the stream data used by the consumer is the second version; the stream data reading request is used for requesting the stream data of which the schema is the first version; the stream data of which the schema is the first version is stored in the stream data storage system in columns according to the fields of the schema of the first version;
and acquiring, according to the fields of the schema of the second version, data from the corresponding columns of the column-stored stream data of which the schema is the first version.
2. The method according to claim 1, wherein column identifiers stored in columns in the stream data storage system correspond to field identifiers of the first version of schema; the stream data storage system stores, in columns, the stream data of which the schema is the first version according to the fields of the schema of the first version, and specifically includes:
and storing the stream data of which the schema is the first version into a column storage space indicated by the column identification corresponding to the field identification of the schema of the first version.
3. The method according to claim 1 or 2, wherein a data reading interval is included in the stream data reading request, and the data reading interval is used for indicating the storage location from which the stream data of which the schema is the first version is to be read.
4. The method according to claim 3, wherein the obtaining data from the corresponding column of the stream data of which the schema is the first version stored by column according to the field of the schema of the second version specifically comprises:
searching a target column identifier corresponding to the field identifier of the schema of the second version from the field identifier of the schema of the first version;
and reading the data stored in the column storage space indicated by the target column identification according to the reading interval.
5. The method according to claim 3, wherein the obtaining data from the corresponding column of the stream data of which the schema is the first version stored by column according to the field of the schema of the second version specifically comprises:
according to a conversion operator between the second version of the schema and the first version of the schema, obtaining a conversion algorithm of a field required for reading stream data of the first version of the schema according to the second version of the schema;
and reading the data of the corresponding column of the stream data of which the schema is the first version according to the conversion algorithm.
6. The method of any of claims 1-5, wherein prior to receiving the request for reading the streaming data sent by the consumer, the method further comprises:
receiving the stream data of which the schema is a first version;
and storing the stream data of which the schema is the first version according to the fields of the schema of the first version by columns.
7. The method according to any one of claims 1-6, further comprising:
and receiving the stream data of which the schema is the second version.
8. The method of claim 7, wherein the stream data storage system stores the stream data of which the schema is the second version in rows.
9. The method according to any one of claims 1 to 8, wherein the field of the first version of schema and the field of the second version of schema are at least partially different, specifically comprising:
the second version of the schema has fields added or deleted compared with the first version of the schema, or the second version of the schema and the first version of the schema have different default values for the same field.
10. A stream data storage device, characterized in that the stream data storage device is applied to a stream data storage system, and the stream data storage system stores stream data with a first version of a schema and stream data with a second version of the schema; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the first version of the schema and the second version of the schema respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the device comprises:
the transmission module is used for receiving a stream data reading request sent by a consumer; the version of the schema of the stream data used by the consumer is the second version; the stream data reading request is used for requesting the stream data of which the schema is the first version; the stream data of which the schema is the first version is stored in the stream data storage device in columns according to the fields of the schema of the first version;
and the processing module is used for acquiring, according to the fields of the schema of the second version, data from the corresponding columns of the column-stored stream data of which the schema is the first version.
11. The apparatus of claim 10, wherein the processing module is further configured to:
and storing the stream data of which the schema is the first version into a column storage space indicated by the column identification corresponding to the field identification of the schema of the first version.
12. The apparatus according to claim 10 or 11, wherein a data reading interval is included in the stream data reading request, and the data reading interval is used to indicate the storage location from which the stream data of which the schema is the first version is to be read.
13. The apparatus of claim 12, wherein the processing module is specifically configured to:
searching a target column identifier corresponding to the field identifier of the schema of the second version from the field identifier of the schema of the first version in the stream data storage device;
and reading the stream data stored in the column storage space indicated by the target column identification according to the reading interval.
14. The apparatus of claim 12, wherein the processing module is further specifically configured to:
according to a conversion operator between the second version of the schema and the first version of the schema, obtaining a conversion algorithm of a field required for reading stream data of the first version of the schema according to the second version of the schema;
and reading the data of the corresponding column of the stream data of which the schema is the first version according to the conversion algorithm.
15. The apparatus of any of claims 10-14, wherein the transmission module is further configured to: receiving the stream data of which the schema is a first version;
the processing module is specifically further configured to: and storing the stream data of which the schema is the first version according to the fields of the schema of the first version by columns.
16. The apparatus of any of claims 10-15, wherein the transmission module is further configured to:
and receiving the stream data of which the schema is the second version.
17. The apparatus of claim 16, wherein the processing module is further specifically configured to: store the stream data of which the schema is the second version in rows.
18. The apparatus according to any one of claims 10 to 17, wherein the field of the first version of schema and the field of the second version of schema are at least partially different, specifically comprising:
the second version of the schema has fields added or deleted compared with the first version of the schema, or the second version of the schema and the first version of the schema have different default values for the same field.
19. A stream data storage system is characterized in that the stream data storage system stores stream data with a first version of stream data schema and a second version of stream data schema; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the streaming data storage system comprises an interface and a processor, the interface being in communication with the processor; the processor is configured to:
receiving a stream data reading request sent by a consumer; the version of the schema of the stream data used by the consumer is the second version; the stream data reading request is used for requesting the stream data of which the schema is the first version; the stream data of which the schema is the first version is stored in the stream data storage system in columns according to the fields of the schema of the first version;
and acquiring, according to the fields of the schema of the second version, data from the corresponding columns of the column-stored stream data of which the schema is the first version.
20. The streaming data storage system of claim 19, wherein the processor is further configured to:
and storing the stream data of which the schema is the first version into a column storage space indicated by the column identification corresponding to the field identification of the schema of the first version.
21. The stream data storage system according to claim 19 or 20, wherein a data read interval is included in the stream data read request, and the data read interval is used to indicate the storage location from which the stream data of which the schema is the first version is to be read.
22. The streaming data storage system of claim 21, wherein the processor is specifically configured to:
searching a target column identifier corresponding to the field identifier of the schema of the second version from the field identifier of the schema of the first version;
and reading the data stored in the column storage space indicated by the target column identification according to the reading interval.
23. The streaming data storage system of claim 21, wherein the processor is specifically configured to:
according to a conversion operator between the second version of the schema and the first version of the schema, obtaining a conversion algorithm of a field required for reading stream data of the first version of the schema according to the second version of the schema;
and reading the data of the corresponding column of the stream data of which the schema is the first version according to the conversion algorithm.
24. The streaming data storage system of any of claims 19-23, wherein the processor is further configured to:
receiving the stream data of which the schema is a first version;
and storing the stream data of which the schema is the first version according to the fields of the schema of the first version by columns.
25. The streaming data storage system of any of claims 19-24, wherein the processor is further configured to:
and receiving the stream data of which the schema is the second version.
26. The streaming data storage system of claim 25, wherein the streaming data storage system stores the stream data of which the schema is the second version in rows.
27. The stream data storage system of any of claims 19-26, wherein the fields of the first version of the schema and the fields of the second version of the schema are at least partially different, in particular comprising:
the second version of the schema has fields added or deleted compared with the first version of the schema, or the second version of the schema and the first version of the schema have different default values for the same field.
28. A computer-readable storage medium containing computer instructions for use in a streaming data storage system that stores streaming data having a first version of a schema and a second version of the schema; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the streaming data storage system comprises an interface and a processor, the interface being in communication with the processor; the processor executing the computer instructions causes the streaming data storage system to perform the method of streaming data access in a streaming data storage system as claimed in any one of claims 1 to 9.
29. A computer program product comprising computer instructions for use in a streaming data storage system, the streaming data storage system storing streaming data having a first version of a schema and a second version of the schema; wherein the stream data with the first version of the schema and the stream data with the second version of the schema have the same stream data Topic; the version identification of the first version and the version identification of the second version both correspond to the identification of the Topic; the schema of the first version and the schema of the second version respectively correspond to a plurality of fields; and the fields of the first version of the schema and the fields of the second version of the schema are at least partially different; the streaming data storage system comprises an interface and a processor, the interface being in communication with the processor; the computer instructions, when executed by the processor, cause the streaming data storage system to perform the method of streaming data access in a streaming data storage system of any of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/100270 WO2022166071A1 (en) 2021-02-04 2021-06-16 Stream data access method and apparatus in stream data storage system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110154643 2021-02-04
CN2021101546431 2021-02-04

Publications (1)

Publication Number Publication Date
CN114860684A (en) 2022-08-05

Family ID=82628058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110474997.4A Pending CN114860684A (en) 2021-02-04 2021-04-29 Stream data access method and device in stream data storage system

Country Status (2)

Country Link
CN (1) CN114860684A (en)
WO (1) WO2022166071A1 (en)

Also Published As

Publication number Publication date
WO2022166071A1 (en) 2022-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination