CN114676138A

CN114676138A - Data processing method, electronic device and readable storage medium

Info

Publication number: CN114676138A
Application number: CN202210317485.1A
Authority: CN
Inventors: 谢超; 葛希; 龙际全; 栾小凡
Original assignee: Shanghai Xuyu Intelligent Technology Co ltd
Current assignee: Shanghai Xuyu Intelligent Technology Co ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-06-28

Abstract

The present application relates to the field of computer technologies, and in particular, to a data processing method, an electronic device, and a readable storage medium. The data processing method comprises the following steps: receiving a data query request; determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset; and determining, from the first information and the second information of the plurality of index information sets of the target data set, target index information matching the data query request and a target data entity corresponding to the target index information. The data processing method in the embodiment of the application adds storage and query support for character string type data and enriches the query capability for data entities.

Description

Data processing method, electronic device and readable storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data processing method, an electronic device, and a readable storage medium.

Background

With the development of computer technology, data volumes keep growing and data processing becomes increasingly difficult. Data includes structured data and unstructured data. Structured data is typically represented as a two-dimensional table structure and may be, for example, numeric, date, or string data. Unstructured data, such as pictures, video, and text, is typically processed into vectors.

For unstructured data, a computer generally processes the data into vector type data. When a data entity containing both structured data and unstructured data is stored, the data entity is split in the storage area into non-character string type scalar columns and vector columns, or into non-character string type scalar rows and vector rows, for persistent storage, and an index is built on the vector column. When a data query is needed, filtering conditions on the vector column are generated from the query request, and the vector column data is filtered using the index built on the vector column. For example, a face image is processed into vector type unstructured data representing the face, which is combined with scalar type structured data representing the age of the person in the image; the two are stored as a vector column representing the face data and a scalar column representing the age, and the face image together with its corresponding age can be treated as one data entity.

With the above data processing method, a query on data entities supports only non-character string scalar type data and vector type data. Few data types can be queried, data query efficiency is low, and the query capability of the computer for data entities is limited.

Disclosure of Invention

The embodiment of the application provides a data processing method, electronic equipment and a readable storage medium, which increase the storage and query of character string type data and enrich the query capability of data entities.

In a first aspect, an embodiment of the present application provides a data processing method, which is used for an electronic device, and includes:

receiving a data query request;

determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset, wherein the data subsets comprise a plurality of data entities, and a part of the data entities in the plurality of data entities comprise first data of a character string type and second data of a non-character string type; and the index information set at least comprises first information for characterizing the correspondence between first data in the data subset and a string index, and second information for characterizing the correspondence between the string index in the data subset and the data entity;

and determining target index information matched with the data query request and a target data entity corresponding to the target index information in the first information and the second information of a plurality of index information sets of the target data set.

According to the data processing method provided by the embodiment of the application, querying a target data set that includes character string data makes it possible to query the character string data associated with unstructured data, which expands the queryable data types. Meanwhile, character string data and non-character string data can be queried simultaneously, which enriches the query capability for data entities.

In a possible implementation manner of the first aspect, the first information used for characterizing the correspondence between the first data in the data subset and the string index is dictionary tree information.

In a possible implementation manner of the first aspect, the data query request includes third data of a string type;

the determining, from the first information and the second information of multiple index information sets of the target data set, target index information that matches the data query request and a target data entity corresponding to the target index information includes:

determining state information for the at least one data subset of the target data set, the state information including a sealed state and a growing state;

if the at least one data subset is in the sealed state, determining a target character string index corresponding to the third data as the target index information according to the dictionary tree information;

and determining the target data entity corresponding to the target character string index according to the second information.

In a possible implementation manner of the first aspect, the determining, according to the dictionary tree information, that the target string index corresponding to the third data is the target index information includes:

looking up the third data in the dictionary tree information;

and determining the target character string index corresponding to the third data as the target index information according to the dictionary tree information.

In a possible implementation manner of the first aspect, the determining, in the first information and the second information of multiple index information sets of the target data set, target index information that matches the data query request and a target data entity corresponding to the target index information further includes:

and determining the target data entity as a data query result corresponding to the data query request.

In a possible implementation manner of the first aspect, the data query request further includes fourth data of a non-character string type.

In a possible implementation manner of the first aspect, the data query request includes a query condition, and the query condition includes at least one of the following:

a Boolean expression;

a prefix matching condition characterizing a character string prefix;

an exact matching condition characterizing a character string.
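As a minimal sketch only, the three kinds of query condition listed above could be modeled as composable predicates over a data entity. The helper names (`exact`, `prefix`, `all_of`, `any_of`, `filter_entities`) and the field names `name` and `age` are hypothetical and do not come from the application.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]
Predicate = Callable[[Entity], bool]


def exact(field: str, value: str) -> Predicate:
    """Exact matching condition on a string field (hypothetical helper)."""
    return lambda e: e.get(field) == value


def prefix(field: str, head: str) -> Predicate:
    """Prefix matching condition on a string field (hypothetical helper)."""
    return lambda e: isinstance(e.get(field), str) and e[field].startswith(head)


def all_of(*conditions: Predicate) -> Predicate:
    """Boolean AND of several conditions."""
    return lambda e: all(c(e) for c in conditions)


def any_of(*conditions: Predicate) -> Predicate:
    """Boolean OR of several conditions."""
    return lambda e: any(c(e) for c in conditions)


def filter_entities(entities: List[Entity], condition: Predicate) -> List[Entity]:
    """Return the entities that satisfy the condition, in their original order."""
    return [e for e in entities if condition(e)]


entities = [
    {"name": "Alice", "age": 31},
    {"name": "Alan", "age": 19},
    {"name": "Bob", "age": 45},
]

# Boolean expression combining a string prefix condition with a non-string condition.
condition = all_of(prefix("name", "Al"), lambda e: e["age"] > 20)
matched = filter_entities(entities, condition)
```

In this sketch the conditions are evaluated by scanning entities directly; the application instead resolves string conditions through the index information set.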

In a second aspect, an embodiment of the present application provides a data processing method, which is used for an electronic device, and includes:

receiving a data insertion request, wherein the data insertion request comprises a character string set to be inserted, and the character string set comprises a plurality of character string data;

and responding to the data insertion request, writing the plurality of character string data into corresponding data subsets of a target data set respectively, wherein each data subset comprises at least one data entity, and the at least one data entity comprises at least one character string data.

It can be understood that, in the data processing method provided in the embodiment of the present application, by constructing corresponding dictionary tree information and mapping information for the string data of the data subsets, the construction of the index of the string data can be quickly and effectively achieved, and the constructed mapping relationship can be used for implementing the construction of the index of the field column of the non-string data, thereby providing a basis for more types of scalar indexes.

In a possible implementation manner of the second aspect, the method further includes:

acquiring the at least one character string data in the data subset;

according to the at least one character string data, constructing third information of the data subset, wherein the third information is used for representing the corresponding relation between the at least one character string data in the data subset and a character string index;

and according to each character string data in the third information, the corresponding character string index and the data entity corresponding to each character string data, constructing fourth information for representing the corresponding relation between the character string index and the data entity in the data subset.
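The construction steps above can be sketched as follows, assuming the string-to-index correspondence is held in a dictionary tree (trie) and the index-to-entity correspondence in a plain mapping from string index to row offsets. `TrieNode`, `build_segment_index`, and `lookup` are hypothetical names, and assigning indexes in first-seen order is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on the node where a stored string ends.
        self.string_index: Optional[int] = None


def build_segment_index(strings: List[str]) -> Tuple[TrieNode, Dict[int, List[int]]]:
    """strings[i] is the string field of the data entity at row offset i."""
    root = TrieNode()
    index_to_entities: Dict[int, List[int]] = {}
    next_index = 0
    for offset, text in enumerate(strings):
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        if node.string_index is None:
            # First time this string is seen: assign it a new string index.
            node.string_index = next_index
            next_index += 1
        index_to_entities.setdefault(node.string_index, []).append(offset)
    return root, index_to_entities


def lookup(root: TrieNode, text: str) -> Optional[int]:
    """Return the string index of an exactly matching string, or None."""
    node = root
    for ch in text:
        child = node.children.get(ch)
        if child is None:
            return None
        node = child
    return node.string_index


trie, mapping = build_segment_index(["alice", "bob", "alice", "al"])
```

Here `trie` plays the role of the string-to-index information and `mapping` the role of the index-to-entity information; duplicate strings share one string index and map to several row offsets.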

In a possible implementation manner of the second aspect, the writing the plurality of character string data into the corresponding data subsets in response to the data insertion request includes:

in response to the data insertion request, dividing the plurality of character string data into M character string subsets, and sending the character string subsets to corresponding M data nodes, wherein M is greater than or equal to 2;

and respectively writing the character string data in the character string subsets of the M data nodes into the corresponding data subsets.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; one or more memories; the one or more memories store one or more programs that, when executed by the one or more processors, cause the electronic device to perform the data processing method of the first or second aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which instructions are stored, and when executed on a computer, the instructions cause the computer to perform the data processing method of the first aspect or the second aspect.

In a fifth aspect, the present application provides a computer program product, which includes computer programs/instructions, and when the computer programs/instructions are executed by a processor, the data processing method of the first aspect or the second aspect is implemented.

Drawings

FIG. 1 is a schematic diagram of a data query system according to some embodiments of the present application;

FIG. 2 is a schematic diagram of the structure of a data table according to some embodiments of the present application;

FIG. 3 is a schematic flow chart of a data processing method according to some embodiments of the present application;

FIG. 4 is a flow diagram of data insertion according to some embodiments of the present application;

FIG. 5 is a schematic flow diagram of data insertion according to some embodiments of the present application;

FIG. 6 is a schematic diagram of an exemplary process for building an index according to some embodiments of the present application;

FIG. 7 is a flow diagram of a process for building an index according to some embodiments of the present application;

FIG. 8 is a schematic flow chart of a data processing method according to some embodiments of the present application;

FIG. 9 is an interaction diagram of a data query process according to some embodiments of the present application;

FIG. 10 is an interaction diagram of another data query process according to some embodiments of the present application;

FIG. 11 is a diagram of an exemplary hardware configuration of an electronic device according to some embodiments of the present application.

Detailed Description

The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.

Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".

In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Before describing the aspects of the present application, some concepts and terms related to the present application will be described below with reference to fig. 1 and 2 to facilitate understanding of the aspects of the present application.

Fig. 1 illustrates a data query system according to an embodiment of the present application.

As shown in FIG. 1, the data query system includes an agent component 1, a coordination service component 2, a message storage component 3, an execution component 4, and an object storage module 5.

The agent component 1 is configured to receive requests such as external data insertion/deletion/query, and return data processing results to the outside. The agent component 1 comprises a plurality of agent modules, such as the agent module 11 and the agent module 12 in fig. 1.

The coordination service component 2 is configured to assign data processing tasks to the respective modules of the execution component 4, and to store state information and the like of the respective modules. The coordination service component 2 may include a coordination management module 21, a coordination query module 22, a coordination data module 23, a coordination index module 24, and a metadata storage module 25 that stores information such as the status of the execution component 4.

The message storage component 3 is configured to receive a Data Manipulation Language (DML) command corresponding to a request, such as data addition/insertion/deletion/query, sent by the agent component 1, and temporarily store the DML command in log form in the log storage module 31 of the message storage component 3.

The execution component 4 is configured to execute the data processing task assigned by the coordination service component 2 and the DML command initiated by an agent module of the agent component 1, and to write the data stored in the log storage module 31 into the object storage module 5 in the form of a binary log (binlog). The execution component 4 is further configured to obtain a binlog file stored in the object storage module 5, construct an index for the data in the binlog file, and store the index in the object storage module 5.

The object storage module 5 is used for storing the binlog file and the index constructed by the execution component 4. In some embodiments, a binlog file may be stored in log file 51 in object storage module 5, and an index of scalars/vectors may be stored in index file 53 in object storage module 5, where scalars include numeric data, string data, date data, and boolean data.

The data stored in the object storage module 5 will be described below with reference to fig. 2. Fig. 2 is a schematic structural diagram of a data table in the embodiment of the present application.

As shown in FIG. 2, in some embodiments, the data in the object storage module 5 is stored as a data table 6. The data table 6 includes a plurality of data partitions, such as a data partition 601, a data partition 602, and the like. Each data partition may include multiple data segments, e.g., data partition 601 includes data segment 1 and data segment 2, and data partition 602 includes data segment 3, data segment 4, and data segment 5. Each data segment comprises a × b data fields, wherein each of the a rows of data fields is a data entity, namely corresponds to the same unstructured data, and each of the b columns of data is of the same data type. The data partitions may divide data according to a rule defined by a user, for example, by date or by the location of the user.

For example, taking a face image as the unstructured data, the computer-readable data corresponding to one face image is a data entity. The data entity corresponding to each face image can comprise an integer column field formed by age, a vector column field formed by face data, and a character string field holding a label of the face data, wherein the integer column field and the character string field are scalar column fields. If the computer-readable data corresponding to every 4 face images is stored as a group, then 10 face images are represented as 3 data segments in the data table 6, and each data segment can hold up to 4 × 3 data fields. The label of the face data may be, for example, the name of the person in the face image.
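The arithmetic of this example can be checked with a short sketch. It assumes a segment capacity of 4 data entities and three fields per entity (`age`, `face`, `label`); the function names and the generated sample values are hypothetical.

```python
import math
from typing import Any, Dict, List

Entity = Dict[str, Any]

SEGMENT_CAPACITY = 4  # data entities per data segment, as in the example above


def split_into_segments(entities: List[Entity], capacity: int = SEGMENT_CAPACITY) -> List[List[Entity]]:
    """Group entities into consecutive segments of at most `capacity` entities."""
    return [entities[i:i + capacity] for i in range(0, len(entities), capacity)]


def to_columns(segment: List[Entity]) -> Dict[str, List[Any]]:
    """Column-oriented view of a segment: one list of values per field."""
    return {field: [e[field] for e in segment] for field in segment[0]}


# Ten face images, each with an integer field, a vector field, and a string field.
entities = [
    {"age": 20 + i, "face": [0.1 * i, 0.2 * i], "label": f"person_{i}"}
    for i in range(10)
]

segments = split_into_segments(entities)
first_segment_columns = to_columns(segments[0])
```

With 10 entities and a capacity of 4, the segments hold 4, 4, and 2 entities, so only the first two contain the full 4 × 3 data fields.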

In some embodiments, each data segment includes a × b data fields, where each of the b columns of data fields is a data entity, i.e., corresponds to the same unstructured data, and each of the a rows of data is of the same data type.

With continued reference to FIG. 1, in some embodiments, the execution component 4 includes at least one query module 41, at least one data module 42, and at least one indexing module 43.

The data module 42 is configured to receive data stored in the log storage module 31, and write the data into the object storage module 5 (i.e., the log file 51) in binary log (binlog) form. If the data are stored in different field columns according to different data types, the log file may include a scalar column and a vector column.

The query module 41 is configured to receive query operation data stored in the log storage module 31, load the log file 51 in the object storage module 5 into the memory of the query module 41, and then complete the data query based on the query operation data. The query module 41 may implement search, hybrid search, and query functions.

It will be appreciated that a search performs a neighbor search on the vector column field and returns the k data entities that most closely match the query conditions. A hybrid search searches, in a preset search order, at least two columns of data fields, one of which is a vector column field, among the character string type scalar column fields, the vector column fields, and the non-character string type scalar column fields. A query filters the non-string scalar column fields and the string scalar column fields, and returns the data entities whose non-string scalar column fields or string fields meet the query conditions.

For example, assume that each data entity represents a face image, and each data entity includes an integer column field characterizing age, a vector column field characterizing face data, and a string field characterizing name. A search may ask for the face images most similar to a target face; in this case, a neighbor search is performed on the face data, and the data entities corresponding to the k face data most similar to the target face are returned. A hybrid search may ask for the face images that have the name A and are most similar to the target face; in this case, the character strings corresponding to the name are first filtered according to the query condition to obtain the entities named A, the filtering result is then used in the neighbor search of the face data, and the k face images that have the name A and are most similar to the target face are returned. A query may ask for the first 5 columns of data in the data table 6; in this case, the data segments corresponding to the first 5 columns of data in the data table 6 (each data segment includes 3 columns of data fields) are filtered out according to the query condition, and the face images corresponding to the filtered data fields are returned.
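The three operations in this example can be illustrated with a small in-memory sketch. It uses a brute-force Euclidean distance scan as a stand-in for the neighbor search, which is an assumption of this sketch rather than the index the system would use; the function and field names are hypothetical.

```python
import math
from typing import Any, Callable, Dict, List, Optional, Sequence

Entity = Dict[str, Any]
Predicate = Callable[[Entity], bool]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query(entities: List[Entity], condition: Predicate) -> List[Entity]:
    """Query: filter on scalar fields (string or non-string) only."""
    return [e for e in entities if condition(e)]


def search(entities: List[Entity], target: Sequence[float], k: int,
           condition: Optional[Predicate] = None) -> List[Entity]:
    """Search: return the k entities whose face vector is closest to target.

    When a condition is given, the scalar filter runs first and its result
    takes part in the neighbor search, which corresponds to a hybrid search.
    """
    candidates = entities if condition is None else query(entities, condition)
    return sorted(candidates, key=lambda e: distance(e["face"], target))[:k]


entities = [
    {"id": 1, "name": "A", "age": 30, "face": [0.0, 0.0]},
    {"id": 2, "name": "B", "age": 25, "face": [0.1, 0.0]},
    {"id": 3, "name": "A", "age": 41, "face": [1.0, 1.0]},
    {"id": 4, "name": "A", "age": 19, "face": [0.2, 0.1]},
]
target = [0.0, 0.0]

plain = search(entities, target, k=2)
hybrid = search(entities, target, k=2, condition=lambda e: e["name"] == "A")
older = query(entities, lambda e: e["age"] > 28)
```

The plain search may return entities with any name, whereas the hybrid search only ranks entities that passed the string filter.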

As described in the foregoing background, existing data processing methods have limited query capability for unstructured data. To solve this problem, the present application discloses a data processing method that can be applied to a data query system, in which a data table stores character string data and an index information set corresponding to the character string data of each data segment. Specifically, the method comprises the following steps: receiving a data query request; determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset, wherein the data subsets comprise a plurality of data entities, and part of the data entities in the plurality of data entities comprise first data of a character string type and second data of a non-character string type; the index information set at least comprises first information used for representing the corresponding relation between the first data in the data subset and the character string index and second information used for representing the corresponding relation between the character string index in the data subset and the data entity; and determining, from the first information and the second information of the plurality of index information sets of the target data set, target index information matched with the data query request and a target data entity corresponding to the target index information.

In an embodiment of the present application, the data query request may be a request received from outside the system for querying a data entity, and the data entity includes at least one of structured data and unstructured data. Structured data is stored in a database and can be logically expressed as a two-dimensional table structure in which each item has a specific meaning, such as integer data or character string data. Unstructured data refers to data whose structure is irregular, that has no uniform predefined data model, and that is not conveniently represented by a two-dimensional logical table in a database. Unstructured data includes pictures, video, audio, natural language, and the like, and accounts for 80% of the total amount of all data. Unstructured data may be processed by converting it into vector data through various Artificial Intelligence (AI) or Machine Learning (ML) models.

It is to be appreciated that in some embodiments, the target data set is the data table 6 in fig. 2, the data subsets are the data segments in fig. 2, such as data segment 1 through data segment 5, each data entity corresponds to one item of unstructured data, the first data and the second data can be data fields in the data segments, and the first data and the second data are stored in binlog form.

It is to be understood that the second data of the non-character string type may be a scalar field of the non-character string type, such as an integer field and a floating point field, or may be a vector field, such as a floating point vector field, and the like, which is not limited in this application.

It is understood that the index information set includes index information of data of a character string type (hereinafter, character string data) and index information of data of a non-character string type.

According to the data processing method provided by the embodiment of the application, the storage of the character string data is added in the target data set, and the first information for representing the corresponding relation between the first data in the data subset and the character string index is constructed corresponding to the character string data. When the data entity is queried, the query of vector data and non-character string type scalar data can be realized, the query of character string data can be realized based on the first information, and the queryable data type of the data entity is expanded. Meanwhile, the query of the character string data and the non-character string data can be carried out simultaneously, and the query capability of the unstructured data is enriched.

In addition, the index information set in the embodiment of the present application further includes second information used for representing a correspondence between a string index in the data subset and a data entity, and based on the second information, a query on a scalar type string field column can be realized, and an index can be constructed on scalar type data.
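To illustrate how the first information and the second information might work together at query time, the sketch below resolves a prefix condition across several data subsets. It assumes the first information is a dictionary tree held as nested dictionaries and the second information is a mapping from string index to row offsets; `END`, `build_index`, `indexes_with_prefix`, and `prefix_query` are hypothetical names.

```python
from typing import Any, Dict, List, Tuple

END = "$index"  # hypothetical marker key; longer than one character, so it never clashes with a character key

Trie = Dict[str, Any]
SegmentIndex = Tuple[Trie, Dict[int, List[int]]]


def build_index(strings: List[str]) -> SegmentIndex:
    """Build (string -> string index) as a trie and (string index -> rows) as a dict."""
    trie: Trie = {}
    index_to_rows: Dict[int, List[int]] = {}
    for row, text in enumerate(strings):
        node = trie
        for ch in text:
            node = node.setdefault(ch, {})
        # A new string receives the next unused index; a repeated string keeps its index.
        idx = node.setdefault(END, len(index_to_rows))
        index_to_rows.setdefault(idx, []).append(row)
    return trie, index_to_rows


def indexes_with_prefix(trie: Trie, head: str) -> List[int]:
    """Collect the string indexes of all stored strings that start with `head`."""
    node = trie
    for ch in head:
        node = node.get(ch)
        if node is None:
            return []
    found: List[int] = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def prefix_query(segments: Dict[int, SegmentIndex], head: str) -> List[Tuple[int, int]]:
    """Return (segment id, row offset) pairs whose string field starts with `head`."""
    hits: List[Tuple[int, int]] = []
    for segment_id, (trie, index_to_rows) in segments.items():
        for idx in indexes_with_prefix(trie, head):
            hits.extend((segment_id, row) for row in index_to_rows[idx])
    return sorted(hits)


segments = {
    1: build_index(["ann", "bob", "anna", "ann"]),
    2: build_index(["andy", "carl"]),
}
hits = prefix_query(segments, "an")
```

Keeping one index information set per data subset means each subset can be searched independently and the per-subset results merged, as done in `prefix_query`.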

In order to support the query of the string data, the data processing method provided in the embodiment of the present application needs to complete the storage of the string data and the construction of the index. The storage process of data and the construction process of index in the embodiment of the present application are described below with reference to fig. 3.

Fig. 3 is a schematic flow chart illustrating data storage and index construction in the embodiment of the present application.

As shown in FIG. 3, the process of data storage and index construction includes:

301: a data insertion request is received. The data insertion request comprises a character string set to be inserted, and the character string set comprises a plurality of character string data. The character string set to be inserted is character string type scalar data.

It can be understood that, when the agent module in the data query system receives the data insertion request, the agent module generates a data insertion command and data to be inserted according to the data insertion request, and stores the data insertion command and the data to be inserted in the log storage module 31 in the form of a log.

In some embodiments, non-string data, such as non-string type scalar data, vector data, etc., may also be included in the data insertion request. For example, if the data insertion request is to insert a face image, the data insertion request may include integer scalar data representing an age, vector data representing face data, and character string data representing a name.

302: in response to the data insertion request, the plurality of character string data are respectively written into the corresponding data subsets. Each data subset comprises at least one data entity, and at least one data entity comprises at least one character string data.

It is understood that the data subset includes at least one data entity corresponding to the unstructured data, and each data entity includes a character string data as a character string field of one data entity. It is understood that after the data module 42 in the data query system receives the data insertion task generated by the coordination service component 2, the data module 42 will read the data to be inserted from the log storage module 31 and store the data to be inserted and the data insertion operation in the form of binlog into the object storage module 5.

In the embodiment of the present application, the storage space occupied by each data subset is preset. In some embodiments, if the data entities corresponding to the character string data to be inserted do not exist in the data table, the plurality of character string data are written into the corresponding data subsets sequentially: writing starts from the data subset that can currently accept data, each character string is written into the string field of the corresponding data entity in that data subset, and once the current data subset reaches its storage upper limit, the remaining character string data are written into the next data subset. For example, taking a storage upper limit of 4 data entities per data subset and 10 images to be inserted, if the data entities corresponding to the 10 images do not exist in the data table 6, the first data subset that has not reached its storage upper limit is located, the character string data of the 10 images are written into it in order, and each time a data subset reaches the storage upper limit of 4, writing continues in the next data subset.
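The sequential write described above can be sketched as follows, assuming each data subset is modeled as a plain list of strings with an upper limit of 4 entries. `append_strings` is a hypothetical helper, not a function named in the application.

```python
from typing import List


def append_strings(segments: List[List[str]], new_strings: List[str], capacity: int = 4) -> None:
    """Append strings in order, filling the last non-full segment before opening a new one."""
    for text in new_strings:
        if not segments or len(segments[-1]) >= capacity:
            # The current segment has reached its upper limit (or none exists yet).
            segments.append([])
        segments[-1].append(text)


# Case 1: an empty table receives 10 strings.
empty_table: List[List[str]] = []
append_strings(empty_table, [f"name_{i}" for i in range(10)])

# Case 2: the last existing segment already holds 3 of its 4 entries.
partial_table: List[List[str]] = [["x0", "x1", "x2"]]
append_strings(partial_table, [f"name_{i}" for i in range(10)])
```

In the second case the first new string tops up the partially filled segment, and the remaining nine are spread over the following segments.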

In other embodiments, if the data entity corresponding to the character string data to be inserted already exists in the data table, the plurality of character string data are written into the corresponding data subsets, that is, the query module 41 performs data query to determine the position of the data subset corresponding to the data entity in the data table 6, and then the plurality of character string data are sequentially inserted into the character string in the corresponding data entity. For example, taking the storage upper limit of the data subset as 4 data entities and 10 images to be inserted as an example, if the data entities corresponding to the 10 images to be inserted already exist in the data table 6, the positions of the data entities of the 10 images are determined, and then the character string data of the 10 images are sequentially written into the character strings in the corresponding data entities.

In some embodiments, the storage space occupied by different data subsets may be the same or different, and this application does not limit this.

In some embodiments, non-string data may also be included in each data entity as non-string fields of the data entity, such as non-string scalar fields and vector fields including integer scalar fields, floating point scalar fields, and the like.

In some embodiments, 302 may specifically include: responding to a data insertion request, dividing a plurality of character string data into M character string subsets, and sending the character string subsets to corresponding M data nodes, wherein M is more than or equal to 2; and respectively writing each character string data in the character string subsets of the M data nodes into the corresponding data subsets.

It will be appreciated that the data node is the data module 42 in fig. 1. In the embodiment of the present application, the plurality of character string data is divided into M small-batch data, that is, into M character string subsets, and then when data storage is performed, the M character string subsets may be simultaneously sent to the corresponding data modules 42, and data storage is completed. 302 will be further described in conjunction with fig. 4 and 5.

Fig. 4 and 5 are flowcharts illustrating a data storage process in the embodiment of the present application.

As shown in fig. 4, in some embodiments, the data to be inserted includes an ID of the data entity, string data, and vector data. The ID therein may also be understood as the primary key hash value of the data entity.

Specifically, the broker module divides the data entity into two small batches of data (batch data 1 and batch data 2) according to the primary key hash value of the data entity and stores them into corresponding data nodes (nodes s1 and s2) in the log storage module 31, each of which may be served by one query module or data module. The node s1 corresponds to the query module 411 and the data module 421, and the node s2 corresponds to the query module 412 and the data module 422.
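As a rough illustration of dividing data entities into small batches by primary key hash, the following sketch assigns each entity to one of M batches. The choice of SHA-256 as the hash function, the dictionary layout of an entity, and the names `shard_of` and `split_into_batches` are assumptions made for illustration only; the present application does not specify them.

```python
import hashlib
from typing import Dict, List


def shard_of(primary_key: int, num_shards: int) -> int:
    """Map a primary key to a shard number using a stable hash (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def split_into_batches(entities: List[Dict], num_shards: int) -> List[List[Dict]]:
    """Group entities into num_shards batches according to the hash of their ID."""
    batches: List[List[Dict]] = [[] for _ in range(num_shards)]
    for entity in entities:
        batches[shard_of(entity["id"], num_shards)].append(entity)
    return batches


# Hypothetical entities with an ID, a string field and a vector field.
entities = [{"id": i, "name": f"person{i}", "vector": [0.0, float(i)]} for i in range(10)]
batches = split_into_batches(entities, num_shards=2)
print([len(b) for b in batches])
```

Because the shard number depends only on the primary key, the same entity is always routed to the same data node, which is what allows the batches to be persisted independently.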

The data module 421 may obtain batch data 1 stored on node s1, the data module 422 may obtain batch data 2 stored on node s2, and the data module 421 and the data module 422 may simultaneously persist batch data 1 and batch data 2 into the object storage module 5.

As shown in fig. 5, in some embodiments, data of different nodes may be transmitted to the data module through different data transmission channels for persistent storage of data.

For example, batch data 1 of node s1 may be transmitted to data module 421 through channel A, and data module 421 writes batch data 1 into the corresponding plurality of data segments 61. Meanwhile, the batch data 2 of the node s2 may be transmitted to the data module 422 through the channel B, and the data module 422 writes the batch data 2 into the corresponding plurality of data segments 61.

It will be appreciated that each row of data in the data segment 61 represents one data entity, i.e., a corresponding piece of unstructured data. A plurality of fields, such as an id field (integer scalar field), a string field (string scalar field), and a vector field, may be included in each data entity. When the data size of one data segment reaches the upper storage limit, the data entity is written into the next data segment 61. The upper memory limit may be, for example, 512 MB.

303: at least one string data in the subset of data is obtained.

It is to be understood that obtaining at least one string data of a data subset means obtaining the string data in that data subset. In some embodiments, the retrieved string data may include a string field and a row offset of the string field relative to the first row data entity of each data subset.

For example, if the data subset includes the string field column S as shown in fig. 6, the acquired string data includes the string field column S and the corresponding row offset list R. Wherein the row offset list consists of the row offsets of each string field in the string field column S relative to the first row string field of the data subset.

It is understood that the indexing module 43 loads at least one character string data of the data subset from the object storage module 5 and stores it in the memory of the indexing module 43.

304: according to the at least one character string data, third information of the data subset is constructed, wherein the third information is used for representing the corresponding relation between the at least one character string data in the data subset and the character string index.

In some embodiments, the third information is dictionary tree information. It is understood that the indexing module 43 constructs the dictionary tree information according to the string data of the data subset loaded in the memory.

It can be understood that the dictionary tree information regards each character string field in the data subset as a character sequence, and a top-down tree structure is then constructed according to the order of the characters in each sequence. Each edge in the tree structure corresponds to a character, and each child node of the dictionary tree may be represented as an index of the character sequence formed by the characters on the path from the root node to that child node.

For example, as shown in fig. 6, for the string field column S, a set of string fields (a, ab, abc, ac, acd, acc, accd, acdd) is included. The indexing module 43 constructs a dictionary tree T (i.e., the dictionary tree information above) from the set of string fields, with the character "a" as the root node, with the remaining characters in the data subset as edges, and with the tree ID (i.e., the string index) as child nodes. Where the edges of the dictionary tree T include (c, b, c, b, dd, c, d) and the tree ID includes 0 through 7. The character sequence composed of the characters stored in the edge connecting the root node and any one of the child nodes in the dictionary tree T represents one of the character string fields in the character string field column S.
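A minimal sketch of building a dictionary tree over the string field column S is given below. It is a simplification under stated assumptions: it uses a plain character-per-edge trie with an empty root rather than the layout drawn in fig. 6, and it assigns tree IDs in insertion order, so the IDs it produces need not match those in the figure. The names `TrieNode`, `build_trie` and `lookup` are hypothetical.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on nodes where a stored string ends (assumed numbering: insertion order).
        self.tree_id: Optional[int] = None


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string character by character and assign a tree ID to each distinct string."""
    root = TrieNode()
    next_id = 0
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        if node.tree_id is None:
            node.tree_id = next_id
            next_id += 1
    return root


def lookup(root: TrieNode, s: str) -> Optional[int]:
    """Return the tree ID of an exactly matching string, or None if it is not stored."""
    node = root
    for ch in s:
        child = node.children.get(ch)
        if child is None:
            return None
        node = child
    return node.tree_id


S = ["a", "ab", "abc", "ac", "acd", "acc", "accd", "acdd"]
T = build_trie(S)
print(lookup(T, "acdd"), lookup(T, "b"))  # 7 None
```

Strings that share a prefix share the corresponding path in the tree, which is why a prefix or exact lookup only needs to walk as many edges as the query string has characters.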

305: according to each character string data and the corresponding character string index in the first information and the data entity corresponding to each character string data, fourth information of the data subset is constructed, wherein the fourth information is used for representing the corresponding relation between the character string index and the data entity in the data subset.

It will be appreciated that in some embodiments, the indexing module 43 constructs fourth information for the index and data entities based on the constructed dictionary tree information and the row offsets of the respective string fields relative to the first row of data entities.

In some embodiments, the fourth information (hereinafter, mapping) may be characterized as a correspondence of a tree ID in the dictionary tree information to the data entity. For example, the mapping relationship may be characterized as an index ID in FIG. 6, which is equal to the number of rows of the string field S. Each row of data in the index ID represents a tree ID of the row of the character string field in the dictionary tree T, the row offset of the row where the tree ID is located relative to the first tree ID in the index ID represents the row offset of the character string field corresponding to the tree ID relative to the first character string field in the character string field column.
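The mapping between tree IDs and rows can be sketched as a list with one entry per row, where the position in the list is the row offset. In the sketch below a dictionary stands in for the dictionary tree, tree IDs are assigned in insertion order, and the names `build_mapping` and `rows_for` are hypothetical; none of these details are prescribed by the present application.

```python
from typing import Dict, List, Tuple


def build_mapping(column: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Return (string -> tree ID, per-row tree IDs); the list index is the row offset."""
    tree_ids: Dict[str, int] = {}  # stand-in for the dictionary tree
    mapping: List[int] = []
    for value in column:
        if value not in tree_ids:
            tree_ids[value] = len(tree_ids)
        mapping.append(tree_ids[value])
    return tree_ids, mapping


def rows_for(mapping: List[int], tree_id: int) -> List[int]:
    """Row offsets of every data entity whose string field has the given tree ID."""
    return [offset for offset, tid in enumerate(mapping) if tid == tree_id]


# Hypothetical column with a repeated value to show that several rows can share one tree ID.
column = ["ab", "a", "ab", "ac"]
tree_ids, mapping = build_mapping(column)
print(mapping)               # [0, 1, 0, 2]
print(rows_for(mapping, 0))  # [0, 2]
```

The mapping has exactly as many entries as the string field column has rows, so a tree ID found in the dictionary tree can be translated back to row offsets without touching the raw strings.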

In some embodiments, after the data module 42 performs the above steps 301 and 302 to complete the writing of one data subset, the index module 43 may perform the steps 303 to 305, and the data module 42 may complete the writing of the data of the next data subset.

The process of index building for the data subsets is further described below in conjunction with FIG. 7.

Fig. 7 is a flowchart illustrating an index for constructing string data according to an embodiment of the present application.

In some embodiments, as shown in FIG. 7, data segments 611 and data segments 612 are stored in object storage module 5. The data fragment 611 includes a plurality of data entities, each of which includes two field columns: field column 1 and field column 2, the fields in each field column being stored as data fields in the form of binlogs, e.g., Log 1 and Log 2 in field column 1, Log 3 and Log 4 in field column 2. The data fragment 612 includes a plurality of data entities, each of which includes two field columns: field column 3 and field column 4, the fields in each field column being stored as data fields in binlog form, e.g., log 5 in field column 3, log 6 in field column 4.

The indexing module 43 loads the string field columns (assuming that the field column 2 and the field column 4 are string field columns) in the data segments 611 and 612 into the memory of the indexing module 43, and executes the index construction task 1 and the index construction task 2 corresponding to the string field columns of the data segments 611 and 612. Specifically, indexing module 43 performs a build index task 1, builds an index for field column 2, and generates dictionary tree information 5311 and mapping information 5321. Indexing module 43 performs a build index task 2 to build an index for field columns 4 and to generate dictionary tree information 5312 and mapping information 5322.

After the index building is completed, the indexing module 43 stores the generated dictionary tree information and mapping information into the index file 53 of the data segment corresponding to the object storage module 5. Specifically, the dictionary tree information 5311 and the mapping information 5321 are stored in the index file corresponding to the data fragment 611 in the object storage module 5, and the dictionary tree information 5312 and the mapping information 5322 are stored in the index file corresponding to the data fragment 612 in the object storage module 5.

It can be understood that, in the embodiment of the present application, corresponding dictionary tree information and mapping information are constructed for each data subset. To illustrate the positive effects of this index construction method more clearly, the present application takes as an example a machine configured with a 12-core Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz and 32 GB of storage space, and tests the time overhead and space overhead of index construction for character string data of different data volumes. The test results are shown in table 1 below.

TABLE 1

It can be seen from table 1 that when the number of rows of character strings is kept constant and the length of the character strings is increased, the index building time, the dictionary tree loading time, the index data amount, the character string data amount, and the prefix query time all increase greatly. When both the number of rows and the length of the character strings increase, the index construction time, index data volume, character string data volume and prefix query time all increase greatly, and index construction may fail when the number of rows is too large or the character strings are too long.

It can be understood that as the length and the number of rows of the string data increase, the time overhead and the space overhead of the string index increase greatly, and the index may fail to be constructed.

In the embodiment of the application, the target data set is divided into the plurality of data subsets, and then the dictionary tree information and the mapping information are respectively constructed for the character string fields in each data subset, so that the construction time of the index information set can be shortened, the memory space occupied by each dictionary tree information is reduced, and the possibility of index construction failure is reduced.
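The idea of building one small index per data subset, rather than one large index over the whole target data set, can be sketched as follows. A dictionary stands in for the dictionary tree and mapping information of each subset, the subset size of 4 rows is an arbitrary illustrative value, and the names `chunk`, `build_subset_index` and `prefix_query` are hypothetical.

```python
from typing import Dict, List, Tuple


def chunk(rows: List[str], size: int) -> List[List[str]]:
    """Split the rows of the target data set into subsets of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def build_subset_index(subset: List[str]) -> Dict[str, List[int]]:
    """Per-subset index: string -> row offsets relative to the first row of this subset."""
    index: Dict[str, List[int]] = {}
    for offset, value in enumerate(subset):
        index.setdefault(value, []).append(offset)
    return index


def prefix_query(indexes: List[Dict[str, List[int]]], prefix: str) -> List[Tuple[int, int]]:
    """Query every subset index and merge the hits as (subset number, row offset) pairs."""
    hits: List[Tuple[int, int]] = []
    for subset_no, index in enumerate(indexes):
        for value, offsets in index.items():
            if value.startswith(prefix):
                hits.extend((subset_no, off) for off in offsets)
    return sorted(hits)


rows = ["a", "ab", "abc", "ac", "acd", "acc", "accd", "acdd"]
indexes = [build_subset_index(s) for s in chunk(rows, size=4)]
print(prefix_query(indexes, "ac"))  # [(0, 3), (1, 0), (1, 1), (1, 2), (1, 3)]
```

Each index only ever covers one subset, so its construction time and memory footprint are bounded by the subset size regardless of how large the target data set grows.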

In some embodiments, the indexing module 43 may further obtain vector field columns in the data subset, and construct an index for the vector field columns according to a preset index type. The index type can be, for example, a flat index (FLAT), an inverted file index with product quantization (IVF_PQ), etc.

The data query process in the embodiment of the present application is further described below with reference to fig. 8.

Fig. 8 is a flowchart illustrating a data query in an embodiment of the present application.

As shown in FIG. 8, in some embodiments, the process of data querying includes:

801: a data query request is received.

It can be understood that after the agent module receives the external data query request, the agent module generates a data query command and a data query condition according to the data query request, and the coordination processing component 2 can generate a corresponding data query task. The agent module stores the data query command and the data query condition in the log storage module 31 in the form of a log. The query module 41 receives the data query task generated by the coordination processing component 2 and acquires the data query condition from the log storage module 31.

In some embodiments, the data query condition includes a query of string data. For example, the data query condition in the data query request is to query a face image with the name a. In other embodiments, the data query condition may also include a query of non-string data, such as non-string scalar data including integer scalar fields, floating point scalar fields, and the like, vector data, and the like. For example, the data query request is to query the face data closest to the target face data.

802: the method includes determining a target data set corresponding to a query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset.

Wherein the data subset includes a plurality of data entities, and a portion of the plurality of data entities includes first data of a string type and second data of a non-string type. And, the index information set at least includes first information for characterizing a correspondence between first data in the data subset and the string index, and second information (i.e., mapping information in the foregoing, hereinafter referred to as mapping information) for characterizing a correspondence between the string index and the data entity in the data subset. In some embodiments, the first information may be characterized as dictionary tree information.

It is understood that the query module 41 may load the data subset of the target data set from the object storage module 5 into the memory of the query module 41 when receiving the data query task.

In some embodiments, a data subset has two states: a growing state and a sealed state. Accordingly, the data subsets loaded into the memory by the query module 41 are of two types: growing data subsets and sealed data subsets. When the data amount of a growing data subset exceeds a preset storage upper limit, or no data is written to it within a preset time period, the growing data subset is converted into a sealed data subset. Queries on a growing data subset may use a brute-force approach. A sealed data subset does not support data writing and only supports data deletion; the index module 43 acquires the character string data in the sealed data subset to construct an index, and the query module can query the sealed data subset by using the data processing method provided in the embodiment of the application.
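The state transition of a data subset can be sketched as a small state machine. The sketch makes several assumptions that the present application does not fix: the upper limit is counted in rows, the subset is sealed as soon as the limit is reached, and the current time is passed in explicitly so that the behavior is deterministic. The class name `DataSubset` and its methods are hypothetical.

```python
from typing import List


class DataSubset:
    def __init__(self, max_rows: int, idle_seconds: float, now: float = 0.0) -> None:
        self.max_rows = max_rows
        self.idle_seconds = idle_seconds
        self.rows: List[str] = []
        self.state = "growing"
        self.last_write = now

    def write(self, row: str, now: float) -> None:
        """Append a row; sealed subsets reject writes."""
        if self.state == "sealed":
            raise RuntimeError("sealed data subsets do not accept writes")
        self.rows.append(row)
        self.last_write = now
        if len(self.rows) >= self.max_rows:
            self.state = "sealed"  # assumed rule: seal once the upper limit is reached

    def tick(self, now: float) -> None:
        """Seal a growing subset that has received no writes for idle_seconds."""
        if self.state == "growing" and now - self.last_write >= self.idle_seconds:
            self.state = "sealed"


full = DataSubset(max_rows=2, idle_seconds=10.0)
full.write("a", now=0.0)
full.write("b", now=1.0)
idle = DataSubset(max_rows=5, idle_seconds=10.0)
idle.write("a", now=0.0)
idle.tick(now=10.0)
print(full.state, idle.state)  # sealed sealed
```

Only after a subset is sealed is its content stable, which is the point at which building a dictionary tree index over it becomes worthwhile.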

In some embodiments, the data loaded into its memory by query module 41 also includes a set of index information for the subset of data. Further, the dictionary tree information in the index file 53 loaded by the query module 41 may support query conditions such as "==", "!=", ">", "<", prefix lookup, and exact lookup.

In some embodiments, the query condition of the query module 41 may be represented as a boolean expression, for example, the query condition is string field > "abc". In some embodiments, the query condition may also be expressed as specifying a string prefix, e.g., the query condition is that the prefix of the string field is "abc". In some embodiments, the query condition of the query module may also be represented as a filter condition of the non-string field column, e.g., the query condition is the selection of the first 100 columns of the non-string scalar field column.
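As an illustration of how a string query condition might be turned into a test applied to a string field, the following sketch maps an operator name to a comparison function. The operator spellings, the `"prefix"` keyword and the function `make_predicate` are hypothetical; string comparison here follows Python's lexicographic ordering, which is an assumption about how ">" and "<" are meant to behave.

```python
import operator
from typing import Callable, Dict

# Assumed operator table; the exact operator set is an illustration only.
_OPS: Dict[str, Callable[[str, str], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "prefix": lambda field, value: field.startswith(value),
}


def make_predicate(op: str, value: str) -> Callable[[str], bool]:
    """Build a one-argument test such as: string field > "abc"."""
    compare = _OPS[op]
    return lambda field: compare(field, value)


column = ["a", "ab", "abc", "ac", "acd"]
greater = [s for s in column if make_predicate(">", "abc")(s)]
with_prefix = [s for s in column if make_predicate("prefix", "ab")(s)]
print(greater)      # ['ac', 'acd']
print(with_prefix)  # ['ab', 'abc']
```

This sketch scans the column directly for clarity; with a dictionary tree index the same conditions would instead be evaluated against the strings stored in the tree.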

In some embodiments, the index information set includes dictionary tree information characterizing the correspondence of the string field columns in the data subset to the string indices and mapping information characterizing the correspondence of the string indices in the data subset to the data entities. For example, the dictionary tree information in the index information set may be represented as the dictionary tree T shown in fig. 6, and the mapping information may be represented as the index ID shown in fig. 6. Further, in some embodiments, the mapping relationship may also represent a correspondence of a vector index corresponding to the vector field and the data entity.

803: determining, in the first information and the second information of the plurality of index information sets of the target data set, target index information matched with the data query request and a target data entity corresponding to the target index information.

It is understood that the query module 41 may determine the target index information matching the data query condition in the index information set based on the loaded index information set.

In some embodiments, the data query request is a search, and the corresponding query condition is a query to the vector field. In other embodiments, the data query request is a hybrid search, and the corresponding query conditions include queries for at least two of a string field column (i.e., a string-type scalar column), a vector field column, and a non-string-type scalar field column. In other embodiments, the data query request is a query that corresponds to a query condition that is a query of a scalar field column (including a string field column and a non-string scalar field column). The data query request and query process will be further described below in conjunction with fig. 9 and 10.

In some embodiments, the data query request is a query on a string field column, and the query module 41 determines a target data entity corresponding to the target string index information according to the mapping information in the index information set. For example, the query module 41 first filters out, from the dictionary tree information, the tree IDs that match the query condition in the data query request. Then, according to the mapping relationship, which characterizes the row offset of the character string field corresponding to each tree ID relative to the first character string field in the data subset, the query module 41 determines the row offsets of the character string fields corresponding to the filtered tree IDs, and thereby determines the data entities meeting the query condition.

In some embodiments, the data query request is a query to the vector field column, and the query module 41 may determine the target data entity corresponding to the target vector index information according to the mapping information in the index information set that characterizes the correspondence between the vector index corresponding to the vector field and the data entity.

In some embodiments, the data entities that meet the query condition queried by the query module 41 may be used as the query result and returned to the external device that initiated the data query request through the proxy module.

Fig. 9 is a schematic view of an interaction process of data query according to an embodiment of the present application.

As shown in fig. 9, in some embodiments, the data query request is a query, the query condition 901 includes string data, and the query module 41 may determine an index list 902 formed by string indexes matching the query condition 901 according to the dictionary tree information 531 loaded from the object storage module 5, where the index list 902 is target index information. Further, the query module 41 generates a bitmap 903 representing whether each row of data in the data subset meets the query condition according to the mapping information 532 loaded from the object storage module 5. In the bitmap 903, a value of 1 indicates that the string field corresponding to the value meets the query condition, and the data entity corresponding to the value can be returned as a query result; a value of 0 indicates that the string field corresponding to the value does not meet the query condition, and the data entity corresponding to the value cannot be returned as the query result.

For example, the dictionary tree information in the index information set may be represented as the dictionary tree T shown in fig. 6, the mapping information may be represented as the index ID shown in fig. 6, and the query condition is data with a string prefix of "ac". The generated index list 902 may then be (1,4,3,7,5), and the bitmap 903 generated according to the mapping information 532 may be represented as (0,0,0,1,1,1,1,1), that is, the data entities in the string field column S that meet the query condition are the row 4 to row 8 data entities. The query module 41 may return the row 4 to row 8 data entities in the data subset, through the proxy module, to the external device that initiated the data query request.
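The prefix query can be traced end to end with a small sketch: find the tree IDs under the prefix, then mark the matching rows in a bitmap with one bit per row. The sketch uses a nested-dictionary trie and assigns tree IDs in insertion order, so the IDs differ from those drawn in fig. 6; the marker key `END` and the function names are hypothetical.

```python
from typing import List, Tuple

END = "$id"  # hypothetical marker key; it cannot collide with a single-character edge


def build(column: List[str]) -> Tuple[dict, List[int]]:
    """Build a nested-dict trie and the per-row mapping of tree IDs."""
    trie: dict = {}
    mapping: List[int] = []
    next_id = 0
    for value in column:
        node = trie
        for ch in value:
            node = node.setdefault(ch, {})
        if END not in node:
            node[END] = next_id
            next_id += 1
        mapping.append(node[END])
    return trie, mapping


def ids_with_prefix(trie: dict, prefix: str) -> List[int]:
    """Collect the tree IDs of all stored strings that start with the prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    ids: List[int] = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                ids.append(child)
            else:
                stack.append(child)
    return sorted(ids)


def to_bitmap(mapping: List[int], ids: List[int]) -> List[int]:
    """One bit per row: 1 if the row's tree ID is in the index list, else 0."""
    wanted = set(ids)
    return [1 if tid in wanted else 0 for tid in mapping]


S = ["a", "ab", "abc", "ac", "acd", "acc", "accd", "acdd"]
trie, mapping = build(S)
index_list = ids_with_prefix(trie, "ac")
bitmap = to_bitmap(mapping, index_list)
print(index_list)  # [3, 4, 5, 6, 7]
print(bitmap)      # [0, 0, 0, 1, 1, 1, 1, 1]
```

With this numbering, the five strings starting with "ac" occupy the fourth to eighth rows of S, and those are exactly the rows whose bit is set.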

In some embodiments, the data query request in fig. 9 is a hybrid search, the query condition 901 may further include vector data, and after obtaining the bitmap 903, the query module 41 may select a vector field of a data entity to which a string field corresponding to a value of 1 belongs, participate in a neighbor query of the vector data, and return data entities corresponding to k vector fields that meet the query condition 901.
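A hybrid search of this kind can be sketched as a filtered nearest-neighbor search: only rows whose bit is 1 take part in the distance computation. The sketch assumes Euclidean distance and an exhaustive scan in place of a vector index, and the function name `filtered_knn` and the sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def filtered_knn(vectors: List[Sequence[float]], bitmap: List[int],
                 query: Sequence[float], k: int) -> List[int]:
    """Return the row offsets of the k nearest vectors among rows whose bit is 1."""
    candidates = [
        (math.dist(vec, query), row)
        for row, (vec, bit) in enumerate(zip(vectors, bitmap))
        if bit == 1
    ]
    candidates.sort()  # nearest first; ties are broken by row offset
    return [row for _, row in candidates[:k]]


vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.1, 0.0), (2.0, 2.0)]
bitmap = [0, 1, 1, 1, 1]  # row 0 failed the string condition and is skipped
nearest = filtered_knn(vectors, bitmap, query=(0.0, 0.0), k=2)
print(nearest)  # [3, 1]
```

Row 0 is the closest vector to the query but is excluded by the bitmap, which shows how the string condition restricts the candidates before the neighbor query runs.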

It can be understood that the query condition corresponding to the hybrid search may include at least two of a plurality of scalar data and vector data, and is not limited to the above example, and the details of the present application are not repeated herein.

In other embodiments, the data query request is a search, the query condition 901 includes vector data, and the query module 41 may determine, according to the index information of the vector data loaded from the object storage module 5, an index list composed of vector indexes matching the query condition through neighbor search, where the index list is target index information. Further, the query module 41 generates a bitmap representing whether each row of data in the vector field column in the data subset meets the query condition according to the mapping information loaded from the object storage module 5. In the bitmap, the value of 1 indicates that the vector field corresponding to the value meets the query condition, and the data entity corresponding to the value can be returned as the query result; a value of 0 indicates that the vector field corresponding to the value does not meet the query condition, and the data entity corresponding to the value cannot be returned as the query result.

For example, the data entity is a face image, the vector field represents face data, and the query condition is to search for face data matching the target face data. The query module 41 obtains index information of the face data in the data subset, determines the k pieces of face data most similar to the target face data through neighbor search, and returns the face images corresponding to the k pieces of face data as a query result to the external device that initiated the data query request.

Fig. 10 is a schematic diagram illustrating an interaction process of another data query according to an embodiment of the present application.

As shown in fig. 10, in some embodiments, the data query request is a query, and the data query request includes a query condition 1001. The query condition 1001 includes non-string scalar data, and the query module 41 may determine, based on the data subset loaded from the object storage module 5, a row offset list 1002 consisting of the row offsets of the non-string scalar fields matching the query condition 1001 relative to the first non-string scalar field in the data subset. Further, the query module 41 determines, according to the mapping information 532 loaded from the object storage module 5, the string indexes corresponding to the row offsets in the row offset list 1002, determines, according to the dictionary tree information, the strings 1003 corresponding to the string indexes, and returns the data entities to which the strings 1003 belong as a query result to the external device that initiated the data query request.

For example, the dictionary tree information in the index information set may be represented as the dictionary tree T shown in fig. 6, the mapping information may be represented as the index ID shown in fig. 6, and the query condition 1001 is that the non-string scalar data >100. The query module 41 first filters the non-string scalar field columns of the data subset by using the query condition 1001, and obtains the row offset list 1002 of the non-string scalar fields that meet the query condition 1001. The query module 41 may then determine the string indexes corresponding to the row offset list based on the mapping information loaded from the object storage module 5, and then determine the strings 1003 corresponding to the string indexes based on the dictionary tree information loaded from the object storage module 5. The query module 41 may return the data entities corresponding to the strings 1003 meeting the query condition 1001 in the data subset, through the agent module, to the external device that initiated the data query request.

It is understood that the query module 41 uses the mapping information to translate the result of a query on a non-string scalar field column into string indexes, and then obtains the strings 1003 corresponding to the string indexes based on the dictionary tree information. The mapping relation can also be understood as an index of a character string type scalar field column, with which the query of the character string type scalar field column can be realized.
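The lookup path described above, from a scalar filter to row offsets, then to string indexes, then to strings, can be sketched as follows. A dictionary from tree ID to string stands in for walking the dictionary tree, and the sample data and the function name `strings_for_scalar_filter` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def strings_for_scalar_filter(
    scalars: List[int],
    mapping: List[int],
    id_to_string: Dict[int, str],
    predicate: Callable[[int], bool],
) -> Tuple[List[int], List[str]]:
    """Filter the scalar column, then resolve each matching row to its string via the mapping."""
    # Step 1: row offsets of the scalar fields that satisfy the condition.
    offsets = [row for row, value in enumerate(scalars) if predicate(value)]
    # Step 2: row offset -> tree ID (mapping), then tree ID -> string (dictionary tree stand-in).
    strings = [id_to_string[mapping[row]] for row in offsets]
    return offsets, strings


scalars = [50, 150, 99, 300]                     # hypothetical non-string scalar column
mapping = [0, 1, 2, 3]                           # per-row tree IDs
id_to_string = {0: "a", 1: "ab", 2: "abc", 3: "ac"}
offsets, strings = strings_for_scalar_filter(scalars, mapping, id_to_string, lambda v: v > 100)
print(offsets)  # [1, 3]
print(strings)  # ['ab', 'ac']
```

The raw string column is never read in this path; the strings are recovered from the index structures alone.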

Fig. 11 shows a schematic structural diagram of an electronic device 1100 according to an embodiment of the present application, and the electronic device 1100 may include a processor 1110, an internal memory 1120, an interface module 1130, a power supply module 1140, and a wireless communication module 1150.

It is to be appreciated that the electronic device 1100 includes, but is not limited to, a cell phone (including a folding screen cell phone), a tablet, a laptop, a desktop, a server, a wearable, a head-mounted display, a mobile email device, a car-mounted device, a portable game player, a portable music player, a reader device, a television having one or more processors embedded therein or coupled thereto, and the like.

It is to be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation to the electronic device 1100. In other embodiments of the present application, electronic device 1100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

Processor 1110 may include one or more processing units, such as: processor 1110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural network processor, among others. The different processing units may be separate devices or may be integrated into one or more processors.

A memory may also be provided in processor 1110 for storing instructions and data. In an embodiment of the present application, the processor 1110 may execute the data processing method of the present application.

Internal memory 1120 may be used to store computer-executable program code, including instructions. The internal memory 1120 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The data storage area may store data (e.g., audio data, a phonebook, etc.) created during use of the electronic device 1100, and the like.

The interface module 1130 may be used to connect an external storage device, such as an external hard disk, to extend the storage capability of the electronic apparatus 1100. The external hard disk communicates with the processor 1110 through the interface module 1130 to implement a data storage function.

The power module 1140 is used to access the power grid and supply power to the processor 1110, the internal memory 1120, and the like.

The wireless communication module 1150 may provide a solution for wireless communication including a Wireless Local Area Network (WLAN) (e.g., a wireless fidelity (Wi-Fi) network), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like, which is applied to the electronic device 1100.

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one example embodiment or technology disclosed herein. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.

The present disclosure also relates to an operating device for performing the method. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), Random Access Memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, Application Specific Integrated Circuits (ASICs), or any type of media suitable for storing electronic instructions, and each may be coupled to a computer system bus. Further, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.

Moreover, the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, the present disclosure is intended to be illustrative, but not limiting, of the scope of the concepts discussed herein.

Claims

1. A data processing method for an electronic device, comprising:

receiving a data query request;

determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset, wherein the data subsets comprise a plurality of data entities, and a part of the data entities in the plurality of data entities comprise first data of a character string type and second data of a non-character string type; and the index information set at least comprises first information for characterizing the correspondence of the first data and the string index in the data subset, and second information for characterizing the correspondence of the string index and the data entity in the data subset;

2. The data processing method according to claim 1, wherein the first information for characterizing the correspondence between the first data in the data subset and the string index is dictionary tree information.

3. The data processing method according to claim 2, wherein the data query request includes third data of a string type;

the determining, from the first information and the second information of multiple index information sets of the target data set, target index information matching the data query request and a target data entity corresponding to the target index information includes:

4. The data processing method according to claim 3, wherein the determining, according to the dictionary tree information, that a target string index corresponding to the third data is the target index information includes:

looking up the third data in the dictionary tree information;

5. The data processing method according to claim 3, wherein the determining, from the first information and the second information of the index information sets of the target data set, target index information that matches the data query request and a target data entity corresponding to the target index information further comprises:

6. The data processing method of claim 1, wherein the data query request further includes fourth data of a non-string type.

7. The data processing method of claim 1, wherein the data query request comprises a query condition, the query condition comprising at least one of:

a Boolean expression;

a prefix matching condition characterizing a character string prefix;

an exact matching condition characterizing a character string.

8. A data processing method for an electronic device, comprising:

9. The data processing method of claim 8, further comprising:

acquiring the at least one character string data in the data subset;

and constructing fourth information of the data subset according to each character string data in the first information, its corresponding character string index, and its corresponding data entity, wherein the fourth information is used for representing the correspondence between the character string indexes and the data entities in the data subset.

10. The data processing method according to claim 8, wherein said writing the plurality of character string data into the corresponding data subsets in response to the data insertion request comprises:

11. An electronic device, comprising:

a memory for storing instructions for execution by one or more processors of the electronic device; and

a processor, being one of the processors of the electronic device, for controlling execution of the data processing method of any one of claims 1 to 7 or the data processing method of any one of claims 8 to 10.

12. A computer-readable storage medium, characterized in that the storage medium has stored thereon instructions which, when executed on a computer, cause the computer to carry out the data processing method of any one of claims 1 to 7 or the data processing method of any one of claims 8 to 10.

13. A computer program product, characterized in that it comprises instructions for implementing the data processing method of any one of claims 1 to 7 or the data processing method of any one of claims 8 to 10.
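The index structure recited in claims 1, 2, and 7 can be illustrated with a minimal sketch. It is not the applicant's implementation; it assumes that each data subset keeps a dictionary tree (trie) mapping character strings to integer string indexes, plus a plain mapping from each string index to the data entities that contain that string. The names `TrieNode`, `SubsetIndex`, and `query`, the integer entity identifiers, and the sequential assignment of string indexes are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}
        # Set only on nodes that terminate a stored string.
        self.string_index = None


class SubsetIndex:
    """Hypothetical index information set for one data subset."""

    def __init__(self):
        # Plays the role of the "first information": string -> string index.
        self.root = TrieNode()
        # Plays the role of the "second information": string index -> entity ids.
        self.entities_by_index = defaultdict(list)
        self._next_index = 0  # assumption: indexes are assigned sequentially

    def insert(self, text, entity_id):
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        if node.string_index is None:
            node.string_index = self._next_index
            self._next_index += 1
        self.entities_by_index[node.string_index].append(entity_id)

    def _walk(self, text):
        # Follow the trie along `text`; return None if the path does not exist.
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def exact_match(self, text):
        node = self._walk(text)
        if node is None or node.string_index is None:
            return []
        return list(self.entities_by_index[node.string_index])

    def prefix_match(self, prefix):
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [start]
        while stack:  # visit every stored string below the prefix node
            node = stack.pop()
            if node.string_index is not None:
                found.extend(self.entities_by_index[node.string_index])
            stack.extend(node.children.values())
        return sorted(found)


def query(subset_indexes, text, prefix=False):
    """Collect matching entity ids across the index sets of all data subsets."""
    results = []
    for index in subset_indexes:
        matches = index.prefix_match(text) if prefix else index.exact_match(text)
        results.extend(matches)
    return results
```

In this sketch a query walks the trie of every data subset to obtain a string index and then resolves that index to data entities, which mirrors the two-step lookup through the first and second information. Filtering on non-string fields and Boolean expressions is left out.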