CN110955659A - Method and system for processing data table - Google Patents

Method and system for processing a data table

Info

Publication number
CN110955659A
CN110955659A
Authority
CN
China
Prior art keywords
field
layer
basic sample
data
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911189241.4A
Other languages
Chinese (zh)
Other versions
CN110955659B (en)
Inventor
秦川
周振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202310273245.0A priority Critical patent/CN116303459A/en
Priority to CN201911189241.4A priority patent/CN110955659B/en
Publication of CN110955659A publication Critical patent/CN110955659A/en
Application granted granted Critical
Publication of CN110955659B publication Critical patent/CN110955659B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and system for processing a data table are provided. The method of processing a data table comprises: acquiring a base sample table and at least one associated data table; and, for each base sample in the base sample table, processing the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample. According to the method and system for processing a data table, data table splicing and feature extraction can be realized using the neural network.

Description

Method and system for processing a data table
Technical Field
The present invention relates generally to the field of data processing, and more particularly, to a method and system for processing a data table.
Background
With the emergence of massive data in various industries, data needs to be processed in more and more scenarios, for example, by data table splicing and feature extraction.
Although neural networks are relatively universal in fields with relatively standardized inputs, such as images, speech, and text, there is still no general scheme for machine learning problems in scenarios with diverse data sets and many data tables. Multi-table splicing and feature extraction usually have to be performed manually, and only then is a neural network used for machine learning model training or prediction based on the extracted features.
Disclosure of Invention
An object of exemplary embodiments of the present invention is to provide a method and system for processing a data table that can implement data table splicing and feature extraction using a neural network with a specific structure.
According to an exemplary embodiment of the present invention, there is provided a method of processing a data table, comprising: acquiring a base sample table and at least one associated data table; and, for each base sample in the base sample table, processing the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample.
Optionally, the at least one associated data table includes: at least one static table and/or at least one flow table.
Optionally, the fields in the base sample table include: at least one subject ID field, a time field, and a label field; the primary key of the at least one static table is a subject ID field; and the primary key of the at least one flow table is a subject ID field together with a time field.
Optionally, the neural network with the specific structure comprises: a plurality of hidden layers, a first splicing layer, a second splicing layer, and a fusion layer, and, for each base sample in the base sample table, the step of processing the data related to that base sample in the at least one associated data table using the neural network with the specific structure to obtain the feature corresponding to that base sample comprises: for each base sample in the base sample table, inputting the field values of the fields belonging to each field type in the data record corresponding to that base sample in each static table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one static table is used as an input of the first splicing layer; for each base sample in the base sample table, inputting the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one flow table is used, via the second splicing layer, as the input of a specific hidden layer among the plurality of hidden layers; and taking the vector output by the fusion layer as the feature corresponding to each base sample in the base sample table, wherein the output of the first splicing layer and the output of the specific hidden layer are used as the inputs of the fusion layer, and wherein the input of the specific hidden layer is a vector sequence and its output is a vector of fixed dimension.
Optionally, the hidden layer corresponding to a category-type field in the at least one static table is: a first embedding layer, wherein the output of the first embedding layer is an embedding vector; the hidden layer corresponding to a time field in the at least one static table is: a differencing layer, which takes the difference between the field value of the time field in the static table and the field value of the time field in the base sample table; and the hidden layer corresponding to a category-type field in the at least one flow table is: a second embedding layer, wherein the output of the second embedding layer is an embedding vector.
Optionally, the hidden layer corresponding to a numeric-type field in the at least one static table is: a first fully connected layer; and the hidden layer corresponding to a numeric-type field in the at least one flow table is: a second fully connected layer.
Optionally, the step of inputting the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers comprises: for each base sample in the base sample table and for each flow table, sorting the data records in the flow table whose field value of the at least one subject ID field is the same as that of the base sample by their time field values, and determining, from the sorted data records, N data records whose time field values precede the time field value of the base sample; for each category-type field in the flow table, arranging the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and inputting the obtained time series into the second embedding layer; and, for each numeric-type field in the flow table, arranging the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and inputting the obtained time series into the second fully connected layer.
Optionally, the specific hidden layer is a hidden layer composed of a long short-term memory (LSTM) network.
Optionally, the neural network with the specific structure further comprises: a third fully connected layer and an output layer, wherein the output of the fusion layer is input to the output layer via the third fully connected layer, and wherein the output layer is configured to output the prediction result.
Optionally, the subject ID field in the base sample table comprises a user ID field, and the label field indicates a credit card application score; the at least one static table comprises at least one of: a user information table, an asset table, and a personal credit information table; and the at least one flow table comprises: a bank transaction flow table.
Optionally, the subject ID fields in the base sample table include a user ID field and an item ID field, and the label field indicates whether the user will purchase the item; the at least one static table comprises at least one of: a user information table and a commodity information table; and the at least one flow table comprises: an internet user behavior record table.
According to another exemplary embodiment of the present invention, there is provided a system for processing a data table, comprising: a data table acquisition device adapted to acquire a base sample table and at least one associated data table; and a data table processing device adapted to, for each base sample in the base sample table, process the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample.
Optionally, the at least one associated data table includes: at least one static table and/or at least one flow table.
Optionally, the fields in the base sample table include: at least one subject ID field, a time field, and a label field; the primary key of the at least one static table is a subject ID field; and the primary key of the at least one flow table is a subject ID field together with a time field.
Optionally, the neural network with the specific structure comprises: a plurality of hidden layers, a first splicing layer, a second splicing layer, and a fusion layer, wherein the data table processing device is adapted to, for each base sample in the base sample table, input the field values of the fields belonging to each field type in the data record corresponding to that base sample in each static table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one static table is used as an input of the first splicing layer; the data table processing device is adapted to, for each base sample in the base sample table, input the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one flow table is used, via the second splicing layer, as the input of a specific hidden layer among the plurality of hidden layers; and the data table processing device is adapted to take the vector output by the fusion layer as the feature corresponding to each base sample in the base sample table, wherein the output of the first splicing layer and the output of the specific hidden layer are used as the inputs of the fusion layer, and wherein the input of the specific hidden layer is a vector sequence and its output is a vector of fixed dimension.
Optionally, the hidden layer corresponding to a category-type field in the at least one static table is: a first embedding layer, wherein the output of the first embedding layer is an embedding vector; the hidden layer corresponding to a time field in the at least one static table is: a differencing layer, which takes the difference between the field value of the time field in the static table and the field value of the time field in the base sample table; and the hidden layer corresponding to a category-type field in the at least one flow table is: a second embedding layer, wherein the output of the second embedding layer is an embedding vector.
Optionally, the hidden layer corresponding to a numeric-type field in the at least one static table is: a first fully connected layer; and the hidden layer corresponding to a numeric-type field in the at least one flow table is: a second fully connected layer.
Optionally, the data table processing device is adapted to, for each base sample in the base sample table and for each flow table, sort the data records in the flow table whose field value of the at least one subject ID field is the same as that of the base sample by their time field values, and determine, from the sorted data records, N data records whose time field values precede the time field value of the base sample; for each category-type field in the flow table, arrange the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and input the obtained time series into the second embedding layer; and, for each numeric-type field in the flow table, arrange the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and input the obtained time series into the second fully connected layer.
Optionally, the specific hidden layer is a hidden layer composed of a long short-term memory (LSTM) network.
Optionally, the neural network with the specific structure further comprises: a third fully connected layer and an output layer, wherein the output of the fusion layer is input to the output layer via the third fully connected layer, and wherein the output layer is configured to output the prediction result.
Optionally, the subject ID field in the base sample table comprises a user ID field, and the label field indicates a credit card application score; the at least one static table comprises at least one of: a user information table, an asset table, and a personal credit information table; and the at least one flow table comprises: a bank transaction flow table.
Optionally, the subject ID fields in the base sample table include a user ID field and an item ID field, and the label field indicates whether the user will purchase the item; the at least one static table comprises at least one of: a user information table and a commodity information table; and the at least one flow table comprises: an internet user behavior record table.
According to another exemplary embodiment of the invention, a system is provided comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the method of processing a data table as described above.
According to another exemplary embodiment of the present invention, a computer-readable storage medium storing instructions is provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the method of processing a data table as described above.
According to the method and system for processing a data table, data table splicing and feature extraction are realized using a neural network with a specific structure. Furthermore, by integrating the data table splicing and feature extraction steps into the structure of the neural network, a machine learning model can be generated automatically, or predictions can be made by the machine learning model, through the neural network on the basis of the specified data tables.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Drawings
The above and other objects and features of exemplary embodiments of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:
FIG. 1 illustrates a flowchart of a method of processing a data table according to an exemplary embodiment of the present invention;
FIG. 2 illustrates an example of a neural network of a particular structure in accordance with an exemplary embodiment of the present invention;
FIG. 3 illustrates another example of a neural network of a particular structure in accordance with an exemplary embodiment of the present invention;
fig. 4 illustrates a block diagram of a system for processing a data table according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
Fig. 1 illustrates a flowchart of a method of processing a data table according to an exemplary embodiment of the present invention.
Referring to fig. 1, in step S10, a base sample table and at least one associated data table are obtained.
Here, the at least one associated data table is associated with the base sample table; for example, the field values of the primary key of the at least one associated data table include the field values of the subject ID field in the base sample table.
As an example, the fields in the base sample table may include: at least one subject ID field, a time field, and a label field. Here, the label field is the target to be predicted using machine learning techniques, and the time field may indicate the prediction time point of the corresponding base sample. For example, for a regression problem, the field value of the label field may be one or more numerical values; for a classification problem, the field value of the label field may be a class number.
As an example, the at least one associated data table may include: at least one static table and/or at least one flow table.
As an example, the at least one static table may be used to describe attribute information of a subject ID. As an example, the primary key of the at least one static table may be a subject ID field.
As an example, the primary key of the at least one flow table may be a subject ID field together with a time field. For example, a flow table may be a bank transaction flow table or an internet user behavior record table.
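For illustration only, a minimal sketch of what these three kinds of tables might look like is shown below as small pandas DataFrames; all table and column names (user_id, time, label, occupation, amount, and so on) are hypothetical and are not taken from the patent.

```python
# Hypothetical toy tables illustrating the three table roles described above.
# All names and values are illustrative assumptions, not from the patent.
import pandas as pd

# Base sample table: subject ID field(s) + time field + label field.
base_samples = pd.DataFrame({
    "user_id": [1, 2],
    "time":    pd.to_datetime(["2019-06-01", "2019-06-03"]),
    "label":   [1, 0],                       # prediction target
})

# Static table: primary key is the subject ID field (one row per subject).
user_info = pd.DataFrame({
    "user_id":    [1, 2],
    "occupation": ["teacher", "engineer"],   # category-type field
    "age":        [34, 28],                  # numeric-type field
    "reg_date":   pd.to_datetime(["2015-01-10", "2018-07-22"]),  # time field
})

# Flow table: primary key is (subject ID field, time field); many rows per subject.
transactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "time":    pd.to_datetime(["2019-05-20", "2019-05-28", "2019-06-02"]),
    "channel": ["atm", "online", "pos"],     # category-type field
    "amount":  [200.0, 35.5, 80.0],          # numeric-type field
})
```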
In step S20, for each base sample in the base sample table, the neural network with the specific structure is used to process the data associated with that base sample in the at least one associated data table, so as to obtain the feature corresponding to that base sample.
That is, using each base sample in the base sample table as an index, the data associated with that base sample in the at least one associated data table is input to the neural network with the specific structure, and the feature corresponding to that base sample is obtained through the neural network. This achieves an effect consistent with the following operation: splicing the at least one associated data table onto the base sample table and then extracting features from the spliced data table.
As an example, the data associated with each base sample in the at least one associated data table may include: the field values in the data records of the at least one associated data table that correspond to that base sample. For example, the data record in each static table corresponding to a base sample may be: the data record in the static table whose field value of the subject ID field is the same as the field value of the subject ID field of the base sample.
As an example, data from different types of data tables among the at least one associated data table may be input to different hidden layers in the neural network with the specific structure. Likewise, field values of different types of fields in the at least one associated data table may be input to different hidden layers of the network.
As an example, a corresponding machine learning model may subsequently be trained automatically based on machine learning samples composed of the field value of the label field of each base sample and the obtained feature corresponding to that base sample.
As another example, the resulting features corresponding to each base sample may then be input to a trained machine learning model to obtain a corresponding prediction result.
As an example, the method of processing a data table according to an exemplary embodiment of the present invention may be applied to scenarios that require machine learning over data tables, such as recommendation scenarios (e.g., commodity recommendation) or credit card application scenarios.
As an example, when the method of processing a data table according to an exemplary embodiment of the present invention is applied to a credit card application scenario, the subject ID field in the base sample table may include a user ID field, and the label field may indicate a credit card application score; the at least one static table may comprise at least one of: a user information table, an asset table, and a personal credit information table; and the at least one flow table may comprise: a bank transaction flow table.
As another example, when the method of processing a data table according to an exemplary embodiment of the present invention is applied to a commodity recommendation scenario, the subject ID fields in the base sample table may include two subject ID fields, namely a user ID field and an item ID field, and the label field may indicate whether the user will purchase the item; the at least one static table may comprise at least one of: a user information table and a commodity information table; and the at least one flow table may comprise: an internet user behavior record table. The fields in the internet user behavior record table may include: a user ID field, an item ID field, a time field, an action field (indicating an action such as browsing or purchasing), and other attribute fields describing the action (e.g., an amount field).
Fig. 2 shows an example of a neural network of a specific structure according to an exemplary embodiment of the present invention.
As shown in fig. 2, the neural network with the specific structure may include: a plurality of hidden layers, a first splicing layer, a second splicing layer, and a fusion layer. The outputs of the hidden layers corresponding one-to-one to the field types in the at least one static table are used as the input of the first splicing layer; the outputs of the hidden layers corresponding one-to-one to the field types in the at least one flow table are used, via the second splicing layer, as the input of a specific hidden layer among the plurality of hidden layers; and the output of the first splicing layer and the output of the specific hidden layer are the inputs of the fusion layer.
Accordingly, step S20 may include: for each base sample in the base sample table, inputting the field values of the fields belonging to each field type in the data record corresponding to that base sample in each static table into the hidden layer corresponding to that field type among the plurality of hidden layers; for each base sample in the base sample table, inputting the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers; and taking the vector output by the fusion layer as the feature corresponding to each base sample in the base sample table.
By way of example, the field types may include: a category type, a numeric type, and a time type. Specifically, a category-type field refers to a field used to indicate an unordered category (e.g., gender, occupation, or region), whose field values have no ordering relation; a numeric-type field refers to a field whose values are continuous numbers (e.g., an age value or an amount); and a time field refers to a field used to indicate a time (e.g., the registration date of a bank card, or a birthday).
As an example, the field values of a category-type field in the at least one associated data table may be input to an embedding layer (Embedding) in the neural network with the specific structure, where the output of the embedding layer is an embedding vector. The embedding layer can represent high-dimensional categorical features with fixed-dimension vectors.
Specifically, as an example, the hidden layer corresponding to a category-type field in the at least one static table may be: a first embedding layer; and the hidden layer corresponding to a category-type field in the at least one flow table may be: a second embedding layer.
As an example, the field values of a numeric-type field in the at least one associated data table may be input to a fully connected layer (Dense) in the neural network with the specific structure. The input and output of the fully connected layer may be vectors of a specified size. The field values of a numeric-type field do not need to be assembled into vectors before being input into the fully connected layer, because the data table itself is structured data: if the numeric field values of m rows and n fields need to be input into the fully connected layer, the input is simply a two-dimensional tensor of size m × n.
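As a minimal sketch of the input shape described above (assuming tf.keras; the sizes m, n and the layer width are arbitrary and not from the patent), the numeric field values of m data records with n numeric fields can be fed directly to a fully connected layer as an m × n array:

```python
# Hedged sketch: m records with n numeric fields fed to a Dense layer as-is.
import numpy as np
import tensorflow as tf

m, n = 4, 3                                              # 4 records, 3 numeric fields
numeric_values = np.random.rand(m, n).astype("float32")  # already a 2-D m x n array

dense = tf.keras.layers.Dense(units=8)                   # illustrative output width
out = dense(numeric_values)
print(out.shape)                                         # (4, 8): one vector per record
```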
Specifically, as an example, the hidden layer corresponding to a numeric-type field in the at least one static table may be: a first fully connected layer; and the hidden layer corresponding to a numeric-type field in the at least one flow table may be: a second fully connected layer.
As an example, the hidden layer corresponding to a time field in the at least one static table may be: a differencing layer, which takes the difference between the field value of the time field in the static table and the field value of the time field in the base sample table. For example, for each base sample in the base sample table, the differencing layer may take the difference between the time field value in the data record corresponding to that base sample in each static table and the time field value in that base sample.
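A minimal sketch of such a differencing layer is given below, assuming tf.keras and timestamps already converted to numbers (e.g., days); the sign convention and all names are assumptions rather than details from the patent.

```python
# Hedged sketch of the differencing layer for time fields.
import tensorflow as tf

sample_time = tf.keras.Input(shape=(1,), name="base_sample_time")   # e.g. days since epoch
static_time = tf.keras.Input(shape=(1,), name="static_table_time")

# Elementwise subtraction, e.g. "how long before the prediction time" (sign assumed).
time_diff = tf.keras.layers.Subtract(name="diff_layer")([sample_time, static_time])

diff_model = tf.keras.Model([sample_time, static_time], time_diff)
```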
As an example, the input of the specific hidden layer may be a vector sequence and its output a fixed-dimension vector, so that it can carry timing information.
As an example, the specific hidden layer may be a hidden layer composed of a long short-term memory (LSTM) network. The LSTM structure can convert a vector sequence into a fixed-dimension vector, so that the obtained fixed-dimension vector contains the sequence information. It should be understood that the specific hidden layer may also be formed by other types of neural networks, for example a recurrent neural network (RNN) or a convolutional neural network (CNN), and the present invention is not limited in this respect.
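A minimal sketch of such a specific hidden layer is shown below, assuming tf.keras; the sequence length and vector dimensions are arbitrary illustrative values.

```python
# Hedged sketch: an LSTM maps a vector sequence to a fixed-dimension vector.
import tensorflow as tf

seq_input = tf.keras.Input(shape=(None, 16), name="flow_vector_sequence")  # N x 16 vectors
fixed_vec = tf.keras.layers.LSTM(units=32, name="specific_hidden_layer")(seq_input)
# fixed_vec has shape (batch, 32) regardless of the sequence length N, so the
# sequence (timing) information is carried in a fixed-dimension vector.
```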
As an example, the input of a splicing layer (Concat) in the neural network with the specific structure may be a plurality of vectors, and its output is the new vector formed by concatenating those vectors.
As an example, the fusion layer (Merge) in the neural network with the specific structure may fuse vectors in at least one of the following ways: concatenation, addition, and subtraction. It should be understood that other suitable fusion methods may also be employed.
As an example, for each base sample in the base sample table, the step of inputting the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers may include: for each base sample in the base sample table and for each flow table, sorting the data records in the flow table whose field value of the at least one subject ID field is the same as that of the base sample by their time field values from earliest to latest, and determining, from the sorted data records, the N data records whose time field values precede, and are closest to, the time field value of the base sample; for each category-type field in the flow table, arranging the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and inputting the obtained time series into the second embedding layer; and, for each numeric-type field in the flow table, arranging the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and inputting the obtained time series into the second fully connected layer. It should be understood that these time series are input into the neural network with the specific structure to obtain the corresponding time-sequence features; the record-selection step is sketched below.
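The record-selection step just described can be sketched in pandas as follows, reusing the hypothetical transactions table from the earlier example; the function and column names are assumptions, not from the patent.

```python
# Hedged sketch of selecting, for one base sample, the N most recent flow-table
# records that share its subject ID and precede its time field value.
import pandas as pd

def flow_window(flow_table: pd.DataFrame, subject_id, sample_time, n: int) -> pd.DataFrame:
    records = flow_table[(flow_table["user_id"] == subject_id) &
                         (flow_table["time"] < sample_time)]
    records = records.sort_values("time")   # earliest to latest
    return records.tail(n)                  # the N records closest to sample_time

# window = flow_window(transactions, subject_id=1,
#                      sample_time=pd.Timestamp("2019-06-01"), n=5)
# window["channel"].tolist()  -> time series for the second embedding layer
# window["amount"].tolist()   -> time series for the second fully connected layer
```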
As an example, since the base sample table may include one or more subject ID fields, when the base sample table includes a plurality of subject ID fields, different subject combinations may be defined over those fields. For example, when the base sample table includes a subject ID1 field and a subject ID2 field, the different subject combinations may be: (subject ID1 field), (subject ID2 field), and (subject ID1 field, subject ID2 field). Correspondingly, for each base sample in the base sample table and for each subject combination, the data records in each flow table whose field values for that subject combination are the same as those of the base sample may be sorted by their time field values from earliest to latest, and a corresponding time series may then be obtained from the sorted data records.
As an example, when the at least one associated data table comprises only static tables, the neural network with the specific structure may comprise only: a first embedding layer, a first fully connected layer, a differencing layer, a first splicing layer, and a fusion layer. For example, when the at least one associated data table includes only one static table, the fusion layer may also be omitted.
As another example, when the at least one associated data table comprises only flow tables, the neural network with the specific structure may comprise only: a second embedding layer, a second fully connected layer, a second splicing layer, the specific hidden layer, and a fusion layer. For example, when the at least one associated data table comprises only one flow table, the fusion layer may also be omitted.
Fig. 3 illustrates another example of a neural network of a particular structure according to an exemplary embodiment of the present invention.
As shown in fig. 3, in addition to the plurality of hidden layers, the first splicing layer, the second splicing layer, and the fusion layer described with reference to fig. 2, the neural network with the specific structure may further include: a third fully connected layer and an output layer (Output), wherein the output of the fusion layer is input to the output layer via the third fully connected layer, and wherein the output layer is configured to output the prediction result. For example, for a classification problem, the prediction result may include the probability of belonging to each class; for a regression problem, the prediction result may include a regression value. It should be understood that the third fully connected layer may be replaced by other network structures, and the present invention is not limited in this respect. According to exemplary embodiments of the present invention, the whole process of machine learning training or prediction based on the specified data tables can thus be completed automatically through the neural network.
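For orientation, the following is a compact tf.keras sketch of how a network of this shape could be assembled; it assumes one static table (one category-type, one numeric-type, and one time field) and one flow table (one category-type and one numeric-type field), uses concatenation as the fusion operation, and all layer names, widths, and vocabulary sizes are arbitrary assumptions rather than values from the patent.

```python
# Hedged sketch of the overall structure described with reference to Figs. 2 and 3.
import tensorflow as tf
from tensorflow.keras import layers

N = 10  # number of flow-table records taken per base sample (illustrative)

# --- static-table path ---
cat_s  = tf.keras.Input(shape=(1,), dtype="int32", name="static_category")
num_s  = tf.keras.Input(shape=(1,), name="static_numeric")
time_s = tf.keras.Input(shape=(1,), name="static_time")
time_b = tf.keras.Input(shape=(1,), name="sample_time")

emb_s  = layers.Flatten()(layers.Embedding(1000, 8)(cat_s))     # first embedding layer
fc_s   = layers.Dense(8)(num_s)                                 # first fully connected layer
diff_s = layers.Subtract(name="diff_layer")([time_b, time_s])   # differencing layer
static_vec = layers.Concatenate(name="first_splicing_layer")([emb_s, fc_s, diff_s])

# --- flow-table path ---
cat_f = tf.keras.Input(shape=(N,), dtype="int32", name="flow_category_sequence")
num_f = tf.keras.Input(shape=(N, 1), name="flow_numeric_sequence")

emb_f = layers.Embedding(1000, 8)(cat_f)                        # second embedding layer -> (N, 8)
fc_f  = layers.TimeDistributed(layers.Dense(8))(num_f)          # second fully connected layer -> (N, 8)
flow_seq = layers.Concatenate(name="second_splicing_layer")([emb_f, fc_f])    # (N, 16)
seq_vec  = layers.LSTM(32, name="specific_hidden_layer")(flow_seq)            # fixed-dimension vector

# --- fusion and prediction head (Fig. 3) ---
fused  = layers.Concatenate(name="fusion_layer")([static_vec, seq_vec])
hidden = layers.Dense(64, activation="relu", name="third_fc_layer")(fused)
output = layers.Dense(1, activation="sigmoid", name="output_layer")(hidden)   # e.g. binary label

model = tf.keras.Model([cat_s, num_s, time_s, time_b, cat_f, num_f], output)
```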
Fig. 4 illustrates a block diagram of a system for processing a data table according to an exemplary embodiment of the present invention.
As shown in fig. 4, the system for processing a data table according to an exemplary embodiment of the present invention includes: a data sheet acquisition device 10 and a data sheet processing device 20.
In particular, the data table acquisition means 10 are adapted to acquire a base sample table and at least one associated data table.
As an example, the at least one associated data table may include: at least one static table and/or at least one flow table.
As an example, the fields in the base sample table may include: at least one subject ID field, a time field, and a label field; the primary key of the at least one static table may be a subject ID field; and the primary key of the at least one flow table may be a subject ID field together with a time field.
The data table processing device 20 is adapted to, for each base sample in the base sample table, process the data associated with that base sample in the at least one associated data table using a neural network with a specific structure, so as to obtain a feature corresponding to that base sample.
As an example, the neural network with the specific structure may include: a plurality of hidden layers, a first splicing layer, a second splicing layer, and a fusion layer, wherein the data table processing device 20 may be adapted to, for each base sample in the base sample table, input the field values of the fields belonging to each field type in the data record corresponding to that base sample in each static table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one static table is used as an input of the first splicing layer; the data table processing device 20 may be adapted to, for each base sample in the base sample table, input the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one flow table is input, via the second splicing layer, to a specific hidden layer among the plurality of hidden layers; and the data table processing device 20 may be adapted to take the vector output by the fusion layer as the feature corresponding to each base sample in the base sample table, wherein the output of the first splicing layer and the output of the specific hidden layer are used as the inputs of the fusion layer, and wherein the input of the specific hidden layer is a vector sequence and its output is a vector of fixed dimension.
As an example, the hidden layer corresponding to a category-type field in the at least one static table may be: a first embedding layer, wherein the output of the first embedding layer is an embedding vector; the hidden layer corresponding to a time field in the at least one static table may be: a differencing layer, which takes the difference between the field value of the time field in the static table and the field value of the time field in the base sample table; and the hidden layer corresponding to a category-type field in the at least one flow table may be: a second embedding layer, wherein the output of the second embedding layer is an embedding vector.
As an example, the hidden layer corresponding to a numeric-type field in the at least one static table may be: a first fully connected layer; and the hidden layer corresponding to a numeric-type field in the at least one flow table may be: a second fully connected layer.
As an example, the specific hidden layer may be a hidden layer composed of a long short-term memory (LSTM) network.
As an example, the data table processing device 20 may be adapted to, for each base sample in the base sample table and for each flow table, sort the data records in the flow table whose field value of the at least one subject ID field is the same as that of the base sample by their time field values, and determine, from the sorted data records, N data records whose time field values precede the time field value of the base sample; for each category-type field in the flow table, arrange the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and input the obtained time series into the second embedding layer; and, for each numeric-type field in the flow table, arrange the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and input the obtained time series into the second fully connected layer.
As an example, the neural network with the specific structure may further include: a third fully connected layer and an output layer, wherein the output of the fusion layer is input to the output layer via the third fully connected layer, and wherein the output layer is configured to output the prediction result.
As an example, the subject ID field in the base sample table may include a user ID field, and the label field may indicate a credit card application score; the at least one static table may comprise at least one of: a user information table, an asset table, and a personal credit information table; and the at least one flow table may comprise: a bank transaction flow table.
As an example, the subject ID fields in the base sample table may include a user ID field and an item ID field, and the label field may indicate whether the user will purchase the item; the at least one static table may comprise at least one of: a user information table and a commodity information table; and the at least one flow table may comprise: an internet user behavior record table.
It should be understood that the specific implementation manner of the system for processing a data table according to the exemplary embodiment of the present invention may be implemented with reference to the related specific implementation manner described in conjunction with fig. 1 to 3, and will not be described herein again.
The system for processing a data table according to an exemplary embodiment of the present invention includes devices that can be respectively configured as software, hardware, firmware, or any combination thereof to perform a specific function. These means may correspond, for example, to a dedicated integrated circuit, to pure software code, or to a module combining software and hardware. Further, one or more functions implemented by these apparatuses may also be collectively performed by components in a physical entity device (e.g., a processor, a client, a server, or the like).
It is to be understood that the method of processing a data table according to an exemplary embodiment of the present invention may be implemented by a program recorded on a computer-readable medium. For example, according to an exemplary embodiment of the present invention, there may be provided a computer-readable medium for processing a data table, on which a computer program for executing the following method steps is recorded: acquiring a base sample table and at least one associated data table; and, for each base sample in the base sample table, processing the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample.
The computer program in the computer-readable medium may be executed in an environment deployed on a computer device such as a client, a host, a proxy device, or a server. It should be noted that the computer program may also be used to perform steps additional to, or more specific processing within, the steps described above; the content of these additional steps and of the further processing has been described with reference to figs. 1 to 3 and is not repeated here.
It should be noted that the system for processing a data table according to an exemplary embodiment of the present invention may rely entirely on the execution of the computer program to realize the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the whole system is invoked through a dedicated software package (e.g., a lib library) to realize the corresponding functions.
On the other hand, each means included in the system for processing a data table according to an exemplary embodiment of the present invention may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present invention may also be implemented as a computing device comprising a storage component and a processor, the storage component having stored therein a set of computer-executable instructions that, when executed by the processor, perform a method of processing a data table.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions described above.
The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the instructions (or instruction sets) described above, individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote devices (e.g., via wireless transmission).
In the computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the method of processing a data table according to an exemplary embodiment of the present invention may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.
The processor may execute instructions or code stored in one of the memory components, which may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.
Further, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
The operations involved in a method of processing a data table according to an exemplary embodiment of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated with boundaries that are not exact.
For example, as described above, a computing device for processing a data table according to an exemplary embodiment of the present invention may include a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions which, when executed by the processor, perform the following steps: acquiring a base sample table and at least one associated data table; and, for each base sample in the base sample table, processing the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample.
While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims (10)

1. A method of processing a data table, comprising:
acquiring a base sample table and at least one associated data table;
and, for each base sample in the base sample table, processing the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample.
2. The method of claim 1, wherein,
the at least one associated data table includes: at least one static table and/or at least one flow table.
3. The method of claim 2, wherein,
the fields in the base sample table include: at least one subject ID field, a time field, and a label field;
the primary key of the at least one static table is a subject ID field; and
the primary key of the at least one flow table is a subject ID field together with a time field.
4. The method of claim 2, wherein the neural network with the specific structure comprises: a plurality of hidden layers, a first splicing layer, a second splicing layer, and a fusion layer,
wherein, for each base sample in the base sample table, the step of processing the data related to that base sample in the at least one associated data table using the neural network with the specific structure to obtain the feature corresponding to that base sample comprises:
for each base sample in the base sample table, inputting the field values of the fields belonging to each field type in the data record corresponding to that base sample in each static table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one static table is used as an input of the first splicing layer;
for each base sample in the base sample table, inputting the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers, wherein the output of the hidden layer corresponding to each field type in the at least one flow table is used, via the second splicing layer, as the input of a specific hidden layer among the plurality of hidden layers; and
taking the vector output by the fusion layer as: the feature corresponding to each base sample in the base sample table, wherein the output of the first splicing layer and the output of the specific hidden layer are the inputs of the fusion layer,
wherein the input of the specific hidden layer is a vector sequence and its output is a vector of fixed dimension.
5. The method of claim 4, wherein,
the hidden layer corresponding to a category-type field in the at least one static table is: a first embedding layer, wherein the output of the first embedding layer is an embedding vector;
the hidden layer corresponding to a time field in the at least one static table is: a differencing layer, which takes the difference between the field value of the time field in the static table and the field value of the time field in the base sample table; and
the hidden layer corresponding to a category-type field in the at least one flow table is: a second embedding layer, wherein the output of the second embedding layer is an embedding vector.
6. The method of claim 5, wherein,
the hidden layer corresponding to a numeric-type field in the at least one static table is: a first fully connected layer; and
the hidden layer corresponding to a numeric-type field in the at least one flow table is: a second fully connected layer.
7. The method of claim 6, wherein the step of inputting, for each base sample in the base sample table, the field values of the fields belonging to each field type in the data records corresponding to that base sample in each flow table into the hidden layer corresponding to that field type among the plurality of hidden layers comprises:
for each base sample in the base sample table and for each flow table, sorting the data records in the flow table whose field value of the at least one subject ID field is the same as that of the base sample by their time field values, and determining, from the sorted data records, N data records whose time field values precede the time field value of the base sample;
for each category-type field in the flow table, arranging the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and inputting the obtained time series into the second embedding layer; and
for each numeric-type field in the flow table, arranging the field values of that field in the N data records into a corresponding time series according to the ordering of the N data records, and inputting the obtained time series into the second fully connected layer.
8. A system for processing a data table, comprising:
a data table acquisition device adapted to acquire a base sample table and at least one associated data table; and
a data table processing device adapted to, for each base sample in the base sample table, process the data related to that base sample in the at least one associated data table using a neural network with a specific structure to obtain a feature corresponding to that base sample.
9. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform a method of processing a data table as claimed in any of claims 1 to 7.
10. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method of processing a data table as claimed in any one of claims 1 to 7.
CN201911189241.4A 2019-11-28 2019-11-28 Method and system for processing data table Active CN110955659B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310273245.0A CN116303459A (en) 2019-11-28 2019-11-28 Method and system for processing data table
CN201911189241.4A CN110955659B (en) 2019-11-28 2019-11-28 Method and system for processing data table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911189241.4A CN110955659B (en) 2019-11-28 2019-11-28 Method and system for processing data table

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202310273245.0A Division CN116303459A (en) 2019-11-28 2019-11-28 Method and system for processing data table

Publications (2)

Publication Number Publication Date
CN110955659A true CN110955659A (en) 2020-04-03
CN110955659B CN110955659B (en) 2023-02-21

Family

ID=69978760

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310273245.0A Pending CN116303459A (en) 2019-11-28 2019-11-28 Method and system for processing data table
CN201911189241.4A Active CN110955659B (en) 2019-11-28 2019-11-28 Method and system for processing data table

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202310273245.0A Pending CN116303459A (en) 2019-11-28 2019-11-28 Method and system for processing data table

Country Status (1)

Country Link
CN (2) CN116303459A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035582A (en) * 2020-08-28 2020-12-04 光大科技有限公司 Structured data classification method and device, storage medium and electronic device
CN112115137A (en) * 2020-08-17 2020-12-22 贝壳技术有限公司 Data configuration method and device
CN112347102A (en) * 2020-11-03 2021-02-09 第四范式(北京)技术有限公司 Multi-table splicing method and multi-table splicing device
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method
CN112927012A (en) * 2021-02-23 2021-06-08 第四范式(北京)技术有限公司 Marketing data processing method and device and marketing model training method and device
WO2022089652A1 (en) * 2020-11-02 2022-05-05 第四范式(北京)技术有限公司 Method and system for processing data tables and automatically training machine learning model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334638A (en) * 2018-03-20 2018-07-27 桂林电子科技大学 Collaborative Filtering method based on shot and long term Memory Neural Networks and interest migration
US20180314926A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Smart memory handling and data management for machine learning networks
US20190095515A1 (en) * 2017-09-25 2019-03-28 International Business Machines Corporation Automatic feature learning from a relational database for predictive modelling
CN109697066A (en) * 2018-12-28 2019-04-30 第四范式(北京)技术有限公司 Realize the method and system of tables of data splicing and automatic training machine learning model
CN110119413A (en) * 2019-04-30 2019-08-13 京东城市(南京)科技有限公司 The method and apparatus of data fusion
CN110222199A (en) * 2019-06-20 2019-09-10 青岛大学 A kind of character relation map construction method based on ontology and a variety of Artificial neural network ensembles
CN110275903A (en) * 2019-06-28 2019-09-24 第四范式(北京)技术有限公司 Improve the method and system of the feature formation efficiency of machine learning sample

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180314926A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Smart memory handling and data management for machine learning networks
US20190095515A1 (en) * 2017-09-25 2019-03-28 International Business Machines Corporation Automatic feature learning from a relational database for predictive modelling
CN108334638A (en) * 2018-03-20 2018-07-27 桂林电子科技大学 Collaborative Filtering method based on shot and long term Memory Neural Networks and interest migration
CN109697066A (en) * 2018-12-28 2019-04-30 第四范式(北京)技术有限公司 Realize the method and system of tables of data splicing and automatic training machine learning model
CN110119413A (en) * 2019-04-30 2019-08-13 京东城市(南京)科技有限公司 The method and apparatus of data fusion
CN110222199A (en) * 2019-06-20 2019-09-10 青岛大学 A kind of character relation map construction method based on ontology and a variety of Artificial neural network ensembles
CN110275903A (en) * 2019-06-28 2019-09-24 第四范式(北京)技术有限公司 Improve the method and system of the feature formation efficiency of machine learning sample

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周咏: "基于神经网络的异构医疗数据智能适配" [Intelligent adaptation of heterogeneous medical data based on neural networks], 《信息与电脑(理论版)》 [Information & Computer (Theory Edition)] *
李志义等: "基于表示学习的跨模态检索模型与特征抽取研究综述" [A survey of cross-modal retrieval models and feature extraction based on representation learning], 《情报学报》 [Journal of the China Society for Scientific and Technical Information] *
王可等: "基于Pearson关联度BP神经网络的时间序列预测" [Time series prediction based on a Pearson-correlation BP neural network], 《光学精密工程》 [Optics and Precision Engineering] *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115137A (en) * 2020-08-17 2020-12-22 贝壳技术有限公司 Data configuration method and device
CN112035582A (en) * 2020-08-28 2020-12-04 光大科技有限公司 Structured data classification method and device, storage medium and electronic device
WO2022089652A1 (en) * 2020-11-02 2022-05-05 第四范式(北京)技术有限公司 Method and system for processing data tables and automatically training machine learning model
CN112347102A (en) * 2020-11-03 2021-02-09 第四范式(北京)技术有限公司 Multi-table splicing method and multi-table splicing device
CN112347102B (en) * 2020-11-03 2024-05-07 第四范式(北京)技术有限公司 Multi-table splicing method and multi-table splicing device
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method
CN112434032B (en) * 2020-11-17 2024-04-05 北京融七牛信息技术有限公司 Automatic feature generation system and method
CN112927012A (en) * 2021-02-23 2021-06-08 第四范式(北京)技术有限公司 Marketing data processing method and device and marketing model training method and device

Also Published As

Publication number Publication date
CN116303459A (en) 2023-06-23
CN110955659B (en) 2023-02-21

Similar Documents

Publication Publication Date Title
CN110955659B (en) Method and system for processing data table
Bell Machine learning: hands-on for developers and technical professionals
WO2019192261A1 (en) Payment mode recommendation method and device and equipment
James Pattern recognition
WO2023011382A1 (en) Recommendation method, recommendation model training method, and related product
CN109408829B (en) Method, device, equipment and medium for determining readability of article
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN108664637A (en) A kind of search method and system
CN113761219A (en) Knowledge graph-based retrieval method and device, electronic equipment and storage medium
CN116601626A (en) Personal knowledge graph construction method and device and related equipment
CN110781925A (en) Software page classification method and device, electronic equipment and storage medium
CN113989476A (en) Object identification method and electronic equipment
CN115018588A (en) Product recommendation method and device, electronic equipment and readable storage medium
US20220366138A1 (en) Rule-based machine learning classifier creation and tracking platform for feedback text analysis
CN116883181B (en) Financial service pushing method based on user portrait, storage medium and server
US20240012809A1 (en) Artificial intelligence system for translation-less similarity analysis in multi-language contexts
CN113468421A (en) Product recommendation method, device, equipment and medium based on vector matching technology
CN115203194A (en) Metadata information generation method, related device, equipment and storage medium
CN117251761A (en) Data object classification method and device, storage medium and electronic device
CN113591881B (en) Intention recognition method and device based on model fusion, electronic equipment and medium
US20220172271A1 (en) Method, device and system for recommending information, and storage medium
CN114281990A (en) Document classification method and device, electronic equipment and medium
CN113807920A (en) Artificial intelligence based product recommendation method, device, equipment and storage medium
CN113434660A (en) Product recommendation method, device, equipment and storage medium based on multi-domain classification
US11941076B1 (en) Intelligent product sequencing for category trees

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant