Disclosure of Invention
The purpose of this application is to provide a structured data processing method and device that solve the problem that data cannot be used effectively when the static-feature missing rate is large and user behaviors are sparse, so that sparse-matrix data can be used for modeling and the data utilization rate is improved.
To solve the above problems, the present application discloses a structured data processing method, including:
obtaining structured data and preprocessing the structured data;
converting the preprocessed structured data into text data, and splicing the text data to obtain a natural language corresponding to the structured data;
generating a vector corresponding to the structured data based on the natural language;
and training a text model based on the vector, tuning the parameters of a neural network to obtain a score to be added to a tree model, and intercepting a partial vector of the neural network to serve as a variable to be added to the tree model.
In a preferred embodiment, after the step of training the text model based on the vector, tuning the parameters of the neural network to obtain a score to be added to the tree model, and intercepting a partial vector of the neural network to serve as a variable to be added to the tree model, the method further comprises: adding the variable and the score to the tree model.
In a preferred embodiment, in the step of intercepting the partial vector of the neural network, the partial vector is intercepted from a fully connected layer, a recurrent neural network, a long short-term memory network, or a convolutional neural network.
In a preferred embodiment, the structured data is one or any combination of the following: the number of night gambling transactions, the amount of night transactions, and whether a transaction is cashed back within a short time.
In a preferred embodiment, in the step of generating the vector corresponding to the structured data based on the natural language, the vector is generated by any one of the following means: word2vec, cw2vec, or cwe.
In a preferred embodiment, the step of training a text model based on the vector, tuning the parameters of a neural network to obtain a score to be added to a tree model, and intercepting a partial vector of the neural network to serve as a variable to be added to the tree model further comprises: tuning the parameters of the neural network to obtain optimal parameters, where the parameters refer to the number of neurons of the neural network.
The application also discloses a structured data processing apparatus, comprising:
a preprocessing module, configured to obtain structured data and preprocess the structured data;
a conversion module, configured to convert the preprocessed structured data into text data and splice the text data to obtain a natural language corresponding to the structured data;
a vector generation module, configured to generate a vector corresponding to the structured data based on the natural language;
a parameter-tuning and interception module, configured to train a text model based on the vector, tune the parameters of a neural network to obtain a score to be added to a tree model, and intercept a partial vector of the neural network to serve as a variable to be added to the tree model.
In a preferred embodiment, the apparatus further comprises:
an adding module, configured to add the variable and the score to the tree model.
The application also discloses a device for generating variables for model construction, comprising:
a memory for storing computer-executable instructions; and
a processor for implementing the steps in the method as described hereinbefore when executing the computer-executable instructions.
The application also discloses a computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the steps in the method as described hereinbefore.
Compared with the prior art, the present application converts structured data into text to obtain a natural language, generates corresponding vectors, and adds the vectors to a tree model or other models. The advantage is that, since missing values can be ignored as described above, the descriptions of all objects can be written compactly, which effectively solves the sparseness problem and enables wider use of the data.
In the present application, a number of technical features are described in the specification and distributed among the technical solutions; listing all possible combinations of the technical features of the present application (i.e., all technical solutions) would make the specification excessively lengthy. To avoid this problem, the technical features disclosed in the above summary of the present application, the technical features disclosed in the following embodiments and examples, and the technical features disclosed in the drawings may be freely combined with each other to constitute various new technical solutions (all of which are regarded as already described in this specification), unless such a combination of technical features is technically impossible. For example, if feature A+B+C is disclosed in one example and feature A+B+D+E is disclosed in another, and features C and D are equivalent technical means performing the same function, of which technically only one may be used at a time (they cannot be adopted simultaneously), while feature E can technically be combined with feature C, then the solution A+B+C+D should not be considered as already described because of its technical impossibility, whereas the solution A+B+C+E should be considered as already described.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, it will be understood by those skilled in the art that the claimed invention may be practiced without these specific details and with various changes and modifications from the embodiments that follow.
It should be noted that the inventors of the present application found that, because more than 90% of the static features of industry users are missing, user behavior is also very sparse, and the behavior logs (e.g., whether there are gambling complaints, various forbidden web browsing) are too heterogeneous, which easily results in a one-hot explosion.
In this case, conventional modeling cannot handle a sparse matrix, so the data in these sparse matrices cannot be used for modeling, and the data cannot be used more widely.
Therefore, the method and the device of the present application aim to enable, under conditions of a large static-feature missing rate and sparse user behaviors, more effective modeling using sparse-matrix data, thereby improving the data utilization rate.
The innovation points of the application at least comprise:
performing automatic feature engineering in a natural language processing manner. Specifically, structured data is converted into natural language, which is then converted into vectors; the vectors are used to be added into a tree model or other models.
This has the advantage that, in the case of a large static-feature missing rate, the missing values in the sparse matrix can be ignored as described above, and thus the descriptions of all objects can be written compactly, effectively solving the user-behavior sparseness problem.
Furthermore, even if the business changes, the model structure does not need to be modified; only the old module needs to be replaced by a new one. Because of the nature of the attention mechanism or the convolutional neural network (CNN), the order of features is negligible, so the problem of sparse features is effectively solved, sparse-matrix data can be used for modeling, and the data utilization rate is improved.
Further, since the input features can be expressed as natural language, the prediction target can also be a segment of natural language, so the training target is more flexible.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
A first embodiment of the present application relates to a structured data processing method, and it should be noted that this embodiment has a specific application scenario:
first, a sparse matrix is used. For example, in the embodiments of the present application, more than 90% of the static features of industry users are missing, user behavior is very sparse, and too many kinds of behavior logs (e.g., whether there are gambling complaints, various forbidden web browsing) lead to a one-hot explosion.
Second, prediction cannot be performed with a simple loss function. For example, a new-and-old-user grouping model for the prevention and control of the possibility that a certain user is engaged in gambling.
Third, comprehensiveness of the features. For example, depicting features in various ways or dimensions is particularly important, and expert experience or business understanding often fails to exhaust all possibilities.
It should be noted that, in this embodiment, if any one of the above cases holds, the specific application scenario referred to in this application can be considered satisfied.
In the above specific application scenario, as shown in fig. 1, the structured data processing method of the present embodiment includes the following steps:
step 110: obtaining the structured data and preprocessing the structured data.
It is noted that structured data refers to data logically expressed and implemented by a two-dimensional table structure. Structured data strictly conforms to data format and length specifications, and is stored and managed mainly through relational databases. Whereas unstructured data refers to data that is not suitable for presentation by a two-dimensional table of a database, as opposed to structured data, such as office documents, XML, HTML, various types of reports, pictures, and audio, video information, etc.
For example, in embodiments of the present application, the structured data may be behavior sequence data of a user. The behavior sequence data of the user refers to behavior data generated by the user for items according to time sequence. For example, the behavior sequence data of the user is "register-create substitution-change login name-unbind handset".
It should be noted that the behavior sequence data described above is characterized by an indefinite length.
In this case, on the one hand such data is not suitable for table storage, and on the other hand, from an analysis point of view, it is not suitable for conventional table-oriented analysis methods.
Further, in the application scenario of the embodiment, the structured data has specific meaning, such as the number of night gambling transactions, the amount of night transactions, whether there is a short time of cashback at the time of transactions, and so on.
For example, in the above table, "XXX" and "DDD" are typical structured features.
Note that in this step, preprocessing refers to conventional preprocessing such as missing-value filling, removal of BBB columns (e.g., a unique id), time normalization, and the like.
For example, the "unique id removal" process means that, for a BBB column (i.e., a unique id), the column has virtually no meaning for modeling and should be removed in advance; it is therefore removed in the preprocessing step.
Also for example, the "time normalization" process refers to, for example, modifying "2018-08-11" to "20180811".
For another example, the "missing value filling" process refers to filling in "0" or the average value when a column contains "nan" (i.e., null).
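As an illustration only, the preprocessing described above (unique-id removal, time normalization, missing-value filling) can be sketched in Python with pandas; the column layout is hypothetical and not from the embodiment:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Conventional preprocessing as described in the text: drop unique-id
    columns, normalize dates, and fill missing numeric values."""
    df = df.copy()
    # Remove columns in which every value is unique (e.g. a user id):
    # such a column carries no modeling signal.
    for col in list(df.columns):
        if df[col].nunique() == len(df):
            df = df.drop(columns=col)
    # Time normalization: "2018-08-11" -> "20180811".
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.replace("-", "", regex=False)
    # Missing-value filling: numeric NaNs get the column mean (or 0).
    for col in df.select_dtypes(include="number"):
        df[col] = df[col].fillna(df[col].mean())
    return df
```

A usage sketch: a frame with columns `uid` (all unique), `date`, and `amt` with a missing value would come out with `uid` dropped, dates compacted, and the gap filled with the column mean.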
Step 120: converting the preprocessed structured data into text data, and splicing the text data to obtain the natural language corresponding to the structured data.
Note that in this step, splicing means that the text data obtained by converting the structured data are spliced together. For example, "finish_first_type 04" and "task_amount 2.7" in the above table are spliced into "finish_first_type04;task_amount2.7"; that is, the structured features of each row correspond to the text of that row, concatenated together.
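A minimal sketch of this splicing step, assuming each row is available as a Python dict (the feature names reuse the table examples above):

```python
def row_to_text(row: dict) -> str:
    """Splice one row's (feature, value) pairs into a single text line,
    e.g. {"finish_first_type": "04", "task_amount": 2.7}
    -> "finish_first_type04;task_amount2.7"."""
    return ";".join(f"{key}{value}" for key, value in row.items())

def table_to_corpus(rows: list) -> list:
    # Each structured row becomes one line of text, concatenated per row.
    return [row_to_text(row) for row in rows]
```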
Step 130: generating a vector corresponding to the structured data based on the natural language.
It should be noted that the core of steps 120-130 is to convert the structured data into text data and obtain corresponding natural language, and then generate a vector corresponding to the structured data based on the natural language.
Here, the vector is a word embedding, a generic term for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers.
Conceptually, such a vector involves a mathematical embedding from a space with one dimension per word to a continuous vector space of lower dimension. Methods of generating such mappings include neural networks, dimension reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representation of the context in which words occur.
Word and phrase embedding has been demonstrated to improve the performance of Natural Language Processing (NLP) tasks, such as grammar analysis and emotion analysis, when used as an underlying input representation.
It should be noted that the benefit of using structured features as features for natural language processing is that the model can be made to attempt to understand the structured variables; in other words, the features are evolved into a natural language that the machine can read.
For example, consider the following structured features:
7932 1,3,7,38877 0,2,1,1
wherein the meaning of each feature is, in turn, as follows:
7932 represents "Wangkai"
1 represents "sex: male"
3 represents "number of night transactions: 3 (more night transactions)"
7 represents "number of complaints: 7 (more complaints)"
38877 represents "Lin Xin"
0 represents "sex: female"
2 represents "number of night transactions: 2 (more night transactions)"
1 represents "number of complaints: 1"
1 represents "no risk-library hit"
Correspondingly, this is converted into the following natural language:
"The user Wangkai is male, with more transactions at night and more complaints; Lin Xin is female, hit the gambling library, with more transactions at night, one past complaint, and no risk-library hit."
That is, 7932 1,3,7,38877 0,2,1,1 -> the natural language above.
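The code-to-language conversion above can be sketched as follows. The lookup table and thresholds here are hypothetical stand-ins for whatever mapping the embodiment actually uses:

```python
# Hypothetical code table for illustration; a real system would load
# such mappings from configuration.
SEX = {"1": "male", "0": "female"}

def describe(user_name: str, sex_code: str, night_tx: int, complaints: int) -> str:
    """Turn one user's coded features into a natural-language clause.
    Thresholds (>= 2 means "more") are illustrative assumptions."""
    parts = [f"the user {user_name} is {SEX[sex_code]}"]
    if night_tx >= 2:
        parts.append("with more transactions at night")
    if complaints >= 2:
        parts.append("more complaints")
    elif complaints == 1:
        parts.append("one past complaint")
    return ", ".join(parts)
```

For example, `describe("Wangkai", "1", 3, 7)` yields a clause mentioning that the user is male with more night transactions and more complaints, matching the converted sentence above.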
It should be noted that the above steps take the following into account: for the feature "7932 1,3,7,38877 0,2,1,1", the various risk libraries hit would cause a dimension explosion if one-hot encoded, because the one-hot dimension can be high. For example, a column with 5 unique values such as 1,2,3,4,9 becomes 5 columns; there are as many columns as there are unique values, so 500 unique values yield 500 columns. In that case the large dimension consumes a great deal of memory and computing resources, so one-hot or linear processing is difficult to perform, and the behavior log likewise cannot be regulated into a unified format for the model. Natural language does not present such a problem.
In the specific scenario, the features are converted into natural language through the above steps, features are then generated using deep learning or other natural language processing tools, and in subsequent steps these features are imported into the tree model. The advantage is that, as described above, in the prior art the large dimension consumes a great deal of memory and computing resources, one-hot or linear processing is difficult to perform, and the behavior log cannot be regulated into a unified format for the model; in this application, by contrast, the structured features are converted into natural language, avoiding the large-dimension problem, so that data which were not fully utilized before can now be utilized, since many of those data are sparse and were previously discarded directly for lack of gain.
It should be noted that in the embodiments of the present application, there are various ways to generate the vector, for example, word2vec, cw2vec, cwe, or with other neural networks.
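As a rough illustration of vector generation, the sketch below builds word embeddings from a co-occurrence matrix reduced by SVD. This is a simplified stand-in for the word2vec/cw2vec/cwe tools the text names, not the embodiment's actual implementation:

```python
import numpy as np

def cooccurrence_embeddings(sentences, dim=2, window=1):
    """A minimal word-embedding sketch: count word co-occurrences within
    a window, then reduce the matrix with SVD. Real systems would use
    word2vec, cw2vec, cwe, or another neural network as the text notes."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    M[idx[w], idx[s[j]]] += 1.0
    # SVD of the co-occurrence matrix; keep the top `dim` components.
    U, S, _ = np.linalg.svd(M)
    d = min(dim, len(vocab))
    return {w: U[idx[w], :d] * S[:d] for w in vocab}
```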
In other words, the variable can be generated as a feature by the deep learning model, thereby quantizing the information into features that improve the model's effect.
step 140: and training a text model based on the vector, adjusting parameters of the neural network to obtain a score for adding the tree model, and intercepting part of the vector of the neural network to take the intercepted part of the vector of the neural network as a variable for adding the tree model.
Specifically, in the present embodiment, the partial vector may be intercepted from any layer of the neural network, for example a fully connected layer, a gated recurrent unit (GRU) network, a long short-term memory (LSTM) network, a convolutional neural network (CNN), or the like.
In particular, the fully connected layer maps the learned feature representation to the sample label space. Each node of the fully connected layer is connected to all nodes of the previous layer and integrates the features extracted by the preceding layers. Because of this full connectivity, the fully connected layer generally also has the most parameters.
Specifically, the interception function is a substr-like function; in the embodiment of the present application, the interception may be set to take out the corresponding fully connected layer.
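A toy sketch of this interception: a forward pass that returns both the final score and the fully connected hidden-layer activations, so the latter can be reused downstream. The network shape and weights are placeholders, not the embodiment's architecture:

```python
import numpy as np

def forward_with_hidden(x, W1, b1, W2, b2):
    """Forward pass of a tiny network that also returns ("intercepts")
    the fully connected hidden layer, so it can serve as tree-model
    variables alongside the final score."""
    hidden = np.maximum(0.0, x @ W1 + b1)              # fully connected layer (ReLU)
    score = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))  # sigmoid output score
    return score, hidden
```

In use, `hidden` would be taken out as the "intercepted partial vector" and `score` as the score to be added to the tree model.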
It should be noted that, apart from deep learning, tree models are the most widely used and particularly diverse class of models in the field of machine learning. Tree models are mainly used for classification.
It is noted that embodiments of the present application may be used with various types of tree models, including but not limited to lightgbm, xgboost, random forest, gbdt, and decision tree models.
Specifically, this step further includes a sub-step of tuning parameters of the neural network, through which optimal parameters are obtained.
Here, the parameters refer to the number of neurons of the network and the learning rate of the network; in the tree model, they refer to the depth of the tree and the number of leaf nodes. These parameters can cover different network structures and neuron counts; by searching over the number of neurons and the network structure, good parameters are obtained, and the best-performing model is selected from among these parameter settings.
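The tuning sub-step above can be sketched as a simple grid search over neuron count and learning rate; `train_eval` is a hypothetical callback standing in for the actual training-and-validation loop:

```python
def grid_search(train_eval, neuron_counts, learning_rates):
    """Sweep neuron count and learning rate and keep the best-scoring
    configuration. `train_eval(n, lr)` is assumed to train a model with
    those parameters and return a validation score (higher is better)."""
    best_params, best_score = None, float("-inf")
    for n in neuron_counts:
        for lr in learning_rates:
            score = train_eval(n, lr)
            if score > best_score:
                best_params, best_score = (n, lr), score
    return best_params, best_score
```

The same loop generalizes to tree-model parameters (tree depth, leaf count) by swapping the parameter lists.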
Further, in this case, the fully connected layer is intercepted, and the intercepted layer is used as variables of the tree model.
It should be noted that, in other embodiments of the present application, the code for intercepting the fully connected layer is not limited thereto; other options are also possible and will not be described herein.
Step 150: adding the variable and the score to the tree model.
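Step 150 amounts to concatenating the original structured features, the intercepted hidden-layer variables, and the network's score into one input matrix for the tree model. A sketch, with illustrative array shapes:

```python
import numpy as np

def assemble_tree_inputs(raw_features, hidden, score):
    """Concatenate the original structured features with the intercepted
    hidden-layer variables and the network's score, forming the input
    matrix that a tree model (lightgbm, xgboost, etc.) would consume."""
    return np.hstack([raw_features, hidden, score.reshape(len(score), 1)])
```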
Step 160: constructing a data model deployment platform based on the tree model.
In other words, in steps 150-160, after the variable and the score are added to the tree model, the tree model with the added variable and score obtained in the above steps is connected to the MPS deployment, which is carried out with reference to a stacking model.
Note that MPS (Meta Programming System) is a language-oriented programming tool. Developers may use it to extend a programming language, or to create domain-specific languages (DSLs) for enterprise applications. MPS maintains code with an abstract syntax tree (AST). An AST is made up of nodes, which in turn contain attributes, child nodes, and references, by which program code is expressed in its entirety. When creating a new language, the developer defines rules for its coding and expression, and may also specify the constituent elements of the language's type system. MPS also adopts a code-generation approach: the program is expressed at a higher level in the new language, and MPS then generates compilable code in languages such as Java, XML, HTML, and JavaScript.
It should be noted that stacking generally refers to training a multi-layer (e.g., two-layer) learner architecture: the first layer (also called the learning layer) uses n different classifiers (or models with different parameters), combines their predicted results into a new feature set, and uses it as input to the next-layer classifier. Since the raw data may be cluttered, in stacking, valid features are learned after passing through the first-layer learners; from this point of view, the first layer of stacking is a feature-extraction process. In an illustrative comparison, the upper row is data that has not been stacked and the lower row is data processed by stacking (several unsupervised learning algorithms); experiments clearly show that the demarcation of the data in the lower row is more obvious. Thus, effective stacking can effectively extract features from the raw data.
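A minimal sketch of the first stacking layer, in which each base learner's predictions become one column of the new feature set fed to the next-layer learner; the "learners" here are arbitrary callables for illustration:

```python
import numpy as np

def stack_features(X, first_layer):
    """First stacking layer: run each base learner on X and stack the
    predictions column-wise, forming the next layer's input features."""
    return np.column_stack([model(X) for model in first_layer])
```

The second-layer classifier would then be trained on the output of `stack_features` rather than on the raw data.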
Specifically, in this step, the online deployment scheme of the MPS, the whole set of deployment flow is as follows:
First, the tree model with the added variables is trained on the prediction platform. Then, a deployment link is arranged on the MPS: for example, the preprocessing module is the same as that used in training, the neural network is used to generate the corresponding hidden layer, and the hidden layer is added to the variables of the tree model. Finally, the MPS package is deployed to the model management platform server.
In this way, the first layer model can be introduced separately into the second layer model, after which predictions can be made online.
It should be noted that, tested in the above-described manner, the average prediction time of the link is only 70 ms, which is very fast.
This embodiment has the following advantages:
first, since missing values can be ignored as described above, the descriptions of all targets can be written compactly, thereby solving the user-behavior sparseness problem and allowing wider use of the data.
Moreover, since the input features can be expressed as natural language, the prediction target can also be a segment of natural language, so the training target is more flexible.
A second embodiment of the present application relates to a structured data processing apparatus, the structure of which is shown in fig. 2. The structured data processing apparatus comprises a preprocessing module, a conversion module, a vector generation module, and a parameter-tuning and interception module, specifically as follows:
a preprocessing module, configured to obtain structured data and preprocess the structured data;
a conversion module, configured to convert the preprocessed structured data into text data and splice the text data to obtain a natural language corresponding to the structured data;
a vector generation module, configured to generate a vector corresponding to the structured data based on the natural language;
a parameter-tuning and interception module, configured to train a text model based on the vector, tune the parameters of a neural network to obtain a score to be added to a tree model, and intercept a partial vector of the neural network to serve as a variable to be added to the tree model.
Further, the above structured data processing apparatus may further comprise:
an adding module, configured to add the variable and the score to the tree model.
Note that the first embodiment is a method embodiment corresponding to the present embodiment, and the technical details in the first embodiment can be applied to the present embodiment, and the technical details in the present embodiment can also be applied to the first embodiment.
It should be noted that, as will be understood by those skilled in the art, the implementation functions of the modules shown in the embodiments of the above structured data processing apparatus may be understood by referring to the description related to the foregoing structured data processing method. The functions of the modules shown in the above-described embodiments of the structured data processing apparatus may be implemented by a program (executable instructions) running on a processor, or may be implemented by a specific logic circuit. The above-described structured data processing apparatus according to the embodiments of the present application may also be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partly contributing to the prior art, and the computer software product may be stored in a storage medium, and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, the present description also provides a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method embodiments of the present description. Computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable storage media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
In addition, an embodiment of the application also provides a device for generating variables for model construction, comprising a memory for storing computer-executable instructions and a processor; the processor is configured to implement the steps of the method embodiments described above when executing the computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, abbreviated as "CPU"), another general-purpose processor, a digital signal processor (Digital Signal Processor, abbreviated as "DSP"), an application-specific integrated circuit (Application Specific Integrated Circuit, abbreviated as "ASIC"), or the like. The aforementioned memory may be a read-only memory (ROM), a random access memory (random access memory, RAM), a Flash memory (Flash), a hard disk, a solid-state disk, or the like. The steps of the method disclosed in the embodiments of the present invention may be directly embodied in and executed by a hardware processor, or may be executed by a combination of hardware and software modules in the processor.
It should be noted that in the present patent application, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. In the present patent application, if it is mentioned that an action is performed according to an element, it means that the action is performed at least according to the element, which includes two cases: the action is performed solely according to the element, and the action is performed according to the element and other elements. Expressions such as "multiple" include 2 and more than 2, 2 times and more than 2 times, and 2 kinds and more than 2 kinds.
All references mentioned in this specification are to be considered as being included in the disclosure of this specification in their entirety so as to be applicable as a basis for modification when necessary. Furthermore, it should be understood that the foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present disclosure, is intended to be included within the scope of one or more embodiments of the present disclosure.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.