CN110162558B - Structured data processing method and device - Google Patents

Structured data processing method and device

Info

Publication number
CN110162558B
Authority
CN
China
Prior art keywords
structured data
vector
neural network
model
tree model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910258145.4A
Other languages
Chinese (zh)
Other versions
CN110162558A (en)
Inventor
袁锦程
王维强
许辽萨
赵闻飙
席云
易灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201910258145.4A priority Critical patent/CN110162558B/en
Publication of CN110162558A publication Critical patent/CN110162558A/en
Application granted granted Critical
Publication of CN110162558B publication Critical patent/CN110162558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/242 - Query formulation
    • G06F16/243 - Natural language query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 - Integrating or interfacing systems involving database management systems
    • G06F16/258 - Data format conversion from or to a database
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present application relates to the field of computer technology and discloses a structured data processing method and device. The method comprises the following steps: obtaining structured data and preprocessing the structured data; converting the preprocessed structured data into text data, and splicing the text data to obtain natural language corresponding to the structured data; generating a vector corresponding to the structured data based on the natural language; and training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, intercepting a partial vector of the neural network, and using the intercepted partial vector of the neural network as a variable to be added to the tree model.

Description

Structured data processing method and device
Technical Field
The present application relates to the field of computer technology.
Background
Currently, model building techniques have been widely used in the industry.
Specifically, any process that uses a model to describe the causal relationships or interdependencies within a system belongs to modeling. Because the relationships being described differ, the means and methods of modeling are also diverse. For example, a model can be built from the mechanism of the underlying phenomenon by analyzing the motion laws of the system itself; it can also be built by processing experimental or statistical data of the system and drawing on existing knowledge and experience with the system; or several methods can be used simultaneously.
However, current model construction still has some problems.
Disclosure of Invention
The purpose of the present application is to provide a structured data processing method and device that better handle situations in which data cannot be used effectively, such as a high static-feature missing rate and sparse user behavior, so that modeling can be performed with sparse-matrix data and the data utilization rate is improved.
To solve the above problems, the present application discloses a structured data processing method, including:
obtaining structured data and preprocessing the structured data;
converting the preprocessed structured data into text data, and splicing the text data to obtain a natural language corresponding to the structured data;
generating a vector corresponding to the structured data based on the natural language;
and training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, intercepting a partial vector of the neural network, and using the intercepted partial vector of the neural network as a variable to be added to the tree model.
In a preferred embodiment, after the step of training the text model based on the vector, adjusting parameters of the neural network to obtain a score to be added to the tree model, and intercepting a partial vector of the neural network to use it as a variable to be added to the tree model, the method further comprises: adding the variable and the score to the tree model.
In a preferred embodiment, in the step of intercepting the partial vector of the neural network, the partial vector is intercepted from a fully connected layer, a recurrent neural network, a long short-term memory network, or a convolutional neural network.
In a preferred embodiment, the structured data is one or any combination of the following: the number of night gambling transactions, the amount of night transactions, and whether there was cash-back within a short time of the transaction.
In a preferred embodiment, in the step of generating the vector corresponding to the structured data based on the natural language, the vector is generated by any one of the following means: word2vec, or cw2vec, or cwe.
In a preferred embodiment, the step of training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, and intercepting a partial vector of the neural network to use it as a variable to be added to the tree model further comprises: performing parameter tuning using the neural network to obtain optimal parameters, wherein the parameters refer to the number of neurons of the neural network.
The present application also discloses a structured data processing apparatus, comprising:
a preprocessing module: for obtaining structured data and preprocessing the structured data;
a conversion module: for converting the preprocessed structured data into text data and splicing the text data to obtain natural language corresponding to the structured data;
a vector generation module: for generating a vector corresponding to the structured data based on the natural language;
a parameter-tuning and interception module: for training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, intercepting a partial vector of the neural network, and using the intercepted partial vector of the neural network as a variable to be added to the tree model.
In a preferred embodiment, the apparatus further comprises:
an adding module: for adding the variable and the score to the tree model.
The present application also discloses an apparatus for generating variables for model construction, comprising:
a memory for storing computer-executable instructions; and
a processor for implementing the steps of the method described above when executing the computer-executable instructions.
The application also discloses a computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the steps in the method as described hereinbefore.
Compared with the prior art, the method and device convert the structured data into text, obtain the corresponding natural language, and obtain the corresponding vectors, which are then added to a tree model or another model. The advantage is that, because the missing values can be ignored as described above, the descriptions of all objects can be written compactly, effectively solving the sparseness problem and enabling the data to be used more widely.
In the present application, a number of technical features are described in the specification and distributed among the various technical solutions; listing every possible combination of these technical features (i.e., every technical solution) would make the specification excessively long. To avoid this problem, the technical features disclosed in the above summary, the technical features disclosed in the following embodiments and examples, and the technical features disclosed in the drawings may be freely combined with one another to form various new technical solutions (all of which are regarded as having been described in this specification), unless such a combination is technically impossible. For example, if feature A+B+C is disclosed in one example and feature A+B+D+E is disclosed in another, where features C and D are equivalent technical means that perform the same function and can only be used as alternatives rather than together, while feature E can technically be combined with feature C, then the solution A+B+C+D should not be regarded as having been described, because it is technically impossible, whereas the solution A+B+C+E should be regarded as having been described.
Drawings
FIG. 1 is a flow diagram of a structured data processing method according to a first embodiment of the present application;
fig. 2 is a schematic structural view of a structured data processing apparatus according to a second embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, it will be understood by those skilled in the art that the claimed invention may be practiced without these specific details and with various changes and modifications from the embodiments that follow.
It should be noted that the inventors of the present application found that the static features of industry users have a missing rate above 90%, that user behavior is also very sparse, and that the behavior logs (e.g., whether there was a gambling complaint, browsing of various forbidden websites) have too many categories, which easily leads to a one-hot explosion.
In this case, modeling cannot be performed with such a sparse matrix, so the data in these sparse matrices cannot be used for modeling, and the data cannot be used more widely.
Therefore, the present application aims to enable modeling that makes more effective use of sparse-matrix data under conditions of a high static-feature missing rate and sparse user behavior, thereby improving the data utilization rate.
The innovation points of the present application include at least the following:
Automatic feature engineering is performed in a natural language processing manner. Specifically, structured data -> natural language -> vectors; the resulting vectors are then added to a tree model or other model.
The advantage is that, in the case of a high static-feature missing rate, the missing values in the sparse matrix can be ignored as described above, so the descriptions of all objects can be written compactly, effectively solving the user-behavior sparseness problem.
Furthermore, even if the service changes, the model structure does not need to be modified; only the old module needs to be replaced by a new one. Because of the nature of the attention mechanism or the convolutional neural network (CNN), the order can be neglected, so the problem of sparse features is effectively solved, modeling can be performed with sparse-matrix data, and the data utilization rate is improved.
Further, since the input features can be expressed as natural language, the prediction target can also be a segment of natural language, so the training target is more flexible.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
A first embodiment of the present application relates to a structured data processing method. It should be noted that this embodiment has a specific application scenario:
First, a sparse matrix is involved. For example, in the embodiments of the present application, more than 90% of the industry users' static features are missing, user behavior is very sparse, and the behavior logs (e.g., whether there are gambling complaints, browsing of various forbidden websites) have too many categories, leading to a one-hot explosion.
Second, prediction cannot be performed with a simple loss function. An example is a risk-control model that groups new and old users with respect to the possibility that a given user is engaged in gambling.
Third, comprehensiveness of the features matters. For example, describing the features from various angles or dimensions is particularly important, and expert experience or business understanding often cannot exhaust all possibilities.
It should be noted that in this embodiment, if any one of the above cases holds, the specific application scenario referred to in this application is considered to be satisfied.
In the above specific application scenario, as shown in fig. 1, the structured data processing method of the present embodiment includes the following steps:
step 110: and obtaining the structured data and preprocessing the structured data.
It is noted that structured data refers to data that is logically expressed and implemented by a two-dimensional table structure; it strictly conforms to data format and length specifications and is stored and managed mainly through relational databases. Unstructured data, by contrast, refers to data that is not suitable for representation in a two-dimensional database table, such as office documents, XML, HTML, various types of reports, pictures, and audio or video information.
For example, in embodiments of the present application, the structured data may be a user's behavior sequence data, i.e., behavior data generated by the user for items, arranged in time order. For example, a user's behavior sequence data may be "register - create substitution - change login name - unbind handset".
It should be noted that the behavior sequence data described above is characterized by an indefinite length.
In this case, it is not suitable for storage in a table, and, from the point of view of analysis, it is also not suitable for conventional table-oriented analysis methods.
Further, in the application scenario of this embodiment, the structured data has specific meanings, such as the number of night gambling transactions, the amount of night transactions, whether there was cash-back within a short time of the transaction, and so on.
[Table not reproduced (original image BDA0002014424010000051): an example table of structured data containing features such as "XXX", "DDD", finish_first_type 04, and task_amount 2.7.]
For example, in the above table, "XXX" and "DDD" are typical structured features.
Note that in this step, preprocessing refers to conventional preprocessing such as missing value padding, BBB column (e.g., unique id) removal, time normalization processing, and the like.
For example, the "unique id removal" process means that a BBB column (i.e., a unique id) carries essentially no meaning and should be removed in advance; it is therefore removed in the preprocessing step.
Also for example, "time normalization" processing refers to, for example, modifying "2018-08-11" to "20180811".
For another example, the "missing value filling" process means that if a column contains "nan" (i.e., null), it is filled with "0" or with the average value.
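As an illustration, a minimal preprocessing sketch is given below (assuming pandas; the column names unique_id, trade_date, and night_tx_amount are hypothetical, not the application's real fields):

    import pandas as pd

    # Hypothetical example rows for the three preprocessing operations described above.
    df = pd.DataFrame({
        "unique_id": ["u1", "u2", "u3"],
        "trade_date": ["2018-08-11", "2018-08-12", "2018-08-13"],
        "night_tx_amount": [2.7, None, 5.0],
    })

    df = df.drop(columns=["unique_id"])                                        # remove the meaningless unique-id ("BBB") column
    df["trade_date"] = pd.to_datetime(df["trade_date"]).dt.strftime("%Y%m%d")  # time normalization: 2018-08-11 -> 20180811
    df["night_tx_amount"] = df["night_tx_amount"].fillna(0)                    # missing value filling: "nan" -> 0 (or the mean)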
Step 120: converting the preprocessed structured data into text data and splicing the text data to obtain the natural language corresponding to the structured data.
Note that in this step, splicing means that the text data obtained by converting the structured data are spliced together; for example, finish_first_type 04 and task_amount 2.7 in the above table are spliced into "finish_first_type04;task_amount2.7", i.e., the structured features of each row are concatenated into the text of that row.
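A minimal sketch of this splicing (the helper function is hypothetical; the row values are taken from the example above):

    def row_to_text(row: dict) -> str:
        # Concatenate each feature name with its value and splice the pairs with ";".
        return ";".join(f"{name}{value}" for name, value in row.items())

    print(row_to_text({"finish_first_type": "04", "task_amount": 2.7}))
    # -> finish_first_type04;task_amount2.7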
Step 130: generating a vector corresponding to the structured data based on the natural language.
It should be noted that the core of steps 120-130 is to convert the structured data into text data and obtain corresponding natural language, and then generate a vector corresponding to the structured data based on the natural language.
Here, a vector (word embedding) is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP), in which words or phrases from a vocabulary are mapped to vectors of real numbers.
Conceptually, it involves a mathematical embedding from a space with one dimension per word into a continuous vector space of lower dimension. Methods of generating such a mapping include neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, interpretable knowledge-base methods, and explicit representation of the contexts in which words occur.
Word and phrase embedding has been demonstrated to improve the performance of Natural Language Processing (NLP) tasks, such as grammar analysis and emotion analysis, when used as an underlying input representation.
It should be noted that the benefit of using structured features as features for natural language processing is that the model can be made to attempt to understand the structured variables; in other words, the features evolve into natural language that the machine can read.
For example, consider the following structured features:
7932 1,3,7,3 8877 0,2,1,1
wherein the meaning of each feature, in order, is as follows:
7932 represents "Wangkai";
1 represents "sex: male";
3 represents "number of night transactions: 3 (more night transactions)";
7 represents "number of complaints: 7 (more complaints)";
38877 represents "Lin Xin";
0 represents "sex: female";
2 represents "number of night transactions: 2 (more night transactions)";
1 represents "number of complaints: 1";
1 represents "no risk library hit".
Correspondingly, this is converted into the following natural language:
the user Wangkai is male, has more transactions at night and more complaints; Lin Xin is female, hit the gambling library, has had no transactions in the last 30 days, complained once before, and has no risk library hit.
That is, 7932 1,3,7,38877 0,2,1,1 -> the user Wangkai is male, has more transactions at night and more complaints; Lin Xin is female, hit the gambling library, has more transactions at night, complained once before, and has no risk library hit.
It should be noted that the above steps address the following consideration: for a feature such as "7932 1,3,7,38877 0,2,1,1", encoding hits against the various risk libraries explodes the dimensionality, because the dimensionality of a one-hot encoding can be very high. For example, a column taking the values 1, 2, 3, 4, 9 becomes 5 columns: there are as many columns as there are unique values, so 500 unique values produce 500 columns. In that case the large dimensionality consumes a great deal of memory and computing resources, making one-hot or linear processing difficult; moreover, the behavior log cannot be forced into a unified format for the model. Natural language does not present such a problem.
In the specific scenario, the features are converted into natural language through the above steps, features are then generated with deep learning or other natural language processing tools, and in the subsequent steps these features are imported into the tree model. The advantage is that, as described above, in the prior art the large dimensionality consumes a great deal of memory and computing resources, one-hot or linear processing is difficult, and the behavior log cannot be forced into a unified format for the model; in the present application, by contrast, the structured features are converted into natural language, avoiding the large-dimensionality problem. Data that was previously not fully utilized can therefore be exploited, because much of it is sparse and would otherwise simply be discarded without contributing any gain to the model.
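To illustrate the one-hot explosion mentioned above, the following sketch (assuming pandas; the risk_lib_hit column is hypothetical) shows that a categorical column with N distinct values expands into N indicator columns:

    import pandas as pd

    # A column taking the values 1, 2, 3, 4, 9 becomes 5 indicator columns;
    # with 500 distinct risk-library values it would become 500 columns.
    hits = pd.DataFrame({"risk_lib_hit": [1, 2, 3, 4, 9]})
    print(pd.get_dummies(hits, columns=["risk_lib_hit"]).shape)  # (5, 5): one column per unique value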
It should be noted that in the embodiments of the present application, there are various ways to generate the vector, for example, word2vec, cw2vec, cwe, or with other neural networks.
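For instance, a minimal vector-generation sketch with word2vec might look as follows (assuming gensim; the spliced tokens and the vector size are illustrative, and cw2vec or cwe would be used analogously):

    from gensim.models import Word2Vec

    # Each spliced line of text is treated as a "sentence" whose tokens are feature-value strings.
    sentences = [
        ["finish_first_type04", "task_amount2.7", "night_tx3"],
        ["finish_first_type01", "task_amount0.5", "night_tx0"],
    ]
    model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=10)
    vector = model.wv["night_tx3"]  # 32-dimensional vector for one token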
In other words, the vector can be generated as a feature by a deep learning model, thereby quantizing the information into features that improve the model's performance. That is to say:
step 140: and training a text model based on the vector, adjusting parameters of the neural network to obtain a score for adding the tree model, and intercepting part of the vector of the neural network to take the intercepted part of the vector of the neural network as a variable for adding the tree model.
Specifically, in the present embodiment, the partial vector may be intercepted from any layer of the neural network, for example a fully connected layer, a recurrent neural network (e.g., GRU), a long short-term memory network (LSTM), a convolutional neural network (CNN), or the like.
In particular, the fully connected layer maps the learned feature representation to the sample label space. Each node of the fully connected layer is connected to all nodes of the previous layer, integrating the features extracted earlier in the network. Because of this fully connected nature, the fully connected layer generally also has the most parameters.
Specifically, the interception function is a substr-like function; in this embodiment of the present application, the interception may be configured to take out the corresponding fully connected layer.
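A minimal sketch of such an interception is given below (assuming Keras; the layer sizes, the layer name "fc", and the synthetic data are illustrative): the text model is trained once, its output is the score, and a sub-model exposes the fully connected layer's activations as the variables to be added to the tree model.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(20,))                      # 20 token ids per spliced sentence
    x = layers.Embedding(input_dim=1000, output_dim=32)(inputs)
    x = layers.LSTM(16)(x)                                 # could also be a GRU or Conv1D layer
    fc = layers.Dense(8, activation="relu", name="fc")(x)  # the fully connected layer to intercept
    score = layers.Dense(1, activation="sigmoid")(fc)      # the score to be added to the tree model
    model = keras.Model(inputs, score)
    model.compile(optimizer="adam", loss="binary_crossentropy")

    X = np.random.randint(0, 1000, size=(64, 20))          # synthetic token ids
    y = np.random.randint(0, 2, size=(64,))                # synthetic labels
    model.fit(X, y, epochs=1, verbose=0)

    # "Interception": a sub-model whose output is the fc-layer activations.
    fc_extractor = keras.Model(inputs, model.get_layer("fc").output)
    tree_variables = fc_extractor.predict(X)               # shape (64, 8), the tree-model variables
    tree_score = model.predict(X)                           # shape (64, 1), the tree-model score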
It should be noted that, apart from deep learning, the tree model is among the most widely used and most varied models in the field of machine learning. Tree models are mainly used for classification.
It is noted that embodiments of the present application may be applied to various types of tree models, including but not limited to lightgbm, xgboost, random forest, gbdt, and decision tree models.
Specifically, this step further includes a sub-step of parameter tuning using the neural network, through which optimal parameters are obtained.
Here, for the neural network the parameters refer to the number of neurons and the learning rate; for the tree model, they refer to the depth of the tree and the number of leaf nodes. These parameters can cover different network structures and neuron counts; by searching over the number of neurons and the network structure, good parameters are obtained, and the best-performing model is selected under these parameters.
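A toy parameter-tuning sketch over the neuron count and learning rate might look as follows (assuming Keras; the grid values and the synthetic data are illustrative, and tree-model parameters such as tree depth and leaf count would be searched in the same way):

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    X = np.random.rand(200, 10)                  # synthetic features
    y = np.random.randint(0, 2, size=200)        # synthetic labels
    best_params, best_acc = None, -1.0
    for n_neurons in (8, 16, 32):
        for lr in (1e-3, 1e-2):
            m = keras.Sequential([
                keras.Input(shape=(10,)),
                layers.Dense(n_neurons, activation="relu"),
                layers.Dense(1, activation="sigmoid"),
            ])
            m.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                      loss="binary_crossentropy", metrics=["accuracy"])
            history = m.fit(X, y, validation_split=0.2, epochs=3, verbose=0)
            acc = history.history["val_accuracy"][-1]
            if acc > best_acc:                   # keep the best-performing parameter combination
                best_params, best_acc = {"n_neurons": n_neurons, "lr": lr}, acc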
Further, in this case, the fully connected layer is truncated using the interception described above, and the truncated output is used as the variables of the tree model.
It should be noted that, in other embodiments of the present application, the code for intercepting the fully connected layer is not limited to the above; other options are also possible and are not described here.
Step 150: adding the variable and the score to the tree model.
Step 160: constructing a data model deployment platform based on the tree model.
In other words, in steps 150-160, after the variable and the score have been added to the tree model, the tree model with the added variable and score obtained in the above steps is connected to an MPS deployment and deployed with reference to a stacking model.
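A minimal sketch of this joining step is given below (assuming lightgbm's scikit-learn API; the shapes and synthetic arrays are illustrative): the intercepted fully connected layer vector and the neural-network score are appended to the structured variables as extra columns of the tree model's training matrix.

    import numpy as np
    import lightgbm as lgb

    n = 64
    structured = np.random.rand(n, 5)      # original structured variables
    fc_vector = np.random.rand(n, 8)       # intercepted fully connected layer vector
    nn_score = np.random.rand(n, 1)        # score produced by the neural network
    y = np.random.randint(0, 2, size=n)    # synthetic labels

    X = np.hstack([structured, fc_vector, nn_score])
    tree_model = lgb.LGBMClassifier(max_depth=4, num_leaves=15)  # tree-depth / leaf-count parameters
    tree_model.fit(X, y)
    print(tree_model.predict_proba(X)[:3])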
Note that MPS (Meta Programming System) is a language-oriented programming tool. A developer may use it to extend a programming language, or to create domain-specific languages (DSLs) for enterprise applications. MPS maintains code as an abstract syntax tree (AST). An AST is made up of nodes, which in turn contain attributes, child nodes, and references, and through which the program code is expressed in its entirety. When creating a new language, the developer defines the rules for coding and expression, and may also specify the constituent elements of the language's type system. MPS also adopts a code-generation approach: the program is expressed at a higher level in the new language, and MPS then generates compilable code in languages such as Java, XML, HTML, and JavaScript.
It should be noted that stacking generally refers to training a multi-layer (e.g., two-layer) learner architecture, in which the first layer (also called the learning layer) uses n different classifiers (or models with different parameters), combines their predictions into a new feature set, and uses this set as the input to the next layer's classifier. Since the raw data may be cluttered, in stacking, effective features are learned after passing through the first-layer learners; from this point of view, the first layer of stacking is a feature-extraction process. In the corresponding illustration (not reproduced here), the upper row is data that has not been stacked and the lower row is data processed by stacking (several unsupervised learning algorithms); experiments show that the demarcation in the lower row is clearly more distinct. Effective stacking can therefore effectively extract features from the raw data.
Specifically, in this step, under the online MPS deployment scheme, the whole deployment flow is as follows:
first, training the tree model with the variables added on the prediction platform. Then, a deployment link is arranged on the MPS, for example, a preprocessing module is the same as that in training, a neural network is used for generating a corresponding hidden layer, and the hidden layer is added into the variable of the tree model. And finally, deploying the MPS package to a model management platform server.
In this way, the first layer model can be introduced separately into the second layer model, after which predictions can be made online.
It should be noted that, under the above approach, testing shows that the average prediction time of the link is only 70 ms, which is very fast.
This embodiment has the following advantages:
First, since the missing values can be ignored as described above, the descriptions of all targets can be written compactly, which solves the user-behavior sparseness problem and allows the data to be used more widely.
Moreover, since the input features can be expressed as natural language, the prediction target can also be a segment of natural language, so the training target is more flexible.
A second embodiment of the present application relates to a structured data processing apparatus, whose structure is shown in fig. 2. The structured data processing apparatus comprises a preprocessing module, a conversion module, a vector generation module, and a parameter-tuning and interception module, specifically as follows:
a preprocessing module: for obtaining structured data and preprocessing the structured data;
a conversion module: for converting the preprocessed structured data into text data and splicing the text data to obtain natural language corresponding to the structured data;
a vector generation module: for generating a vector corresponding to the structured data based on the natural language;
a parameter-tuning and interception module: for training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, intercepting a partial vector of the neural network, and using the intercepted partial vector of the neural network as a variable to be added to the tree model.
Further, the above structured data processing apparatus may further include:
an adding module: for adding the variable and the score to the tree model.
Note that the first embodiment is a method embodiment corresponding to the present embodiment, and the technical details in the first embodiment can be applied to the present embodiment, and the technical details in the present embodiment can also be applied to the first embodiment.
It should be noted that, as will be understood by those skilled in the art, the implementation functions of the modules shown in the embodiments of the above structured data processing apparatus may be understood by referring to the description related to the foregoing structured data processing method. The functions of the modules shown in the above-described embodiments of the structured data processing apparatus may be implemented by a program (executable instructions) running on a processor, or may be implemented by a specific logic circuit. The above-described structured data processing apparatus according to the embodiments of the present application may also be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partly contributing to the prior art, and the computer software product may be stored in a storage medium, and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, the present description also provides a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method embodiments of the present description. Computer-readable storage media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable storage media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
In addition, the embodiment of the application also provides a generating device for the variable of the model construction, which comprises a memory for storing computer executable instructions and a processor; the processor is configured to implement the steps of the method embodiments described above when executing computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, abbreviated as "CPU"), other general purpose processors, digital signal processors (Digital Signal Processor, abbreviated as "DSP"), application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as "ASIC"), and the like. The aforementioned memory may be a read-only memory (ROM), a random access memory (random access memory, RAM), a Flash memory (Flash), a hard disk, a solid state disk, or the like. The steps of the method disclosed in the embodiments of the present invention may be directly embodied in a hardware processor for execution, or may be executed by a combination of hardware and software modules in the processor.
It should be noted that in the present patent application, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. In the present patent application, if it is mentioned that an action is performed according to an element, this means that the action is performed at least according to that element, and covers two cases: the action is performed solely according to that element, or the action is performed according to that element and other elements. Expressions such as "a plurality of" include two or more instances, occurrences, or kinds.
All references mentioned in this specification are to be considered as being included in the disclosure of this specification in their entirety so as to be applicable as a basis for modification when necessary. Furthermore, it should be understood that the foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present disclosure, is intended to be included within the scope of one or more embodiments of the present disclosure.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Claims (8)

1. A structured data processing method, comprising:
obtaining structured data and preprocessing the structured data;
converting the preprocessed structured data into text data, and splicing the text data to obtain natural language corresponding to the structured data;
generating a vector corresponding to the structured data based on the natural language;
training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, intercepting a partial vector of the neural network, and using the intercepted partial vector of the neural network as a variable to be added to the tree model;
the method further comprising:
adding the variable and the score to the tree model, wherein the tree model is connected to an MPS deployment and deployed with reference to a stacking model.
2. The method of claim 1, wherein in the step of intercepting the partial vector of the neural network, the partial vector is intercepted from a fully connected layer, a recurrent neural network, a long short-term memory network, or a convolutional neural network.
3. The method of claim 1, wherein the structured data is one or any combination of the following: the number of night gambling transactions, the amount of night transactions, and whether there was cash-back within a short time of the transaction.
4. The method of claim 1, wherein in the step of generating a vector corresponding to the structured data based on the natural language, the vector is generated by any one of: word2vec, or cw2vec, or cwe.
5. The method of claim 1, wherein the step of training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, and intercepting a partial vector of the neural network to use it as a variable to be added to the tree model further comprises: performing parameter tuning using the neural network to obtain optimal parameters, wherein the parameters refer to the number of neurons of the neural network.
6. A structured data processing apparatus, comprising:
a preprocessing module: for obtaining structured data and preprocessing the structured data;
a conversion module: for converting the preprocessed structured data into text data and splicing the text data to obtain natural language corresponding to the structured data;
a vector generation module: for generating a vector corresponding to the structured data based on the natural language;
a parameter-tuning and interception module: for training a text model based on the vector, adjusting parameters of a neural network to obtain a score to be added to a tree model, intercepting a partial vector of the neural network, and using the intercepted partial vector of the neural network as a variable to be added to the tree model;
the apparatus further comprising:
an adding module: for adding the variable and the score to the tree model, wherein the tree model is connected to an MPS deployment and deployed with reference to a stacking model.
7. An apparatus for generating variables for model construction, comprising:
a memory for storing computer executable instructions; the method comprises the steps of,
a processor for implementing the steps in the method of any one of claims 1 to 5 when executing the computer executable instructions.
8. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the steps in the method of any one of claims 1 to 5.
CN201910258145.4A 2019-04-01 2019-04-01 Structured data processing method and device Active CN110162558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910258145.4A CN110162558B (en) 2019-04-01 2019-04-01 Structured data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910258145.4A CN110162558B (en) 2019-04-01 2019-04-01 Structured data processing method and device

Publications (2)

Publication Number Publication Date
CN110162558A CN110162558A (en) 2019-08-23
CN110162558B true CN110162558B (en) 2023-06-23

Family

ID=67638421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910258145.4A Active CN110162558B (en) 2019-04-01 2019-04-01 Structured data processing method and device

Country Status (1)

Country Link
CN (1) CN110162558B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110867231A (en) * 2019-11-18 2020-03-06 中山大学 Disease prediction method, device, computer equipment and medium based on text classification
CN111831805A (en) * 2020-07-01 2020-10-27 中国建设银行股份有限公司 Model creation method and device, electronic equipment and readable storage device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299741A (en) * 2018-06-15 2019-02-01 北京理工大学 A kind of network attack kind identification method based on multilayer detection
CN109508461A (en) * 2018-12-29 2019-03-22 重庆猪八戒网络有限公司 Order price prediction technique, terminal and medium based on Chinese natural language processing

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9898458B2 (en) * 2015-05-08 2018-02-20 International Business Machines Corporation Generating distributed word embeddings using structured information
US9672814B2 (en) * 2015-05-08 2017-06-06 International Business Machines Corporation Semi-supervised learning of word embeddings
US20160364419A1 (en) * 2015-06-10 2016-12-15 Blackbird Technologies, Inc. Image and text data hierarchical classifiers
CN108021981A (en) * 2016-10-31 2018-05-11 北京中科寒武纪科技有限公司 A kind of neural network training method and device
CN106845139A (en) * 2017-02-28 2017-06-13 北京赛迈特锐医疗科技有限公司 Structured report is generated the system and method for natural language report
US10380259B2 (en) * 2017-05-22 2019-08-13 International Business Machines Corporation Deep embedding for natural language content based on semantic dependencies
CN107622333B (en) * 2017-11-02 2020-08-18 北京百分点信息科技有限公司 Event prediction method, device and system
CN109299887B (en) * 2018-11-05 2022-04-19 创新先进技术有限公司 Data processing method and device and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299741A (en) * 2018-06-15 2019-02-01 北京理工大学 A kind of network attack kind identification method based on multilayer detection
CN109508461A (en) * 2018-12-29 2019-03-22 重庆猪八戒网络有限公司 Order price prediction technique, terminal and medium based on Chinese natural language processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Sha Lei, "Order-Planning Neural Text Generation From Structured Data", arXiv, pp. 1-8 *
Tianyu Liu, "Table-to-Text Generation by Structure-Aware Seq2seq Learning", Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4881-4888 *
Junwei Bao, "Text Generation From Tables", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 311-320 *

Also Published As

Publication number Publication date
CN110162558A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
US11599714B2 (en) Methods and systems for modeling complex taxonomies with natural language understanding
US11250033B2 (en) Methods, systems, and computer program product for implementing real-time classification and recommendations
US10705796B1 (en) Methods, systems, and computer program product for implementing real-time or near real-time classification of digital data
US10467122B1 (en) Methods, systems, and computer program product for capturing and classification of real-time data and performing post-classification tasks
US10685189B2 (en) System and method for coupled detection of syntax and semantics for natural language understanding and generation
US20220100963A1 (en) Event extraction from documents with co-reference
US10650190B2 (en) System and method for rule creation from natural language text
US10978053B1 (en) System for determining user intent from text
US20220100772A1 (en) Context-sensitive linking of entities to private databases
US20230259707A1 (en) Systems and methods for natural language processing (nlp) model robustness determination
KR102588332B1 (en) Method for generating storyboard based on script text
CN110162558B (en) Structured data processing method and device
Pintye et al. Big data and machine learning framework for clouds and its usage for text classification
US20220229994A1 (en) Operational modeling and optimization system for a natural language understanding (nlu) framework
CN108763202A (en) Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
US20220100967A1 (en) Lifecycle management for customized natural language processing
WO2022072237A1 (en) Lifecycle management for customized natural language processing
US11816422B1 (en) System for suggesting words, phrases, or entities to complete sequences in risk control documents
CN114266255B (en) Corpus classification method, apparatus, device and storage medium based on clustering model
CN112100364A (en) Text semantic understanding method and model training method, device, equipment and medium
Karim Scala Machine Learning Projects: Build real-world machine learning and deep learning projects with Scala
KR102497436B1 (en) Method for acquiring information related to a target word based on content including a voice signal
US20230168989A1 (en) BUSINESS LANGUAGE PROCESSING USING LoQoS AND rb-LSTM
Dasgupta et al. A Review of Generative AI from Historical Perspectives
Jing et al. Classifying Packages for Building Linux Distributions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200929

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200929

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant