WO2022048432A1 - Method for constructing recommendation model, method and apparatus for constructing neural network model, electronic device, and storage medium - Google Patents

Method for constructing recommendation model, method and apparatus for constructing neural network model, electronic device, and storage medium

Info

Publication number
WO2022048432A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
features
application scenario
aggregated
item
Prior art date
Application number
PCT/CN2021/112762
Other languages
English (en)
French (fr)
Inventor
洪立涛
晁涌耀
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022048432A1
Priority to US 18/072,622 (published as US20230094293A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/091 Active learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q 30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G06Q 30/0601 Electronic shopping [e-shopping]
    • G06Q 30/0631 Item recommendations

Definitions

  • The present application relates to artificial intelligence technology, and in particular to a method for constructing a recommendation model, and to a method, an apparatus, an electronic device, and a computer-readable storage medium for constructing a neural network model.
  • Artificial intelligence is a comprehensive technology of computer science. By studying the design principles and implementation methods of various intelligent machines, it gives machines the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, such as natural language processing technology and machine learning/deep learning. With the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
  • Neural network models include recommendation models, which can help users discover information that may be of interest to them in an information-overloaded environment and push that information to the users who are interested in it.
  • Embodiments of the present application provide a method for constructing a recommendation model, a method, an apparatus, an electronic device, and a computer-readable storage medium for constructing a neural network model, which can pre-aggregate multiple feature tables corresponding to application scenarios and improve the efficiency of constructing a recommendation model.
  • the embodiment of the present application provides a method for constructing a recommendation model, including:
  • the recommendation item includes multiple application scenarios corresponding to multiple recommendation indicators of the item to be recommended, and the recommendation model of each application scenario is used to predict the corresponding recommendation indicator;
  • the trained recommendation model can fit the user features and item features in the training sample set.
  • the embodiment of the present application provides a method for constructing a neural network model, including:
  • the application item includes multiple application scenarios corresponding to multiple application indicators one-to-one, and the neural network model of each application scenario is used to predict the corresponding application indicator;
  • the trained neural network model can fit the features in the training sample set.
  • the embodiment of the present application provides an apparatus for constructing a recommendation model, including:
  • the first aggregation module is configured to perform aggregation processing on multiple feature tables corresponding to each application scenario in the recommendation item, and send the obtained aggregated feature table to the cache space; wherein the recommendation item includes multiple application scenarios in one-to-one correspondence with multiple recommendation indicators of the item to be recommended, and the recommendation model of each application scenario is used to predict the corresponding recommendation indicator;
  • the first splicing module is configured to read the corresponding user features and item features from the aggregated feature table in the cache space based on the user IDs and item IDs included in the sample data table, and to splice them with the sample data table to form a training sample set;
  • the first training module is configured to train a recommendation model of the application scenario based on the training sample set, wherein the trained recommendation model can fit user features and item features in the training sample set.
  • the embodiment of the present application provides a device for constructing a neural network model, including:
  • the second aggregation module is configured to perform aggregation processing on multiple feature tables corresponding to each application scenario in the application project, and send the obtained aggregated feature table to the cache space; wherein the application project includes multiple application scenarios in one-to-one correspondence with multiple application indicators, and the neural network model of each application scenario is used to predict the corresponding application indicator;
  • a second splicing module configured to read corresponding features from the aggregated feature table in the cache space based on the feature identifiers included in the sample data table, and perform splicing processing with the sample data table to form a training sample set;
  • the second training module is configured to train a neural network model of the application scenario based on the training sample set, wherein the trained neural network model can fit the features in the training sample set.
  • An embodiment of the present application provides an electronic device for building a recommendation model, the electronic device comprising:
  • the processor is configured to implement the method for building a recommendation model provided by the embodiment of the present application when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides an electronic device for constructing a neural network model, the electronic device comprising:
  • the processor is configured to implement the method for building a neural network model provided by the embodiments of the present application when executing the executable instructions stored in the memory.
  • the embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the method for constructing a recommendation model provided by the embodiments of the present application.
  • Embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the method for building a neural network model provided by the embodiments of the present application.
  • The obtained aggregated feature table is sent to the cache space, so that the aggregated feature table can be reused when training the neural network model of the application scenario, which reduces the waste of computer resources and improves the efficiency of building neural network models.
  • FIG. 1 is a schematic diagram of an application scenario of a recommendation system provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of an electronic device for recommending model building provided by an embodiment of the present application
  • FIG. 3 and FIG. 4 are schematic flowcharts of a method for constructing a neural network model provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a splicing cache feature provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a method for constructing a recommendation model provided by an embodiment of the present application
  • FIG. 7 is a schematic flowchart of offline feature splicing provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a first-level cache provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a periodic feature aggregation provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a splicing history feature provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a 2-level index provided by an embodiment of the present application.
  • The terms "first" and "second" below are only used to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
  • Feature splicing: the preparatory work of the machine learning modeling process, including preparation of the original features and samples. Because the features are distributed across multiple storage locations and are not associated with the samples, they cannot be fed directly into algorithm training; the samples and features must first be spliced together, after which model training can proceed.
  • Feature index hash: after a feature (from the feature table) and a sample (from the sample data table) are spliced together, the feature key needs to be encoded to generate a corresponding index, which is input directly to the model for training. To improve feature index generation performance, a hash method is used to generate the hash value; its time complexity is O(1), which greatly improves performance.
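  • As a minimal sketch of this technique (the function and the size of the index space below are our assumptions, not the patent's), a feature key can be hashed to a stable integer index in O(1) per key:

```python
# Minimal sketch of hash-based feature index generation (names are
# illustrative, not from the patent). Each feature key is mapped to an
# integer index in O(1) instead of maintaining a sorted dictionary.
import hashlib

INDEX_SPACE = 2 ** 24  # assumed size of the feature index space

def feature_index(feature_key: str) -> int:
    """Hash a feature key (e.g. 'user_age=25') to a stable integer index."""
    digest = hashlib.md5(feature_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % INDEX_SPACE

print(feature_index("user_age=25"))        # same key always yields same index
print(feature_index("item_category=news"))
```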
  • Multi-cycle cache: algorithm training often requires samples to be spliced with features from multiple cycles in order to enrich the sample size and achieve better modeling results. When multiple cycles are spliced, performance is particularly low if full splicing is performed every time, so cache optimization is needed: historical spliced periodic data can be cached for subsequent incremental splicing.
  • the sample in the embodiment of the present application represents original identification data, for example, the sample includes identification data such as user identification, item identification, label, and weight.
  • the features in the embodiments of the present application represent entity data associated with the samples, for example, the features include features of user portraits, click features of items, statistical features of texts, and the like.
  • Data cleaning: performing certain processing on the provided original data to facilitate subsequent feature extraction. Data cleaning includes data splicing: since the provided data is scattered across multiple files, data splicing needs to be performed according to the corresponding key values.
  • the neural network models described in the embodiments of the present application can be applied to various fields, for example, image recognition neural networks, text recommendation neural networks, etc., that is, the neural network models in the embodiments of the present application are not limited to a certain field.
  • Feature stitching is required before the model is trained.
  • Feature stitching is an important module of machine learning modeling. Industrial modeling scenarios often face large numbers of samples and features, and multiple algorithms need to run in parallel, so stitching performance has a great influence on modeling efficiency.
  • the feature stitching technology includes online real-time stitching and offline stitching.
  • Online real-time stitching generates labels through real-time user behavior. For example, in click-through rate scenarios, clicked samples are often used as positive samples and exposed-but-not-clicked samples as negative samples, which are then associated with features to generate the input required for algorithm training.
  • However, online real-time splicing is only applicable to scenarios where sample labels can be determined in real time, and it does not support multi-cycle historical cache reuse, new scenarios, algorithm hot start, or historical data splicing.
  • offline stitching supports historical data analysis to generate samples.
  • Each stitching of samples may require more than 90 historical cycles, and the samples and features are on the order of hundreds of millions, which demands very high performance.
  • Offline splicing cannot meet the needs of such large-scale splicing, is not suited to multi-algorithm parallelism, and cannot reuse cached results for feature splicing.
  • To this end, the embodiments of the present application provide a method for constructing a recommendation model, and a method, an apparatus, an electronic device, and a computer-readable storage medium for constructing a neural network model, which can pre-aggregate the multiple feature tables corresponding to application scenarios so that the training algorithm can reuse the aggregated feature table when training the neural network model of an application scenario, reducing the waste of computer resources and improving the efficiency of building a recommendation model.
  • The method for building a neural network model may be implemented by the terminal or the server alone, or implemented collaboratively by the terminal and the server. For example, the terminal alone undertakes the method for building a neural network model described below; or the terminal sends a training request for the neural network model to the server, and the server executes the method for building a neural network model according to the received training request and sends the generated neural network model to the terminal, so that the corresponding application indicator can be predicted through the neural network model.
  • The electronic device used for constructing the neural network model may be any of various types of terminal devices or servers. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services; the terminal may be a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, or the like, but is not limited to these.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • For example, a server can be a server cluster deployed in the cloud that opens artificial intelligence cloud services (AIaaS, AI as a Service) to users.
  • The AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed mall: all users can access one or more of the artificial intelligence services provided by the AIaaS platform through application programming interfaces.
  • One of the artificial intelligence cloud services may be a neural network model construction service, that is, a server in the cloud encapsulates the neural network model construction program provided by the embodiments of the present application.
  • A developer calls the neural network model construction service in the cloud service through a terminal (running a client, such as a configuration client), so that the server deployed in the cloud invokes the encapsulated neural network model construction program: the corresponding features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set, and the neural network model of the application scenario is trained based on any training algorithm and the training sample set, in response to the training request for the neural network model.
  • The neural network model is used to predict the corresponding application indicator; for example, it may be an image recognition neural network model, a text recommendation neural network model, or the like.
  • Taking an image recognition application as an example, the server or terminal may perform aggregation processing on the multiple feature tables corresponding to each application scenario in the image recognition project and send the obtained aggregated feature table to the cache space, wherein the image recognition project includes multiple application scenarios in one-to-one correspondence with multiple prediction indicators of the object to be recognized, and the image recognition neural network model of each application scenario is used to predict the corresponding prediction indicator. Based on the object identifiers included in the sample data table, the corresponding object features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set. Based on the training sample set, the image recognition neural network model of the application scenario is trained so that it can fit the object features in the training sample set, wherein multiple training algorithms may be used to train the image recognition neural network model of the application scenario, and the corresponding prediction indicator is then predicted through the image recognition neural network model.
  • For example, when the function of constructing the neural network model provided by the embodiment of the present application is invoked for an image recognition project that includes a frontal face recognition scenario and a side face recognition scenario, for frontal face recognition the corresponding frontal face features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set, and the frontal face recognition neural network model of the frontal face recognition scenario is trained based on any one of multiple training algorithms and the training sample set, so that the corresponding frontal face indicator, such as the probability of belonging to a user's frontal face, can be predicted through the frontal face recognition neural network model. The embodiment of the present application can combine the frontal face recognition neural network model and the side face recognition neural network model to perform frontal face recognition and side face recognition on pedestrians who need to pass through an access control system, so as to improve the accuracy of face recognition and enhance the safety factor of the access control.
  • Taking a text recommendation application as an example, the server or terminal may perform aggregation processing on the multiple feature tables corresponding to each application scenario in the text recommendation project and send the obtained aggregated feature table to the cache space, wherein the text recommendation project includes multiple application scenarios corresponding to multiple recommendation indicators of the text to be recommended, and the text recommendation neural network model of each application scenario is used to predict the corresponding recommendation indicator.
  • For example, when the function of constructing the neural network model provided in the embodiment of the present application is invoked for a text recommendation project that includes a news click-through rate prediction scenario and a news exposure rate prediction scenario, for news click-through rate prediction the corresponding user features and news features are read from the aggregated feature table in the cache space based on the user IDs and news IDs included in the sample data table and spliced with the sample data table to form a training sample set, and the click-through rate prediction model of the news click-through rate scenario is trained based on any one of multiple training algorithms and the training sample set, so that the corresponding click-through rate can be predicted through the click-through rate prediction model.
  • The embodiment of the present application can combine the click-through rate prediction model and the exposure rate prediction model to predict the click-through rate and the exposure rate of a news item, and combine the click-through rate and the exposure rate to determine whether to recommend the news, so as to improve the accuracy of news recommendation and recommend news that better matches the user's interests.
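  • As an illustration only, the two predicted indicators might be combined as follows; the multiplicative rule and the threshold are our assumptions, not from the source:

```python
# Illustrative sketch only: the patent combines the predicted click-through
# rate and exposure rate to decide whether to recommend a news item; the
# weighting scheme and threshold below are our assumptions.
def should_recommend(click_rate: float, exposure_rate: float,
                     threshold: float = 0.1) -> bool:
    score = click_rate * exposure_rate   # assumed combination rule
    return score > threshold

print(should_recommend(0.4, 0.5))  # True: recommend this news item
```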
  • FIG. 1 is a schematic diagram of an application scenario of the recommendation system 10 provided by the embodiment of the present application.
  • The terminal 200 is connected to the server 100 through the network 300, and the network 300 may be a wide area network, a local area network, or a combination of the two.
  • The terminal 200 (running a client, such as a configuration client) can be used to obtain a training request for the recommendation model; for example, after the user inputs the multiple feature tables corresponding to an application scenario in the client, the terminal automatically obtains the training request for the recommendation model.
  • a recommendation model building plug-in may be embedded in the client running in the terminal, so as to implement the method for building a recommendation model locally on the client.
  • The terminal 200 calls the recommendation model building plug-in to implement the method for building the recommendation model: it reads the corresponding user features and item features from the aggregated feature table in the cache space, splices them with the sample data table to form a training sample set, and trains a recommendation model for the application scenario based on any training algorithm and the training sample set, in response to the training request for the recommendation model. The corresponding recommendation indicator is then predicted through the recommendation model; for example, the exposure rate of a product is predicted through the recommendation model, and whether to recommend the product is decided based on that exposure rate, so as to help users discover products that may be of interest to them.
  • the terminal 200 invokes the recommendation model building interface of the server 100 (which can be provided in the form of a cloud service, that is, the recommendation model building service).
  • The server 100 reads the corresponding user features and item features from the aggregated feature table in the cache space, splices them with the sample data table to form a training sample set, and trains the recommendation model of the application scenario based on any training algorithm and the training sample set, in response to the training request for the recommendation model.
  • the terminal or server may implement the method for building a recommendation model provided by the embodiments of the present application by running a computer program.
  • The computer program may be a native program or software module in an operating system; a native application (APP) that needs to be installed in the operating system to run; a mini program that only needs to be downloaded into the browser environment to run; or a mini program that can be embedded in any APP.
  • the above-mentioned computer programs may be any form of application, module or plug-in.
  • FIG. 2 is a schematic structural diagram of an electronic device 500 for building a recommended model provided by an embodiment of the present application.
  • Taking the electronic device 500 being a server as an example, the electronic device 500 for building a recommendation model shown in FIG. 2 includes: at least one processor 510, a memory 550, and at least one network interface 520.
  • the various components in electronic device 500 are coupled together by bus system 530 .
  • the bus system 530 is used to implement the connection communication between these components.
  • the bus system 530 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 530 in FIG. 2 .
  • the processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
  • Memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • the memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.
  • Memory 550 optionally includes one or more storage devices that are physically remote from processor 510 .
  • memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • the operating system 551 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • The apparatus for constructing a recommendation model provided by the embodiments of the present application may be implemented in software, for example, as the above-mentioned recommendation model building plug-in in the terminal, or as the above-mentioned recommendation model building service in the server. The apparatus may be provided as various software embodiments, including application programs, software, software modules, scripts, or code.
  • FIG. 2 shows an apparatus 555 for building a recommendation model stored in the memory 550, which may be software in the form of programs and plug-ins, such as a recommendation model building plug-in, and which includes a series of modules: a first aggregation module 5551, a first splicing module 5552, and a first training module 5553, used to implement the recommendation model building function provided by the embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for building a neural network model provided by an embodiment of the present application, which is described in conjunction with the steps shown in FIG. 3 .
  • neural network models described in the embodiments of the present application can be applied to various fields, such as image recognition neural networks, text recommendation neural networks, etc., that is, the neural network models in the embodiments of the present application are not limited to certain kinds of fields.
  • application items represent specific application tasks, such as face recognition, text recommendation, and so on.
  • The application scenario represents the scenario corresponding to a subtask under the application project, such as the frontal face recognition scenario and the side face recognition scenario under the face recognition project.
  • the feature table includes a plurality of feature key-value pairs.
  • In step 101, aggregation processing is performed on the multiple feature tables corresponding to each application scenario in the application project, and the obtained aggregated feature table is sent to the cache space.
  • the application item includes multiple application scenarios corresponding to multiple application indicators one-to-one, and the neural network model of each application scenario is used to predict the corresponding application indicator.
  • The central processing unit (CPU, Central Processing Unit) of the server obtains the multiple feature tables corresponding to each application scenario in the application project from the file system in advance, performs pre-aggregation processing on the multiple feature tables to obtain the aggregated feature table of the application scenario, and sends the aggregated feature table to the cache space (high-speed memory, a database, etc.) for subsequent reuse, avoiding the need to splice the feature tables for every training run and reducing computational complexity.
  • In some embodiments, performing aggregation processing on the multiple feature tables corresponding to each application scenario in the application project includes: performing the following processing for each application scenario of the application project: performing aggregation and deduplication processing on at least part of the features of the multiple feature tables corresponding to the application scenario to obtain the aggregated feature table of the application scenario.
  • each feature identifier in the aggregated feature table can be combined to obtain the feature metadata table of the application scenario.
  • The feature metadata table includes information such as the feature identifier, the source of the feature (which feature table the feature comes from), and the location of the feature in the aggregated feature table.
  • In this way, the feature can first be indexed from the feature metadata table; when the feature is found in the index, it means the feature exists in the aggregated feature table, so the feature can be read from the aggregated feature table.
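  • The following Python sketch illustrates this aggregation-plus-deduplication step and the accompanying feature metadata table (the table names and dictionary representation are illustrative assumptions, not the patent's data structures):

```python
# Sketch of aggregating several feature tables of one application scenario
# into a single deduplicated aggregated feature table, plus a feature
# metadata table recording each feature's source and position.
user_profile = {"u1:age": 25, "u1:city": "sz"}   # hypothetical feature tables
item_clicks = {"i9:ctr": 0.12, "u1:age": 25}     # 'u1:age' appears twice

def aggregate(tables):
    aggregated, metadata = {}, {}
    for table_name, table in tables.items():
        for key, value in table.items():
            if key not in aggregated:            # deduplicate repeated features
                metadata[key] = {"source": table_name, "pos": len(aggregated)}
                aggregated[key] = value
    return aggregated, metadata

agg, meta = aggregate({"user_profile": user_profile, "item_clicks": item_clicks})
print(agg)   # one copy of 'u1:age'
print(meta)  # which feature table each feature came from, and its position
```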
  • performing aggregation processing on at least part of the features of the multiple feature tables corresponding to the application scenario includes: performing aggregation processing on all the features of the multiple feature tables corresponding to the application scenario.
  • The central processing unit reads all the features of the multiple feature tables and aggregates them to obtain an aggregated feature table containing all the features. Therefore, as long as there is no new feature table, subsequent splicing operations can read features directly from the aggregated feature table; that is, the aggregated feature table will not miss any feature in the feature tables.
  • In some embodiments, performing aggregation processing on at least part of the features of the multiple feature tables corresponding to the application scenario includes: determining, from the multiple feature tables corresponding to each application scenario in the application project, the features common to the multiple training algorithms used to train the model of the application scenario; and performing aggregation processing on the common features to obtain the aggregated feature table of the application scenario.
  • That is, the features common to multiple training algorithms can be pre-aggregated; the features with higher usage frequency are aggregated to obtain the aggregated feature table of the application scenario. This increases the read frequency of the features in the aggregated feature table and reduces its size, so that features in the aggregated feature table can be read quickly later.
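  • A minimal sketch of selecting the common features, assuming each training algorithm's required feature set is known (the feature names are ours):

```python
# Sketch of pre-aggregating only the features that are common to the
# multiple training algorithms of one application scenario.
algo_a_features = {"user_age", "item_ctr", "user_city"}   # assumed sets
algo_b_features = {"user_age", "item_ctr", "item_price"}

common = algo_a_features & algo_b_features  # features used by every algorithm
print(common)  # only these high-frequency features enter the aggregated table
```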
  • In some embodiments, when a newly added feature table is detected, the newly added feature table and the aggregated feature table of the application scenario are spliced to obtain a new aggregated feature table, and the cache space is incrementally updated based on the new aggregated feature table.
  • When the central processing unit detects that a new feature table exists in the file system, it reads the aggregated feature table of the application scenario from the cache space and the new feature table from the file system, splices the new feature table with the aggregated feature table of the application scenario to obtain a new aggregated feature table, and updates the cache space based on the new aggregated feature table to replace the old aggregated feature table.
  • In some embodiments, performing aggregation processing on the multiple feature tables corresponding to each application scenario in the application project includes: when each new cycle of each application scenario arrives, performing aggregation processing on the newly added feature tables corresponding to the new cycle to obtain the aggregated feature table of the new cycle; and splicing the aggregated feature tables of the new cycles to obtain the aggregated feature table of the application scenario.
  • The feature tables in each application scenario have corresponding update cycles, such as a monthly update cycle, a weekly update cycle, and a daily update cycle.
  • When each new cycle arrives, the multiple feature tables of that cycle are aggregated to obtain the aggregated feature table of the new cycle. For example, for each month, the feature tables of the monthly update cycle are aggregated to obtain the aggregated feature table of that month; for each week, the feature tables of the weekly update cycle are aggregated to obtain the aggregated feature table of that week; and for each day, the feature tables of the daily update cycle are aggregated to obtain the aggregated feature table of that day.
  • the aggregated feature table of each new cycle is spliced to obtain the aggregated feature table of the application scenario.
  • The original features are divided according to their update cycles: the monthly feature table only needs to be computed once a month, the weekly feature table once a week, and the daily feature table once a day. The features of each corresponding period are aggregated separately, and then the day-level merge task is started. Therefore, it is not necessary to aggregate the features in all the feature tables; only the month, week, and day result tables (the monthly, weekly, and daily aggregated feature tables) need to be merged, and computing in this way greatly improves performance.
  • For example, on June 1, the features of the corresponding periods are aggregated separately to obtain the aggregated feature table of June, the aggregated feature table of the first week, and the aggregated feature table of June 1; then the day-level merge task is started, that is, the aggregated feature table of June, the aggregated feature table of the first week, and the aggregated feature table of June 1 are aggregated to obtain the aggregated feature table of the application scenario. This avoids aggregating all the feature tables in the application scenario every day for the daily update cycle, reducing the computing resources required.
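  • The day-level merge can be pictured as follows (a hedged sketch: dictionaries stand in for the monthly, weekly, and daily aggregated feature tables, and the keys are invented):

```python
# Sketch of the day-level merge task: only the three per-period result
# tables are merged each day, instead of re-aggregating every raw table.
monthly = {"user_age": 25}           # recomputed once per month
weekly = {"user_week_clicks": 7}     # recomputed once per week
daily = {"item_day_ctr": 0.12}       # recomputed once per day

scenario_table = {**monthly, **weekly, **daily}  # aggregated feature table
print(scenario_table)
```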
  • In some embodiments, when each new cycle of each application scenario arrives, the cache space is incrementally updated based on the aggregated feature table of the new cycle; when any new cycle of an application scenario has not arrived, before the aggregated feature tables of the new cycles are spliced, the aggregated feature table of the historical cycle corresponding to that new cycle is read from the cache space and used as the aggregated feature table of the new cycle, wherein the historical cycle corresponding to a new cycle is a cycle before the new cycle.
  • the aggregated feature table of the new period is used as the aggregated feature table of the historical period corresponding to the next new period.
  • For example, for the monthly update cycle, if the update time of a month has not been reached, the monthly aggregated feature table closest to this month (the aggregated feature table of the historical cycle) is read from the cache space and used as the aggregated feature table of this month, so that the aggregated feature tables of the new cycles can be spliced to obtain the aggregated feature table of the application scenario.
  • FIG. 4 is a schematic flowchart of a method for building a neural network model provided by an embodiment of the present application.
  • FIG. 4 shows that step 101 in FIG. 3 can be implemented through steps 1011 to 1013 shown in FIG. 4: in step 1011, mapping processing is performed on the feature identifiers in the aggregated feature table to obtain integer values of the feature identifiers; in step 1012, the feature identifiers in the aggregated feature table are updated to the integer values to obtain a compressed aggregated feature table; in step 1013, the compressed aggregated feature table is sent to the cache space.
  • Because feature identifiers are made relatively long for readability, the storage space of the aggregated feature table is large. The feature identifiers in the aggregated feature table can therefore be mapped to integer values to reduce their length, and the feature identifiers updated to the integer values to obtain a compressed aggregated feature table; the compressed aggregated feature table is then sent to the cache space, reducing the cache space occupied by the aggregated feature table and improving its read performance.
  • In step 102, based on the feature identifiers included in the sample data table, the corresponding features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set.
  • the sample data table includes multiple samples, and each sample includes multiple feature identifiers.
  • The developer can input the sample data table and training algorithm in the terminal; the terminal automatically generates a training request based on the sample data table and the training algorithm and sends it to the server. After receiving the training request, the server reads the corresponding features from the aggregated feature table in the cache space based on the feature identifiers included in the sample data table and splices them with the sample data table to form a training sample set, thereby reusing the aggregated feature table and quickly performing the splicing operation for the sample data table.
  • In some embodiments, when a corresponding feature is not read from the aggregated feature table in the cache space, the corresponding feature is read from the multiple feature tables corresponding to the application scenario and spliced with the sample data table to form the training sample set.
  • The central processing unit preferentially reads the features corresponding to the sample data table from the aggregated feature table in the cache space; only when those features are not found there does it read the multiple feature tables corresponding to the application scenario from the file system and read the features corresponding to the sample data table from them. This avoids reading features from the file system every time; by reading features from the cache space, the feature read rate can be greatly improved.
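  • The read path described above can be sketched as a simple read-through fallback (all names and table contents are illustrative assumptions):

```python
# Sketch of the read path: try the aggregated feature table in the cache
# space first, and only fall back to the per-scenario feature tables in the
# file system on a miss.
cache_space = {"f1": 0.5}                        # aggregated feature table
file_system_tables = [{"f2": 0.7}, {"f3": 0.9}]  # original feature tables

def read_feature(feature_id):
    if feature_id in cache_space:                # fast path: cache hit
        return cache_space[feature_id]
    for table in file_system_tables:             # slow path: scan the tables
        if feature_id in table:
            return table[feature_id]
    return None

print(read_feature("f1"), read_feature("f3"))
```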
  • In some embodiments, reading the corresponding features from the aggregated feature table in the cache space and splicing them with the sample data table includes: when each new cycle of each application scenario arrives, performing the following processing: based on the feature identifiers included in the sample data table of the new cycle, reading the corresponding features from the aggregated feature table of the new cycle in the cache space, and splicing the features with the sample data table of the new cycle to obtain the cache features of the new cycle.
  • Modeling in most application scenarios requires samples from multiple historical cycles to construct enough training data, allowing the algorithm to learn more knowledge and improving robustness. For example, if a daily-updated model requires one month (30 cycles) of samples, the sample data of 30 cycles must be recorded. When any of the 30 cycles arrives, based on the feature identifiers included in the sample data table of that cycle, the corresponding features are read from the aggregated feature table of that cycle in the cache space and spliced with the sample data table of that cycle to obtain the cache features of that cycle.
  • In some embodiments, when each new cycle of each application scenario arrives, the cache space is incrementally updated based on the cache features of the new cycle; the multiple cache features of the historical cycles corresponding to the new cycle are read from the cache space, and the cache features of the historical cycles and the cache features of the new cycle are spliced to obtain the training sample set of the application scenario, wherein a historical cycle is a cycle before the new cycle.
  • the cache feature of the new cycle is used as the cache feature of the historical cycle corresponding to the next new cycle.
  • For example, the cache features of the 30th cycle (the cache features of the new cycle) and the cache features of the 1st to 29th cycles obtained from the cache space are spliced to obtain the cache features of the 1st to 30th cycles, that is, the training sample set of the application scenario.
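  • A sketch of this 30-cycle reuse, assuming the cache space keeps the spliced cache features keyed by cycle (the representation is ours, not the patent's):

```python
# Sketch of the 30-cycle example: cycles 1-29 come from the cache space and
# only the new cycle 30 is spliced, then stored back for the next day.
cache_space = {c: [f"sample_c{c}"] for c in range(1, 30)}  # cached cycles 1-29

new_cycle = 30
cache_space[new_cycle] = [f"sample_c{new_cycle}"]  # incremental update

training_set = [row for c in range(1, 31) for row in cache_space[c]]
print(len(training_set))  # 30 cycles of samples without full re-splicing
```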
  • In some embodiments, primary key coding is performed on the feature identifier of each training sample in the training sample set to obtain the primary key code of the feature identifier; secondary key coding is performed on the feature identifier of each training sample to obtain the secondary key code of the feature identifier; the primary key code and the secondary key code are spliced to obtain the index code of the feature identifier; and the model of the application scenario is trained based on the encoded training samples.
  • The central processing unit encodes each training sample in the training sample set into a format suitable for model training and compresses the training samples to reduce their storage space. By splicing the primary key code and the secondary key code, the index code of the feature identifier is obtained; this index code effectively reduces the encoding collision rate, that is, the index code can uniquely characterize the feature identifier.
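  • A hedged sketch of such a two-level index code, assuming CRC32 as the primary key coding and an MD5-derived value as the secondary key coding (the patent does not specify which hash functions are used):

```python
# Sketch of two-level index encoding: a primary key code and a secondary
# key code are computed from the feature identifier and spliced into one
# index code, lowering the collision rate of a single hash.
import hashlib
import zlib

def index_code(feature_id: str) -> int:
    primary = zlib.crc32(feature_id.encode())                 # primary key code
    secondary = int(hashlib.md5(feature_id.encode()).hexdigest(), 16) % (1 << 32)
    return (primary << 32) | secondary                        # splice both codes

print(hex(index_code("user_age=25")))
```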
  • In step 103, the neural network model of the application scenario is trained based on the training sample set, so that the neural network model can fit the features in the training sample set.
  • The neural network model of the application scenario is trained based on any one of multiple training algorithms and the training sample set, so that the neural network model can fit the features in the training sample set. The multiple training algorithms are all used to train the neural network model of the application scenario; a training algorithm includes the hyperparameters of the model, the loss function, and so on.
  • After the central processing unit obtains the training sample set, it can train the neural network model of the application scenario based on any one of the multiple training algorithms and the training sample set, so as to construct the neural network model of the application scenario and predict the corresponding application indicator through it.
  • For example, in a face recognition application, the neural network models of the application scenarios include a frontal face recognition neural network model and a side face recognition neural network model; the frontal face recognition neural network model is used to predict the frontal face category (the probability of belonging to a user's frontal face), and the side face recognition neural network model is used to predict the side face category (the probability of belonging to a user's side face).
  • FIG. 6 is a schematic flowchart of a method for constructing a recommendation model provided by an embodiment of the present application, which is described in conjunction with the steps shown in FIG. 6 .
  • the recommended items represent specific recommended objects, such as news, commodities, videos, and so on.
  • In step 201, aggregation processing is performed on the multiple feature tables corresponding to each application scenario in the recommendation item, and the obtained aggregated feature table is sent to the cache space.
  • The recommendation item includes multiple application scenarios in one-to-one correspondence with multiple recommendation indicators of the item to be recommended, and the recommendation model of each application scenario is used to predict the corresponding recommendation indicator.
  • The central processor of the server obtains the multiple feature tables corresponding to each application scenario in the recommendation item from the file system in advance, performs pre-aggregation processing on the multiple feature tables to obtain the aggregated feature table of the application scenario, and sends the aggregated feature table to the cache space for subsequent reuse, avoiding the need to splice the feature tables for every training run and reducing computational complexity.
  • In some embodiments, performing aggregation processing on the multiple feature tables corresponding to each application scenario in the recommendation item includes: performing the following processing for each application scenario of the recommendation item: performing aggregation and deduplication processing on at least part of the features of the multiple feature tables corresponding to the application scenario to obtain the aggregated feature table of the application scenario; and combining each feature identifier in the aggregated feature table to obtain the feature metadata table of the application scenario.
  • In this way, the feature can first be indexed from the feature metadata table; when the feature is found in the index, it means the feature exists in the aggregated feature table, so the feature can be read from the aggregated feature table.
  • performing aggregation processing on at least part of the features of the multiple feature tables corresponding to the application scenario includes: performing aggregation processing on all the features of the multiple feature tables corresponding to the application scenario.
  • In some embodiments, performing aggregation processing on at least part of the features of the multiple feature tables corresponding to the application scenario includes: determining, from the multiple feature tables corresponding to each application scenario in the recommendation item, the features common to the multiple training algorithms used to train the recommendation model of the application scenario; and performing aggregation processing on the common features to obtain the aggregated feature table of the application scenario.
  • In some embodiments, when a newly added feature table is detected, the newly added feature table and the aggregated feature table of the application scenario are spliced to obtain a new aggregated feature table, and the cache space is incrementally updated based on the new aggregated feature table.
  • When the central processing unit detects that a new feature table exists in the file system, it reads the aggregated feature table of the application scenario from the cache space and the new feature table from the file system, splices the new feature table with the aggregated feature table of the application scenario to obtain a new aggregated feature table, and updates the cache space based on the new aggregated feature table to replace the old aggregated feature table.
  • In some embodiments, performing aggregation processing on the multiple feature tables corresponding to each application scenario in the recommendation item includes: when each new cycle of each application scenario arrives, performing aggregation processing on the newly added feature tables corresponding to the new cycle to obtain the aggregated feature table of the new cycle; and splicing the aggregated feature tables of the new cycles to obtain the aggregated feature table of the application scenario.
  • In some embodiments, when each new cycle of each application scenario arrives, the cache space is incrementally updated based on the aggregated feature table of the new cycle; when any new cycle of an application scenario has not arrived, before the aggregated feature tables of the new cycles are spliced, the aggregated feature table of the historical cycle corresponding to that new cycle is read from the cache space and used as the aggregated feature table of the new cycle, wherein the historical cycle corresponding to a new cycle is a cycle before the new cycle.
  • the aggregated feature table of the new period is used as the aggregated feature table of the historical period corresponding to the next new period.
  • For example, for the monthly update cycle, if the update time of a month has not been reached, the monthly aggregated feature table closest to this month (the aggregated feature table of the historical cycle) is read from the cache space and used as the aggregated feature table of this month, so that the aggregated feature tables of the new cycles can be spliced to obtain the aggregated feature table of the application scenario.
  • In some embodiments, sending the obtained aggregated feature table to the cache space includes: performing mapping processing on the feature identifiers in the obtained aggregated feature table to obtain integer values of the feature identifiers; updating the feature identifiers in the aggregated feature table to the integer values to obtain a compressed aggregated feature table; and sending the compressed aggregated feature table to the cache space.
  • In step 202, based on the user IDs and item IDs included in the sample data table, the corresponding user features and item features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set.
  • the sample data table includes multiple samples, and each sample includes multiple feature identifiers, such as user identifiers and item identifiers.
  • The developer can input the sample data table and training algorithm on the terminal; the terminal automatically generates a training request based on the sample data table and the training algorithm and sends it to the server. After receiving the training request, the server reads the corresponding user features and item features from the aggregated feature table in the cache space based on the user IDs and item IDs included in the sample data table and splices them with the sample data table to form a training sample set, thereby reusing the aggregated feature table and quickly performing the splicing operation for the sample data table.
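  • A minimal sketch of this user/item splicing step (the table layouts and identifiers are our assumptions):

```python
# Sketch of step 202: user features and item features are read from the
# aggregated feature table by user ID and item ID, then spliced onto each
# sample row to form the training sample set.
user_features = {"u1": {"age": 25}}
item_features = {"i9": {"ctr": 0.12}}
sample_table = [{"user_id": "u1", "item_id": "i9", "label": 1}]

training_set = [
    {**row, **user_features[row["user_id"]], **item_features[row["item_id"]]}
    for row in sample_table
]
print(training_set)
```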
  • In some embodiments, when a corresponding user feature or item feature is not read from the aggregated feature table in the cache space, the corresponding user feature or item feature is read from the multiple feature tables corresponding to the application scenario and spliced with the sample data table to form the training sample set.
  • The central processing unit preferentially reads the user features or item features corresponding to the sample data table from the aggregated feature table in the cache space; only when they are not found there does it read the multiple feature tables corresponding to the application scenario from the file system and read the user features or item features corresponding to the sample data table from them. This avoids reading user features or item features from the file system every time; by reading them from the cache space, the feature read rate can be greatly improved.
  • In some embodiments, reading the corresponding user features and item features from the aggregated feature table in the cache space and splicing them with the sample data table includes: when each new cycle of each application scenario arrives, performing the following processing: based on the user IDs and item IDs included in the sample data table of the new cycle, reading the corresponding user features and item features from the aggregated feature table of the new cycle in the cache space, and splicing the user features and item features with the sample data table of the new cycle to obtain the cache features of the new cycle.
  • the sample data table based on the cycle includes the user ID and Item identification, read the corresponding user features and item features from the aggregated feature table of the cycle in the cache space, and splicing the user features and item features with the sample data table of the cycle to obtain the cache feature of the cycle.
  • the cache space is incrementally updated based on the cache characteristics of the new period; and multiple cache characteristics of the historical period corresponding to the new period are read from the cache space ; Perform splicing processing on multiple cache features of the historical period and the cache features of the new period to obtain a training sample set of the application scenario; wherein, the historical period is the period before the new period.
  • the cache feature of the new cycle is used as the cache feature of the historical cycle corresponding to the next new cycle.
  • the cache characteristics of the 30th cycle the cache characteristics of the new cycle
  • the cache characteristics of the 1st to 29th cycles are obtained from the cache space
  • the cache characteristics of the 1st to 30th cycles are obtained.
  • the cached features are spliced to obtain the training sample set of the application scenario.
  • primary key coding is performed on the feature identifier of each training sample in the training sample set to obtain the primary key coding value of the feature identifier; secondary key coding is performed on the feature identifier of each training sample to obtain The secondary key coding value of the feature identification; splicing the primary key coding value and the secondary key coding value to obtain the index coding value of the feature identification;
  • the training samples train the recommendation model for the application scenario.
  • step 203 the recommendation model of the application scenario is trained based on the training sample set, so that the recommendation model can fit the user features and item features in the training sample set.
  • the recommendation model of the application scenario is trained based on any one of the multiple training algorithms and the training sample set, so that the recommendation model can fit the user features and item features in the training sample set, and the multiple training algorithms are used for training the application scenarios.
  • Recommended models, training algorithms include model hyperparameters, loss functions, etc.
  • the central processor obtains the training sample set, it can train the recommendation model of the application scenario based on any training algorithm of multiple training algorithms and the training sample set to construct the recommendation model of the application scenario, and predict the corresponding Recommended metrics.
  • the recommendation model of the application scenario includes a click rate prediction model and an exposure rate prediction model.
  • the click rate prediction model is used to predict the click rate of an advertisement
  • the exposure rate prediction model is used to predict the exposure rate of an advertisement.
  • the online real-time splicing in the related art is not applicable to the scene where the sample label is determined in real time, and does not support multi-cycle historical cache multiplexing, new scenarios, algorithm hot start, and historical data splicing.
  • offline stitching supports historical data analysis to generate samples.
  • Each stitching of samples requires more than 90 historical cycles, and the samples and features are in the order of hundreds of millions, which requires very high performance.
  • Offline splicing cannot meet the needs of large-scale splicing, and it is not suitable for multi-algorithm parallelism, and feature splicing cannot reuse cache.
  • an embodiment of the present application proposes a high-performance splicing method (ie, a method for constructing a neural network model), which adopts a multi-level cache mechanism, that is, a first-level cache and a second-level cache.
  • a high-performance splicing method ie, a method for constructing a neural network model
  • the first-level cache is divided according to the algorithm application scenarios and the features used. Since there are many common features of similar algorithms in the same application scenario, the outer layer can be pre-aggregated to generate an aggregation feature table, and each subsequent algorithm can be used.
  • the aggregated features in the aggregated feature table are preferentially used to avoid each algorithm needing to join samples with all feature tables, which wastes computing resources.
  • the aggregated feature occupies a large amount of storage, and the later calculation process takes a long time to load the feature. This is also compressed and optimized by mapping the feature key (feature identifier) to an integer (int); 2)
  • the second-level cache in the specific algorithm splicing, the results (cache features) of multi-cycle splicing are cached. For example, in the model day update scenario, it is often necessary to splicing samples of more than 3 months, and the full splicing cannot be updated every day.
  • the splicing results of the plaintext feature cache are implemented, and the new periodic samples only need to be incrementally spliced to effectively solve the performance problem.
  • the last spliced feature needs to be encoded to train the algorithm.
  • the hash encoding method is adopted for the spliced feature keys of multiple cycles. Compared with the statistical encoding of the full amount of features, the performance of the embodiment of the present application is significantly improved.
  • the embodiments of the present application can be applied to various machine learning modeling scenarios.
  • the offline configuration algorithm corresponds to samples, features, the number of cycles to be spliced, etc., and completes sample, feature splicing, and coding to generate algorithm training data (training samples).
  • Modeling by application scenario the features used by algorithms in the same application scenario have a high degree of coincidence, and the first-level cache can be performed for the scene dimension, and the algorithm set in the scenario can share the aggregated feature table of the first-level cache.
  • the second-level cache is set for the internal splicing of the algorithm, such as the daily update model. If the feature does not change, it is only necessary to incrementally splicing a new cycle to realize the routine splicing calculation.
  • FIG. 7 is a schematic flowchart of offline feature splicing provided by an embodiment of the present application.
  • feature splicing There are three elements of feature splicing: algorithm, sample, and feature.
  • algorithm Based on the feature used by the algorithm, the sample and feature are spliced (that is, in the database). join operation), join the sample with all feature tables, generate the splicing result, and input it to the algorithm for the training process. Due to the many parallel algorithms, a large number of join operations are required, and the performance is poor. First, the join operations on multiple tables of hundreds of millions are time-consuming, and each algorithm needs to be calculated once. For example, in scenario 1, algorithm 1 and algorithm 2 are actually large.
  • each algorithm needs to perform multi-table feature join. If k cycles are required, such an operation needs to be performed in the splicing of each algorithm. Repeating k times requires unbearable computing resources and high time consumption.
  • FIG. 8 is a schematic flowchart of the first-level cache provided by an embodiment of the present application.
  • the same modeling application scenario has the following characteristics: there are many algorithms, and the features used by the algorithm have a high degree of coincidence.
  • the feature processing logic is placed in the pre-common module.
  • Pre-aggregating features based on application scenarios can undoubtedly greatly improve the splicing performance of subsequent algorithms. Multiple algorithms can reuse aggregation feature tables to reduce the time-consuming of multi-table joins.
  • the application scenario often involves adding, deleting, and modifying features. For a new aggregation cycle, new features can be added to the cache, but the features of the historical cycle cannot be recalculated because the online tasks are already using the cache. The stakes are high.
  • the embodiments of the present application propose the following solutions:
  • the cache of the aggregated historical period remains unchanged to ensure that the cache used for online splicing is not affected by recalculation.
  • the deleted feature processing is relatively simple. If the original feature is offline, the historical period in the aggregated feature table remains unchanged, and the deleted feature aggregated by the new period will not appear in the splicing result. For new features and modified features, because the splicing features of the historical cache have been generated and cannot be recalculated, add accompanying tasks, and add new and modified features to the accompanying task-configuration meta information. The calculation and storage of these features are related to The main process is isolated, and the computing state is maintained through the configuration metadata database.
  • the main task is A
  • the accompanying task is B
  • the metadata table of the main task is meta_A
  • the metadata table of the accompanying task is meta_B
  • feature_3 is the newly added feature
  • feature_4 is the modified feature
  • the A task aggregation feature table is merge_A
  • the B task aggregation feature table is merge_B.
  • meta_A and meta_B When the specific algorithm performs the splicing calculation, it will load meta_A and meta_B at the same time, first obtain new features feature_3 and modified features feature_4 from meta_B, load features from merge_B, and then take features feature_1 and feature_2 from meta_A and merge_A to ensure historical cycle features It can be obtained from the right place, and the two sets of task calculation and storage are isolated, which does not affect the execution of online tasks.
  • Isolate storage and computing maintain two sets of mechanisms, and deal with issues such as feature additions, deletions, and changes, and historical cache cycle data consistency.
  • Feature pre-aggregation is the pre-process of subsequent computing tasks. If the model day is updated, the aggregation task also requires high performance. In the application scenario, the original features are often divided into: month, week, and day. If only full aggregation is used, the task needs to be calculated routinely every day to aggregate all features. The aggregated features are distributed in multiple different wide tables. Multi-table join operations are also required, and the tables are all in the order of 1 billion+. The feature quantity is more than 10,000 features according to the scene dimension. Dozens of external tables need to be spliced together, and the storage has reached several 10T. The total amount of calculation every day is a waste of resources. and time consuming is unacceptable.
  • FIG. 9 is a schematic flowchart of the periodic feature aggregation provided by the embodiment of the present application.
  • the embodiment of the present application divides the original features according to the update cycle, and each day, week, and month have a set of aggregation tasks.
  • the monthly feature table only needs to be calculated once a month
  • the weekly feature table only needs to be calculated once a week, so as to aggregate the features of the corresponding period respectively, and then start the day-level merge task.
  • the monthly, weekly, and daily periodic feature tables are used to compare the performance of full aggregation and periodic aggregation, and the performance is improved by more than 15 times, which is a good solution to the feature aggregation performance problem in practical application scenarios.
  • Pre-aggregation optimization has a significant effect on performance improvement. Since the features used by all algorithms in a scene are aggregated, the storage of aggregated features will be particularly large. The feature tables have more than 1 billion records and more than 1,000 features, and more than 50 terabytes of aggregate tables are generated every day. Therefore, the loading time of subsequent computing links is extremely high and requires a lot of computer resources. Through analysis, it is found in the embodiment of this application that the storage is mainly due to the high proportion of feature keys. In order to make the feature description easier to understand, the named feature keys are often very long. In this regard, the character key is replaced by the method of hierarchical shaping and encoding of the feature key. String, use (pkey, skey, value) to identify a feature.
  • pkey can be assigned a coding interval
  • skey corresponds to pkey
  • the feature interval of each skey is assigned codes from 0, and does not need to be uniformly allocated.
  • FIG. 10 is a schematic flowchart of the splicing history feature provided by the embodiment of the present application. Quantitative splicing, the historical cycle features use the spliced cache data, merge with the new cycle features, and then perform unified coding to generate the feature format for algorithm training. The performance improvement of this method is obvious.
  • the result of splicing historical cycles and features (splicing cache features) is cached in plaintext pairs, and the new cycle directly loads the cache without re-executing the calculation for each cycle.
  • the general model iterative features are gradually increased. For example, 300 features have been selected as the baseline. In the iterative optimization, it is often necessary to gradually analyze the selected features, and then add the model.
  • the stitching feature of the history cache also does not have this new feature. In this regard, the cached data of the historical period is reused. For the newly added features, only the splicing of the newly added features is calculated in each cycle, merged with the splicing of the cache, and then the cache is reloaded. times increase.
  • the spliced feature keys need to be converted into the format of algorithm training, statistical methods can be used, that is, the full feature keys (such as reading interest -> entertainment, sports, Category features such as gossip, discrete features such as age, continuous features such as consumption) are counted, and a globally unique index is assigned to each feature key.
  • the statistical encoding is replaced by a hash method.
  • the hash function is essentially a mapping relationship
  • the full encoding time of O(n) complexity can be optimized to O(1).
  • the embodiment of the present application uses a hash function with uniform distribution and low conflict rate to generate this mapping relationship, which essentially also improves the coding calculation performance while ensuring a low conflict rate.
  • the features include category features and continuous features.
  • the discrete type of category features is expressed as ⁇ feature dimension, category ID, feature value ⁇ , such as ⁇ reading interest, entertainment, 0.3 ⁇ ; continuous features are expressed as ⁇ dimension, feature value ⁇ , such as ⁇ ctr, 0.1 ⁇ .
  • the embodiment of the present application adopts the structure of [primary key, secondary key, feature value] (ie [pkey, skey, value]) to express features, and each feature is expressed by a 2-level index , such as the above category features ⁇ reading interest, entertainment, 0.3 ⁇ can use [(pkey, reading interest), (skey, entertainment), (value: 0.3)], continuous features use [(pkey, ctr), (skey, * ), (value: 0.3)] to express.
  • a 2-level index such as the above category features ⁇ reading interest, entertainment, 0.3 ⁇ can use [(pkey, reading interest), (skey, entertainment), (value: 0.3)]
  • continuous features use [(pkey, ctr), (skey, * ), (value: 0.3)] to express.
  • FIG. 9 is a schematic flowchart of a 2-level index provided by an embodiment of the present application.
  • This embodiment of the present application uses feature data in actual services for testing.
  • the mapping interval of the primary key is 5-10 times the number of dimensions of the primary key
  • the mapping interval of the secondary key is 2-5 times the maximum number of secondary keys, which can ensure a lower conflict rate.
  • the splicing feature encoding performance optimization scheme of the embodiment of the present application has very high efficiency (calculation conversion), can eliminate this stage from becoming a system bottleneck (can handle 100 million dimension features), and is simple to maintain (does not require Key-ID mapping data, especially online services). avoid data outbound, state management, etc.).
  • the high-performance splicing method proposed in the embodiments of the present application can solve the problem of splicing performance of large-scale samples and features.
  • Effective multiplexing of common modules through multi-level caching, incremental splicing, etc. avoids the need for each algorithm to join samples with all feature tables; for aggregated features, the feature key encoding is used to effectively compress and store, reducing the loading overhead of the calculation link after the aggregated features ;
  • Hash the spliced features to generate an index to avoid the high time-consuming of full aggregation calculations, improve the overall performance by more than 5 times, and greatly reduce the computing resources of multi-algorithm experiments in the splicing link.
  • each functional module in the apparatus for constructing a recommendation model may be composed of hardware resources of electronic devices (such as terminal devices, servers, or server clusters), such as computing resources such as processors. , communication resources (for example, to support the realization of communication in various ways such as optical cable and cellular), and memory cooperative realization.
  • FIG. 1 A block diagram illustrating an exemplary application and implementation of the server provided by the embodiment of the present application.
  • the embodiment of the present application also provides an apparatus for constructing a recommendation model.
  • each functional module in the apparatus for constructing a recommendation model may be composed of hardware resources of electronic devices (such as terminal devices, servers, or server clusters), such as computing resources such as processors. , communication resources (for example, to support the realization of communication in various ways such as optical cable and cellular), and memory cooperative realization.
  • FIG. 2 shows a device 555 for constructing a recommendation model stored in the memory 550, which can be software in the form of programs and plug-ins, for example, software modules designed in programming languages such as software C/C++, Java, C/C++, Java, etc.
  • Example 1 The device for building a recommendation model is a mobile application and module
  • the device 555 for constructing a recommendation model in this embodiment of the present application can be provided as a software module designed using a programming language such as software C/C++, Java, etc., and embedded in various mobile terminal applications based on systems such as Android or iOS (with executable instructions).
  • Stored in the storage medium of the mobile terminal and executed by the processor of the mobile terminal so as to directly use the computing resources of the mobile terminal to complete the task of building the relevant recommendation model, and periodically or irregularly transmit the processing results through various network communication methods Give it to a remote server, or save it locally on the mobile terminal.
  • Example 2 The device for building a recommendation model is a server application and a platform
  • the device 555 for constructing a recommendation model in this embodiment of the present application may be provided as application software designed using programming languages such as C/C++, Java, or a dedicated software module in a large-scale software system, running on the server side (in the form of executable instructions in the It is stored in the storage medium on the server side and run by the processor on the server side), and the server uses its own computing resources to complete related information recommendation tasks.
  • application software designed using programming languages such as C/C++, Java, or a dedicated software module in a large-scale software system, running on the server side (in the form of executable instructions in the It is stored in the storage medium on the server side and run by the processor on the server side), and the server uses its own computing resources to complete related information recommendation tasks.
  • the embodiments of the present application can also be provided as a distributed and parallel computing platform composed of multiple servers, equipped with a customized, easy-to-interact web (Web) interface or other user interfaces (UI, User Interface) to form a user interface for personal, Information recommendation platforms used by groups or units, etc.
  • Web easy-to-interact web
  • UI User Interface
  • Example 3 The devices for constructing the recommendation model are server-side Application Program Interface (API, Application Program Interface) and plug-ins
  • the device 555 for constructing a recommendation model in this embodiment of the present application may be provided as a server-side API or plug-in for the user to invoke to execute the method for constructing a recommendation model in this embodiment of the present application, and be embedded in various application programs.
  • Example 4 The device for constructing the recommendation model is the mobile device client API and plug-in
  • the apparatus 555 for constructing a recommendation model in the embodiment of the present application may be provided as an API or a plug-in on the mobile device side for the user to call, so as to execute the method for constructing a recommendation model in the embodiment of the present application.
  • Example 5 The device for constructing the recommendation model is an open cloud service
  • the apparatus 555 for constructing a recommendation model in this embodiment of the present application may provide a cloud service constructed for a recommendation model developed for a user, so that an individual, a group, or an organization can obtain a recommendation list.
  • the apparatus 555 for constructing a recommendation model includes a series of modules, including a first aggregation module 5551 , a first splicing module 5552 and a first training module 5553 .
  • the following continues to describe the solution for implementing the recommendation model building by cooperation of each module in the apparatus 555 for building a recommendation model provided by the embodiment of the present application.
  • the first aggregation module 5551 is configured to perform aggregation processing on multiple feature tables corresponding to each application scenario in the recommended item, and send the obtained aggregated feature table to the cache space; wherein the recommended item includes multiple items related to the item to be recommended.
  • the recommended item includes multiple items related to the item to be recommended.
  • the first splicing module 5552 is configured to be based on the user ID and item ID included in the sample data table, from the described
  • the corresponding user features and item features are read from the aggregation feature table in the cache space, and spliced with the sample data table to form a training sample set;
  • the first training module 5553 is configured to train based on the training sample set In the recommendation model of the application scenario, the trained recommendation model can fit the user features and item features in the training sample set.
  • the first aggregation module 5551 is further configured to perform the following processing for each application scenario in the recommended item: perform aggregation processing and deduplication on at least part of the features of the multiple feature tables corresponding to the application scenario processing, to obtain an aggregated feature table of the application scenario; combining and processing each feature identifier in the aggregated feature table to obtain a feature metadata table of the application scenario.
  • the first aggregation module 5551 is further configured to perform aggregation processing on all the features of the multiple feature tables corresponding to the application scenario.
  • the first aggregation module 5551 is further configured to determine, from multiple feature tables corresponding to each application scenario in the recommended item, common to multiple training algorithms used to train the recommendation model of the application scenario features; perform aggregation processing on the common features to obtain the aggregated feature table of the application scenario.
  • the first splicing module 5552 is further configured to, when the corresponding user feature or item feature is not read from the aggregated feature table in the cache space, select the corresponding user feature or item feature from the application scenario.
  • the corresponding user features or item features are read from each feature table, and spliced with the sample data table to form a training sample set.
  • the aggregation module 5551 is further configured to perform a splicing process on the newly added feature table and the aggregated feature table of the application scenario to obtain a new aggregation feature table; incrementally update the cache space based on the new aggregated feature table.
  • the aggregation module 5551 is further configured to, when each new period of each application scenario arrives, perform aggregation processing on the newly added feature table corresponding to the new period to obtain the aggregation of the new period feature table; perform splicing processing on the aggregated feature table of each new period to obtain the aggregated feature table of the application scenario.
  • the aggregation module 5551 is further configured to incrementally update the cache space based on the aggregation feature table of the new cycle; read the history corresponding to the new cycle in the cache space
  • the aggregation characteristic table of the period is used as the aggregation characteristic table of the new period; wherein, the historical period corresponding to the new period is the period before the new period.
  • the splicing module 5552 is further configured to perform the following processing when each new cycle of each application scenario arrives: based on the user identification and item identification included in the sample data table of the new cycle, from the Read the corresponding user features and item features from the aggregation feature table of the new cycle of the cache space; splicing the user features, the item features and the sample data table of the new cycle to obtain the cache features of the new cycle .
  • the splicing module 5552 is further configured to incrementally update the cache space based on the cache characteristics of the new period; read the historical period corresponding to the new period from the cache space Multiple cache features; perform splicing processing on multiple cache features of the historical period and the cache features of the new period to obtain a training sample set of the application scenario; wherein, the historical period is before the new period cycle.
  • the aggregating module 5551 is further configured to perform mapping processing on the feature identifiers in the obtained aggregated feature table to obtain the shaping value of the feature identifiers; and update the feature identifiers in the aggregated feature table to the specified value.
  • the shaping value is obtained to obtain a compressed aggregate feature table; and the compressed aggregate feature table is sent to the cache space.
  • the training module 5553 is further configured to perform primary key encoding on the feature identifier of each training sample in the training sample set to obtain the primary key encoding value of the feature identifier; Perform auxiliary key encoding on the feature identifier to obtain the auxiliary key encoding value of the feature identifier; splicing the primary key encoding value and the auxiliary key encoding value to obtain the index encoding value of the feature identifier;
  • the feature identifier is updated to the index code value, and an updated training sample is obtained, so as to train the recommendation model of the application scenario based on the updated training sample.
  • the apparatus for constructing a neural network model includes a series of modules, including a second aggregation module, a second splicing module and a second training module.
  • a second aggregation module including a second aggregation module, a second splicing module and a second training module.
  • the second aggregation module is configured to perform aggregation processing on multiple feature tables corresponding to each application scenario in the application project, and send the obtained aggregated feature table to the cache space; wherein, the application project includes a plurality of application indicators one by one.
  • the neural network model of each application scenario is used to predict the corresponding application index; the second splicing module is configured to be based on the feature identifier included in the sample data table, from the aggregated feature table of the cache space.
  • the corresponding features are read in the data sheet, and spliced with the sample data table to form a training sample set; the second training module is configured to train the neural network model of the application scenario based on the training sample set; wherein, after training The neural network model can fit the features in the training sample set.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method for constructing a recommendation model or the method for constructing a neural network model described in the embodiments of the present application.
  • the embodiments of the present application provide a computer-readable storage medium storing executable instructions, wherein the executable instructions are stored, and when the executable instructions are executed by the processor, the processor will cause the processor to execute the building recommendation model provided by the embodiments of the present application.
  • method or a method for building a neural network model for example, the method for building a neural network model as shown in Figures 3-4, and the method for building a recommendation model as shown in Figure 6.
  • the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; it may also include one or any combination of the foregoing memories Various equipment.
  • executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and which Deployment may be in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • executable instructions may, but do not necessarily correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, a Hyper Text Markup Language (HTML, Hyper Text Markup Language) document
  • HTML Hyper Text Markup Language
  • One or more scripts in stored in a single file dedicated to the program in question, or in multiple cooperating files (eg, files that store one or more modules, subroutines, or code sections).
  • executable instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site, or alternatively, distributed across multiple sites and interconnected by a communication network execute on.


Abstract

A method for constructing a recommendation model, a method for constructing a neural network model, an apparatus, an electronic device, and a computer-readable storage medium. The method includes: aggregating multiple feature tables corresponding to each application scenario in an application project, and sending the resulting aggregated feature table to a cache space (101); based on feature identifiers included in a sample data table, reading the corresponding features from the aggregated feature table in the cache space, and splicing them with the sample data table to form a training sample set (102); and training the neural network model of the application scenario based on the training sample set, so that the neural network model can fit the features in the training sample set (103).

Description

Method for constructing a recommendation model, method for constructing a neural network model, apparatus, electronic device, and storage medium

Cross-Reference to Related Application

This application is based on, and claims priority to, Chinese patent application No. 202010919935.5 filed on September 4, 2020, the entire content of which is incorporated herein by reference.

Technical Field

This application relates to artificial intelligence technology, and in particular to a method for constructing a recommendation model, a method for constructing a neural network model, an apparatus, an electronic device, and a computer-readable storage medium.

Background

Artificial intelligence (AI) is a comprehensive branch of computer science that studies the design principles and implementation methods of intelligent machines, so that machines can perceive, reason, and make decisions. AI is an interdisciplinary field spanning many directions, such as natural language processing and machine learning/deep learning; as the technology develops, it will be applied in ever more fields and deliver ever greater value.

In the related art, neural network models such as recommendation models are built on top of AI. In an information-overloaded environment, a recommendation model helps users discover information that may interest them and pushes that information to the users interested in it.

However, building a recommendation model in the related art consumes a large amount of computing resources, and the construction efficiency is low.
Summary

Embodiments of this application provide a method for constructing a recommendation model, a method for constructing a neural network model, an apparatus, an electronic device, and a computer-readable storage medium, which can pre-aggregate the multiple feature tables corresponding to an application scenario and thereby improve the efficiency of building a recommendation model.

The technical solutions of the embodiments of this application are implemented as follows.

An embodiment of this application provides a method for constructing a recommendation model, including:

aggregating multiple feature tables corresponding to each application scenario in a recommendation project, and sending the resulting aggregated feature table to a cache space;

wherein the recommendation project includes multiple application scenarios in one-to-one correspondence with multiple recommendation metrics of an item to be recommended, and the recommendation model of each application scenario is used to predict the corresponding recommendation metric;

based on user identifiers and item identifiers included in a sample data table, reading the corresponding user features and item features from the aggregated feature table in the cache space, and splicing them with the sample data table to form a training sample set; and

training the recommendation model of the application scenario based on the training sample set;

wherein the trained recommendation model can fit the user features and item features in the training sample set.

An embodiment of this application provides a method for constructing a neural network model, including:

aggregating multiple feature tables corresponding to each application scenario in an application project, and sending the resulting aggregated feature table to a cache space;

wherein the application project includes multiple application scenarios in one-to-one correspondence with multiple application metrics, and the neural network model of each application scenario is used to predict the corresponding application metric;

based on feature identifiers included in a sample data table, reading the corresponding features from the aggregated feature table in the cache space, and splicing them with the sample data table to form a training sample set; and

training the neural network model of the application scenario based on the training sample set;

wherein the trained neural network model can fit the features in the training sample set.

An embodiment of this application provides an apparatus for constructing a recommendation model, including:

a first aggregation module, configured to aggregate multiple feature tables corresponding to each application scenario in a recommendation project and send the resulting aggregated feature table to a cache space, wherein the recommendation project includes multiple application scenarios in one-to-one correspondence with multiple recommendation metrics of an item to be recommended, and the recommendation model of each application scenario is used to predict the corresponding recommendation metric;

a first splicing module, configured to read, based on user identifiers and item identifiers included in a sample data table, the corresponding user features and item features from the aggregated feature table in the cache space, and splice them with the sample data table to form a training sample set; and

a first training module, configured to train the recommendation model of the application scenario based on the training sample set, wherein the trained recommendation model can fit the user features and item features in the training sample set.

An embodiment of this application provides an apparatus for constructing a neural network model, including:

a second aggregation module, configured to aggregate multiple feature tables corresponding to each application scenario in an application project and send the resulting aggregated feature table to a cache space, wherein the application project includes multiple application scenarios in one-to-one correspondence with multiple application metrics, and the neural network model of each application scenario is used to predict the corresponding application metric;

a second splicing module, configured to read, based on feature identifiers included in a sample data table, the corresponding features from the aggregated feature table in the cache space, and splice them with the sample data table to form a training sample set; and

a second training module, configured to train the neural network model of the application scenario based on the training sample set, wherein the trained neural network model can fit the features in the training sample set.

An embodiment of this application provides an electronic device for recommendation model construction, the electronic device including:

a memory, configured to store executable instructions; and

a processor, configured to implement, when executing the executable instructions stored in the memory, the method for constructing a recommendation model provided by the embodiments of this application.

An embodiment of this application provides an electronic device for neural network model construction, the electronic device including:

a memory, configured to store executable instructions; and

a processor, configured to implement, when executing the executable instructions stored in the memory, the method for constructing a neural network model provided by the embodiments of this application.

An embodiment of this application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method for constructing a recommendation model provided by the embodiments of this application.

An embodiment of this application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method for constructing a neural network model provided by the embodiments of this application.

The embodiments of this application have the following beneficial effects:

By aggregating the multiple feature tables corresponding to each application scenario and sending the resulting aggregated feature table to a cache space, the aggregated feature table can be reused when training the neural network model of the application scenario, reducing wasted computing resources and improving the efficiency of model construction.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of an application scenario of the recommendation system provided by an embodiment of this application;

FIG. 2 is a schematic structural diagram of the electronic device for recommendation model construction provided by an embodiment of this application;

FIGS. 3-4 are schematic flowcharts of the method for constructing a neural network model provided by embodiments of this application;

FIG. 5 is a schematic flowchart of splicing cache features provided by an embodiment of this application;

FIG. 6 is a schematic flowchart of the method for constructing a recommendation model provided by an embodiment of this application;

FIG. 7 is a schematic flowchart of offline feature splicing provided by an embodiment of this application;

FIG. 8 is a schematic flowchart of the first-level cache provided by an embodiment of this application;

FIG. 9 is a schematic flowchart of per-period feature aggregation provided by an embodiment of this application;

FIG. 10 is a schematic flowchart of splicing historical features provided by an embodiment of this application;

FIG. 11 is a schematic flowchart of the 2-level index provided by an embodiment of this application.
Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings. The described embodiments should not be regarded as limiting this application; all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.

In the following description, the terms "first/second" merely distinguish similar objects and do not imply a particular ordering of those objects. It can be understood that, where permitted, "first/second" may be interchanged in a specific order or sequence, so that the embodiments described here can be implemented in orders other than those illustrated or described.

Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of this application. The terms used herein are only for describing the embodiments and are not intended to limit this application.

Before the embodiments are described in further detail, the nouns and terms involved in the embodiments are explained; the following interpretations apply to them.

1) Feature splicing: the preparatory stage of the machine learning modeling pipeline, covering raw features and samples. The features are scattered across multiple storage locations and are not yet associated with the samples, so they cannot be fed directly into algorithm training; samples and features must be spliced together before model training.

2) Storage compression: feature keys (feature identifiers) in business scenarios are usually stored as strings, and for readability these strings tend to be long; most modeling scenarios use an enormous number of features, so keeping them as strings throughout splicing incurs heavy storage overhead. The feature keys are therefore integer-mapped with a hash algorithm, producing integer values that occupy less storage space.

3) Feature index hash: after features (from feature tables) are spliced with samples (from the sample data table), the feature keys must be encoded into indexes that can be fed directly to the model for training. To speed up index generation, hashing is used to produce hash values with O(1) time complexity, which greatly improves performance.

4) Multi-period cache: algorithm training usually needs samples spliced with features of multiple periods, to enrich the sample size and obtain a better model. If every multi-period splice is done from scratch, performance is very poor; cache optimization is needed, i.e., the already-spliced historical periods can be cached for subsequent incremental splicing.

5) Sample: in the embodiments of this application, a sample denotes raw identifier data; for example, a sample includes identifier data such as a user identifier, an item identifier, a label, and a weight. Correspondingly, a feature denotes entity data associated with a sample; for example, features include user profile features, item click features, and text statistics features.

6) Data cleaning: processing the supplied raw data to facilitate subsequent feature extraction. Data cleaning includes data splicing: since the supplied data is scattered across multiple files, it must be spliced according to the corresponding key values.
The neural network models described in the embodiments of this application can be applied in various fields, for example image recognition networks and text recommendation networks; that is, the neural network model of these embodiments is not limited to any particular field.

In the related art, feature splicing is required before a model is trained. Feature splicing is an important module of machine learning modeling: industrial modeling scenarios usually involve huge sample and feature scales and require many algorithms to run in parallel, so splicing performance strongly affects modeling efficiency. Feature splicing techniques include online real-time splicing and offline splicing.

Online real-time splicing generates labels from real-time user behavior. For example, in a click-through-rate scenario, clicked samples usually serve as positive samples and exposed-but-unclicked samples as negative samples, which are then associated with features to generate the input needed for algorithm training. However, online real-time splicing is unsuitable for scenarios where sample labels cannot be determined in real time, and it does not support multi-period historical cache reuse, new scenarios, algorithm warm start, or historical data splicing.

Offline splicing does support generating samples from historical data analysis. However, modeling scenarios require many algorithms running in parallel, with models updated at day granularity; each sample splice needs more than 90 historical periods, and samples and features are both at the scale of hundreds of millions, so the performance requirements are very high. Offline splicing cannot meet such large-scale splicing needs, does not suit multi-algorithm parallelism, and cannot reuse caches during feature splicing.

To solve the above problems, the embodiments of this application provide a method for constructing a recommendation model, a method for constructing a neural network model, an apparatus, an electronic device, and a computer-readable storage medium that pre-aggregate the multiple feature tables corresponding to an application scenario, so that training algorithms can reuse the aggregated feature table when training the scenario's neural network model, reducing wasted computing resources and improving the efficiency of construction.

The method for constructing a neural network model provided by the embodiments can be implemented by a terminal or a server alone, or by a terminal and a server cooperatively. For example, the terminal alone performs the method described below; or the terminal sends a training request for the neural network model to the server, the server executes the method according to the received request and sends the generated neural network model back to the terminal, and the model is then used to predict the corresponding application metric.

The electronic device for neural network model construction provided by the embodiments may be any type of terminal device or server. The server may be an independent physical server, a server cluster or distributed system of multiple physical servers, or a cloud server providing cloud computing services; the terminal may be, without limitation, a smartphone, tablet, laptop, desktop computer, smart speaker, or smart watch. The terminal and the server may be connected directly or indirectly by wired or wireless communication, which this application does not restrict.

Taking a server as an example, it may be a server cluster deployed in the cloud that opens AI as a Service (AIaaS) to users: the AIaaS platform splits several categories of common AI services and provides them independently or in bundles in the cloud, a service model resembling an AI-themed store in which every user can access one or more of the platform's AI services through application programming interfaces.

For example, one such AI cloud service can be a neural-network-model construction service, i.e., the cloud server wraps the neural-network-model construction program provided by the embodiments. A developer calls the service through a terminal (running a client, e.g., a configuration client), so that the server deployed in the cloud invokes the wrapped program, reads the corresponding features from the aggregated feature table in the cache space, splices them with the sample data table to form a training sample set, and trains the scenario's neural network model based on any training algorithm and the training sample set, thereby responding to the training request; the model is then used to predict the corresponding application metric, and may be an image neural network model, a text neural network model, and so on.

In one implementation scenario, to construct an image recognition neural network model, the server or terminal may aggregate the feature tables corresponding to each application scenario in an image recognition project and send the resulting aggregated table to the cache space, where the image project includes multiple application scenarios in one-to-one correspondence with multiple prediction metrics of the object to be recognized, and the image recognition network of each scenario predicts its corresponding metric. Based on the object identifiers included in the sample data table, the corresponding object features are read from the aggregated table in the cache and spliced with the sample data table to form a training sample set; the scenario's image recognition network is trained with any one of multiple training algorithms and the training sample set so that it can fit the object features in the set, the multiple algorithms all being used to train that scenario's network, which then predicts the corresponding metric.

For example, in a face recognition system calling the model-construction function provided by the embodiments, the image recognition project includes a frontal-face recognition scenario and a profile-face recognition scenario. For frontal faces, based on the frontal-face identifiers in the sample data table, the corresponding frontal-face features are read from the aggregated table in the cache and spliced with the sample data table to form a training sample set; the frontal-face recognition network of the frontal-face scenario is trained with any training algorithm and the set and then predicts the frontal-face metric, e.g., the probability that a frontal face belongs to a given user. The embodiments can combine the frontal-face and profile-face recognition networks to perform both recognitions on pedestrians passing through access control, improving recognition accuracy and strengthening the security of access control.

In another implementation scenario, to construct a text recommendation neural network model, the server or terminal may aggregate the feature tables corresponding to each application scenario in a text recommendation project and send the aggregated table to the cache space, where the text recommendation project includes multiple application scenarios in one-to-one correspondence with multiple recommendation metrics of the text to be recommended, and the text recommendation network of each scenario predicts its corresponding metric. Based on the user identifiers and text identifiers in the sample data table, the corresponding user features and text features are read from the aggregated table in the cache and spliced with the sample data table to form a training sample set; the scenario's text recommendation network is trained with any one of the multiple training algorithms and the set so that it can fit the features in the set, and then predicts the corresponding recommendation metric.

For example, in a news recommendation system calling the model-construction function of the embodiments, the text recommendation project includes a news click-through-rate prediction scenario and a news exposure-rate prediction scenario. For the click-through rate, based on the user identifiers and news identifiers in the sample data table, the corresponding user features and news features are read from the aggregated table in the cache and spliced with the sample data table to form a training sample set; the click-through-rate prediction model of that scenario is trained with any training algorithm and the set and then predicts the corresponding click-through rate. The embodiments can combine the click-through-rate and exposure-rate prediction models to predict both metrics of a news item and decide, based on them, whether to recommend it, improving recommendation accuracy and recommending news that better matches the user's interests.
The recommendation model is described in detail below. Referring to FIG. 1, a schematic diagram of an application scenario of the recommendation system 10 provided by an embodiment of this application, the terminal 200 connects to the server 100 through a network 300, which may be a wide-area network, a local-area network, or a combination of the two.

The terminal 200 (running a client, e.g., a configuration client) can be used to obtain training requests for the recommendation model; for example, after the user enters the feature tables corresponding to an application scenario in the client, the terminal automatically obtains a training request for the model.

In some embodiments, a recommendation-model construction plugin can be embedded in the client running on the terminal to implement the construction method locally in the client. For example, after obtaining a training request, the terminal 200 calls the plugin to execute the method: it reads the corresponding user features and item features from the aggregated feature table in the cache space, splices them with the sample data table to form a training sample set, and trains the scenario's recommendation model with any training algorithm and the set in response to the request. The model subsequently predicts the corresponding recommendation metric; for example, the model predicts the exposure rate of a product, and whether to recommend the product is decided based on that rate, helping users discover products that may interest them.

In some embodiments, after obtaining a training request, the terminal 200 calls the recommendation-model construction interface of the server 100 (which can be provided as a cloud service, i.e., a recommendation-model construction service); the server 100 reads the corresponding user features and item features from the aggregated feature table in the cache space, splices them with the sample data table to form a training sample set, and trains the scenario's recommendation model with any training algorithm and the set in response to the training request.

In some embodiments, the terminal or server can implement the method provided by the embodiments by running a computer program. For example, the computer program may be a native program or software module in an operating system; a native application (APP), i.e., a program that must be installed in the operating system to run; a mini-program, i.e., a program that runs simply by being downloaded into a browser environment; or a mini-program embeddable in any APP. In short, the computer program may be an application, module, or plugin of any form.

The structure of the electronic device for recommendation model construction provided by the embodiments is described next. Referring to FIG. 2, a schematic structural diagram of the electronic device 500 (a server is taken as the example), the device 500 includes at least one processor 510, a memory 550, and at least one network interface 520. The components of the device 500 are coupled together by a bus system 530. It can be understood that the bus system 530 implements connection and communication among these components; besides a data bus, it includes a power bus, a control bus, and a status signal bus, but for clarity all buses are labeled as the bus system 530 in FIG. 2.

The processor 510 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.

The memory 550 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be a read-only memory (ROM) and the volatile memory a random access memory (RAM). The memory 550 described here is intended to include any suitable type of memory, and optionally includes one or more storage devices physically remote from the processor 510.

In some embodiments, the memory 550 can store data to support various operations; examples of such data include programs, modules, and data structures, or subsets or supersets thereof, as illustrated below.

An operating system 551 includes system programs for handling basic system services and performing hardware-related tasks, e.g., a framework layer, a core library layer, and a driver layer, used to implement basic services and process hardware-based tasks.

A network communication module 553 reaches other computing devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.

In some embodiments, the apparatus for constructing a recommendation model provided by the embodiments can be implemented in software: for example, as the recommendation-model construction plugin in the terminal described above, or the recommendation-model construction service in the server described above. It is of course not limited to these; it can be provided as various software embodiments, in forms including applications, software, software modules, scripts, and code.

FIG. 2 shows an apparatus 555 for constructing a recommendation model stored in the memory 550, which may be software in the form of programs and plugins, e.g., the recommendation-model construction plugin, and includes a series of modules: a first aggregation module 5551, a first splicing module 5552, and a first training module 5553, which are used to implement the recommendation-model construction function provided by the embodiments.
As noted above, the method for constructing a neural network model provided by the embodiments can be implemented by various types of electronic devices, e.g., a server. Referring to FIG. 3, a schematic flowchart of the method, the description follows the steps shown in FIG. 3.

It should be noted that the neural network models described here apply to various fields, e.g., image recognition networks and text recommendation networks, and are not limited to any particular field.

In the steps below, an application project denotes a concrete application task, e.g., face recognition or text recommendation; an application scenario denotes the scenario corresponding to a subtask of the project, e.g., the frontal-face and profile-face recognition scenarios under face recognition; a feature table includes multiple feature key-value pairs.

In step 101, the multiple feature tables corresponding to each application scenario in the application project are aggregated, and the resulting aggregated feature table is sent to the cache space.

The application project includes multiple application scenarios in one-to-one correspondence with multiple application metrics, and the neural network model of each scenario is used to predict the corresponding metric.

For example, the central processing unit (CPU) of the server fetches in advance, from the file system, the feature tables corresponding to each application scenario of the project and pre-aggregates them into the scenario's aggregated feature table, which it sends to the cache space (a high-speed memory, a database, etc.) so the table can be reused later, avoiding re-splicing the feature tables for every training run and reducing computational complexity.
In some embodiments, aggregating the feature tables corresponding to each application scenario in the project includes performing the following for each scenario of the application project: aggregating and deduplicating at least some features of the scenario's feature tables to obtain the scenario's aggregated feature table.

For example, since the aggregated feature table is a wide table, the feature identifiers in it can be combined into the scenario's feature metadata table so that the features to be spliced can be located quickly later. The metadata table records information such as each feature's identifier, its source (which feature table it comes from), and its position in the aggregated table. A feature can then first be looked up in the metadata table; a hit means the feature exists in the aggregated table, from which it is then read.
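As an illustration of this first-level pre-aggregation, the following is a minimal sketch, assuming pandas-style feature tables keyed by an entity column named "id"; the table names, column names, and metadata layout are illustrative assumptions rather than definitions from this application.

```python
import pandas as pd

def aggregate_feature_tables(tables: dict[str, pd.DataFrame]):
    """Outer-join every feature table of one application scenario into a
    single wide (aggregated) table, dropping duplicated feature columns,
    and build a feature metadata table recording each feature's source
    table and position in the wide table."""
    merged = None
    meta_rows = []
    for table_name, df in tables.items():
        # Deduplication: keep a feature column only the first time it appears.
        keep = [c for c in df.columns
                if c == "id" or merged is None or c not in merged.columns]
        part = df[keep]
        merged = part if merged is None else merged.merge(part, on="id", how="outer")
        for col in keep:
            if col != "id":
                meta_rows.append({"feature_id": col,
                                  "source_table": table_name,
                                  "position": merged.columns.get_loc(col)})
    return merged, pd.DataFrame(meta_rows)

users = pd.DataFrame({"id": [1, 2], "age": [25, 32]})
clicks = pd.DataFrame({"id": [1, 2], "item_ctr": [0.12, 0.07], "age": [25, 32]})
wide, meta = aggregate_feature_tables({"user_profile": users, "item_click": clicks})
print(wide)   # columns: id, age, item_ctr (the duplicated "age" is dropped)
print(meta)   # feature_id -> source_table, position in the wide table
```

Deduplicating before each join keeps a feature column only the first time it appears, which is what lets several source tables share overlapping features without bloating the wide table.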
In some embodiments, aggregating at least some features of the scenario's feature tables includes aggregating all features of those tables.

For example, the CPU reads all features of the multiple feature tables and aggregates them into a table containing every feature. As long as no feature table is added, all subsequent splicing operations can read features directly from the aggregated table; that is, the aggregated table misses no feature of any feature table.

In some embodiments, aggregating at least some features of the scenario's feature tables includes: determining, from the feature tables corresponding to each scenario of the project, the features shared by the multiple training algorithms used to train the scenario's model, and aggregating those shared features to obtain the scenario's aggregated feature table.

For example, since each training algorithm uses different features, the features shared by multiple algorithms, i.e., the frequently used ones, can be pre-aggregated into the scenario's aggregated table. This raises the read frequency of the features in the aggregated table and shrinks the table itself, so its features can be read quickly later.
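A small sketch of that shared-feature selection follows, assuming each training algorithm declares the feature identifiers it consumes; the threshold and the names are illustrative assumptions.

```python
from collections import Counter

def shared_features(algo_features: dict[str, set[str]], min_algorithms: int = 2) -> set[str]:
    """Keep only the features used by at least `min_algorithms` algorithms;
    these are the ones worth pre-aggregating for the whole scenario."""
    counts = Counter(f for feats in algo_features.values() for f in feats)
    return {f for f, n in counts.items() if n >= min_algorithms}

algos = {"algo_1": {"age", "item_ctr", "read_interest"},
         "algo_2": {"age", "item_ctr"},
         "algo_3": {"age", "consume_cnt"}}
print(shared_features(algos))  # {'age', 'item_ctr'}: aggregated once, reused by all
```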
In some embodiments, when a new feature table exists for the application scenario, the new table and the scenario's aggregated feature table are spliced into a new aggregated feature table, and the cache space is incrementally updated based on it.

For example, when the CPU detects a new feature table in the file system, it can read the scenario's aggregated table from the cache space and the new table from the file system, splice them into a new aggregated table, and refresh the cache with it, replacing the old aggregated table.

In some embodiments, aggregating the feature tables of each scenario includes: when each new period of each scenario arrives, aggregating the new feature tables corresponding to the new period into the period's aggregated feature table, and splicing the per-period aggregated tables into the scenario's aggregated feature table.

For example, the feature tables in each scenario have corresponding update periods, such as monthly, weekly, and daily. When a new period of a scenario arrives, the feature tables corresponding to that period are aggregated into the period's aggregated table: each month, the monthly-period tables are aggregated into the month's aggregated table; each week, the weekly-period tables into the week's table; each day, the daily-period tables into the day's table. The per-period aggregated tables are then spliced into the scenario's aggregated feature table.

It should be noted that the raw features are partitioned by update period: day, week, and month each have their own set of aggregation tasks. The monthly feature table only needs computing once a month and the weekly table once a week; the features of each period are aggregated separately, and then a day-level merge task is started. There is thus no need to aggregate the features of all feature tables; only the aggregated features of the monthly, weekly, and daily result tables (the monthly, weekly, and daily aggregated feature tables) are merged, and computing in this way greatly improves performance. For example, aggregating per period yields June's aggregated table, week 1's aggregated table, and June 1st's aggregated table; the day-level merge then combines these three into the scenario's aggregated feature table, avoiding re-aggregating every feature table of the scenario each day and saving computing resources.
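As a sketch of that day-level merge, under assumed table shapes and names, the three period-level outputs are simply joined rather than recomputed from the raw feature tables:

```python
import pandas as pd

def merge_periodic_aggregates(monthly: pd.DataFrame,
                              weekly: pd.DataFrame,
                              daily: pd.DataFrame) -> pd.DataFrame:
    """Day-level merge task: joins the three period-level aggregated tables
    instead of re-aggregating every raw feature table each day."""
    out = monthly.merge(weekly, on="id", how="outer")
    return out.merge(daily, on="id", how="outer")

# The monthly/weekly tables are recomputed only when their period rolls
# over; the daily merge just reuses the cached period-level outputs.
monthly = pd.DataFrame({"id": [1, 2], "month_spend": [320.0, 18.5]})
weekly = pd.DataFrame({"id": [1, 2], "week_clicks": [14, 3]})
daily = pd.DataFrame({"id": [1], "day_ctr": [0.21]})
print(merge_periodic_aggregates(monthly, weekly, daily))
```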
In some embodiments, after the CPU obtains the new period's aggregated feature table, it incrementally updates the cache space based on it. When some new period of a scenario has not yet arrived, before the per-period aggregated tables are spliced, the aggregated table of the historical period corresponding to that new period is read from the cache and used as the new period's aggregated table, the historical period corresponding to the new period being a period before the new period.

The new period's aggregated table in turn serves as the historical-period aggregated table for the next new period. When some new period of a scenario has not arrived, e.g., for the monthly update period, when the month's update time has not been reached, the monthly aggregated table closest to this month (the historical period's aggregated table) is read from the cache and used as this month's aggregated table, so that the per-period aggregated tables can still be spliced into the scenario's aggregated feature table.

Referring to FIG. 4, a schematic flowchart of the method for constructing a neural network model, step 101 in FIG. 3 can be implemented through steps 1011 to 1013 shown in FIG. 4. In step 1011, the feature identifiers in the obtained aggregated feature table are mapped to obtain integer values of the identifiers; in step 1012, the identifiers in the aggregated table are updated to the integer values, yielding a compressed aggregated feature table; in step 1013, the compressed aggregated table is sent to the cache space.

For example, the aggregated feature table is a wide table whose feature identifiers are long for the sake of readability, so the table's storage footprint is large. To save storage, the identifiers in the table can be mapped to integer values, shortening the identifiers; updating the identifiers to the integer values yields a compressed aggregated table, which is sent to the cache space, reducing the table's occupancy of the cache and improving its read performance.
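A minimal sketch of this identifier compression follows, assuming a truncated cryptographic digest as the integer mapping; this application only requires that string keys become small integer values, so the specific hash used here is an assumption.

```python
import hashlib

def int_key(feature_key: str, bits: int = 32) -> int:
    """Map a long, human-readable feature key to a fixed-width integer value."""
    digest = hashlib.md5(feature_key.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")

row = {"user.profile.reading_interest.entertainment": 0.3,
       "item.stat.click_through_rate_last_30_days": 0.12}
compressed = {int_key(k): v for k, v in row.items()}
print(compressed)  # short integer keys replace the long string identifiers
```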
In step 102, based on the feature identifiers included in the sample data table, the corresponding features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set.

The sample data table includes multiple samples, each containing multiple feature identifiers. Before the splicing operation, the developer can enter the sample data table and the training algorithm on the terminal; the terminal automatically obtains a training request based on them and sends it to the server. On receiving the request, the server reads the corresponding features from the aggregated table in the cache based on the identifiers included in the sample data table and splices them with the table to form the training sample set, thereby reusing the aggregated table and performing the splicing of the sample data table quickly.

In some embodiments, when a corresponding feature is not read from the aggregated table in the cache space, it is read from the scenario's multiple feature tables and spliced with the sample data table to form the training sample set.

For example, the CPU preferentially reads the features corresponding to the sample data table from the feature table in the cache space; when they are not found there, it reads the scenario's feature tables from the file system and reads the corresponding features from them. This avoids reading features from the file system every time; reading features from the cache space greatly improves the feature read rate.
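The cache-first read path can be sketched as below, with dict-like cache and feature-table interfaces assumed purely for illustration; warming the cache on a miss is an added assumption, not something the embodiments specify.

```python
def read_feature(feature_id: str, cache: dict, feature_tables: list[dict]):
    """Prefer the aggregated feature table in the cache space; fall back to
    scanning the scenario's raw feature tables only on a cache miss."""
    if feature_id in cache:
        return cache[feature_id]
    for table in feature_tables:                  # slow path: file-system tables
        if feature_id in table:
            cache[feature_id] = table[feature_id]  # warm the cache (assumption)
            return table[feature_id]
    return None                                    # feature unknown to the scenario

cache = {"age#1": 25}
tables = [{"item_ctr#1": 0.12}]
print(read_feature("age#1", cache, tables))        # cache hit: 25
print(read_feature("item_ctr#1", cache, tables))   # miss -> file system: 0.12
```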
In some embodiments, reading the corresponding features from the cache's aggregated table based on the identifiers in the sample data table and splicing them with the table includes: when each new period of each scenario arrives, performing the following: based on the feature identifiers included in the new period's sample data table, reading the corresponding features from the new period's aggregated table in the cache space; and splicing the features with the new period's sample data table to obtain the new period's cache features.

For example, in most application scenarios, modeling needs samples of multiple historical periods to construct enough training data so the algorithm learns more knowledge and becomes more robust. A daily-updated model that needs one month (30 periods) of samples must record 30 periods of sample data: when any one of the 30 periods arrives, the corresponding features are read from that period's aggregated table in the cache based on the identifiers in that period's sample data table and spliced with that period's sample data table, yielding that period's cache features.

In some embodiments, after the CPU obtains the new period's cache features, it incrementally updates the cache space based on them, reads the multiple cache features of the historical periods corresponding to the new period from the cache, and splices them with the new period's cache features to obtain the scenario's training sample set, the historical periods being those before the new period.

The new period's cache features serve as the historical-period cache features for the next new period. As shown in FIG. 5, after the CPU obtains the cache features of the 30th period (the new period's cache features), it fetches the cache features of periods 1-29 from the cache space and splices the cache features of periods 1-30 to obtain the scenario's training sample set.
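This second-level, multi-period cache can be sketched as follows; only the new period is spliced from scratch, with the storage layout and names being illustrative assumptions.

```python
import pandas as pd

period_cache: dict[int, pd.DataFrame] = {}  # period index -> spliced cache features

def splice_period(period: int, samples: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Join one period's sample data table with its aggregated features and
    land the result in the cache so later periods can reuse it."""
    spliced = samples.merge(features, on="id", how="left")
    period_cache[period] = spliced  # incremental update of the cache space
    return spliced

def training_set(new_period: int, samples: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Only the new period is computed; periods 1..new_period-1 are read
    back from the cache and concatenated into the training sample set."""
    new = splice_period(new_period, samples, features)
    history = [period_cache[p] for p in range(1, new_period) if p in period_cache]
    return pd.concat(history + [new], ignore_index=True)
```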
In some embodiments, after the training sample set is formed, the feature identifier of each training sample in the set is primary-key encoded to obtain the identifier's primary-key code value, and secondary-key encoded to obtain its secondary-key code value; the primary-key and secondary-key code values are spliced into the identifier's index code value; and the sample's feature identifier is updated to the index code value, yielding updated training samples on which the scenario's model is trained.

For example, after obtaining the training set, the CPU encodes every training sample in it into the format used for model training and compresses the samples to reduce their storage. Applying primary-key and secondary-key encoding, i.e., 2-level encoding, to a sample's feature identifier yields the identifier's index code value, which effectively lowers the encoding collision rate; that is, the index code value uniquely characterizes the identifier. (A concrete sketch of this 2-level encoding is given later, alongside the [pkey, skey, value] representation.)

In step 103, the scenario's neural network model is trained based on the training sample set, so that the model can fit the features in the set.

The scenario's neural network model is trained based on any one of multiple training algorithms and the training set so that it can fit the set's features; the multiple algorithms are all used to train the scenario's model, and a training algorithm includes the model's hyperparameters, loss function, and so on. After obtaining the set, the CPU can train the scenario's model with any of the algorithms and the set, constructing the scenario's model, which then predicts the corresponding application metric. For example, the scenario's neural network models include a frontal-face recognition network used to predict the frontal-face class (the probability that a frontal face belongs to a given user) and a profile-face recognition network used to predict the profile-face class (the probability that a profile belongs to a given user).
The method for constructing a recommendation model provided by the embodiments is described next with reference to the exemplary application and implementation of the electronic device provided by the embodiments. Referring to FIG. 6, a schematic flowchart of the method, the description follows the steps shown in FIG. 6.

In the steps below, a recommendation project denotes a concrete recommendation object, e.g., news, goods, or videos.

In step 201, the feature tables corresponding to each application scenario in the recommendation project are aggregated, and the resulting aggregated feature table is sent to the cache space.

The recommendation project includes multiple application scenarios in one-to-one correspondence with multiple recommendation metrics of the item to be recommended, and each scenario's recommendation model is used to predict its corresponding metric.

For example, the server's CPU fetches the scenario's feature tables from the file system in advance and pre-aggregates them into the scenario's aggregated table, which it sends to the cache space for later reuse, avoiding re-splicing the tables for every training run and reducing computational complexity.

In some embodiments, the aggregation includes performing the following for each scenario of the recommendation project: aggregating and deduplicating at least some features of the scenario's feature tables to obtain the scenario's aggregated feature table, and combining the identifiers in the aggregated table into the scenario's feature metadata table. A feature can then first be looked up in the metadata table; a hit means it exists in the aggregated table, from which it is then read.

In some embodiments, aggregating at least some features of the scenario's feature tables includes aggregating all of their features.

In some embodiments, aggregating at least some features includes: determining, from the feature tables of each scenario of the recommendation project, the features shared by the multiple training algorithms used to train the scenario's recommendation model, and aggregating the shared features to obtain the scenario's aggregated table.

In some embodiments, when a new feature table exists for the scenario, the new table and the scenario's aggregated table are spliced into a new aggregated table, based on which the cache space is incrementally updated.

For example, when the CPU detects a new feature table in the file system, it can read the scenario's aggregated table from the cache and the new table from the file system, splice them into a new aggregated table, and refresh the cache with it, replacing the old aggregated table.

In some embodiments, the aggregation includes: when each new period of each scenario arrives, aggregating the new feature tables corresponding to the new period into the period's aggregated table, and splicing the per-period aggregated tables into the scenario's aggregated table.

In some embodiments, after the CPU obtains the new period's aggregated table, it incrementally updates the cache based on it; when some new period of a scenario has not arrived, before the per-period splice, the aggregated table of the historical period corresponding to the new period is read from the cache and used as the new period's table, the historical period corresponding to the new period being before the new period.

The new period's aggregated table serves as the historical-period table for the next new period. When some new period has not arrived, e.g., for the monthly update period, when the month's update time has not been reached, the monthly aggregated table closest to this month (the historical period's aggregated table) is read from the cache and used as this month's aggregated table, so that the per-period tables can be spliced into the scenario's aggregated table.

In some embodiments, sending the obtained aggregated table to the cache space includes: mapping the feature identifiers in the table to integer values; updating the identifiers in the table to those values to obtain a compressed aggregated table; and sending the compressed table to the cache space.
In step 202, based on the user identifiers and item identifiers included in the sample data table, the corresponding user features and item features are read from the aggregated feature table in the cache space and spliced with the sample data table to form a training sample set.

The sample data table includes multiple samples, each with multiple feature identifiers, e.g., a user identifier and an item identifier. Before the splicing operation, the developer can enter the sample data table and training algorithm on the terminal; the terminal automatically obtains a training request based on them and sends it to the server. On receiving the request, the server reads the corresponding user and item features from the aggregated table in the cache based on the user and item identifiers in the sample data table and splices them with the table to form the training set, thereby reusing the aggregated table and splicing the sample data table quickly.

In some embodiments, when a corresponding user feature or item feature is not read from the cache's aggregated table, it is read from the scenario's multiple feature tables and spliced with the sample data table to form the training set.

For example, the CPU preferentially reads the user or item features corresponding to the sample data table from the feature table in the cache space; when they are not found there, it reads the scenario's feature tables from the file system and reads the user or item features from them, avoiding reads of user or item features from the file system every time; reading them from the cache greatly improves the feature read rate.

In some embodiments, reading and splicing the user and item features includes: when each new period of each scenario arrives, performing the following: based on the user and item identifiers in the new period's sample data table, reading the corresponding user and item features from the new period's aggregated table in the cache space, and splicing the user features, the item features, and the new period's sample data table to obtain the new period's cache features.

For example, modeling in most scenarios needs samples of multiple historical periods to construct enough training data so the algorithm learns more and becomes more robust. A daily-updated model needing one month (30 periods) of samples records 30 periods of sample data; when any of the 30 periods arrives, the user and item features corresponding to that period's sample data table are read from that period's aggregated table in the cache and spliced with the table, yielding that period's cache features.

In some embodiments, after the CPU obtains the new period's cache features, it incrementally updates the cache based on them, reads the multiple cache features of the historical periods corresponding to the new period from the cache, and splices them with the new period's cache features to obtain the scenario's training set, the historical periods being before the new period.

The new period's cache features serve as the historical-period cache features for the next new period. As shown in FIG. 5, after the CPU obtains the 30th period's cache features (the new period's), it fetches the cache features of periods 1-29 from the cache and splices periods 1-30 to obtain the scenario's training sample set.

In some embodiments, after the training set is formed, each training sample's feature identifier is primary-key encoded into a primary-key code value and secondary-key encoded into a secondary-key code value; the two are spliced into the identifier's index code value; and the sample's identifier is updated to the index code value, yielding updated samples on which the scenario's recommendation model is trained.

In step 203, the scenario's recommendation model is trained based on the training sample set, so that it can fit the user features and item features in the set.

The recommendation model is trained based on any one of multiple training algorithms and the training set so that it fits the user and item features in the set; the multiple algorithms all train the scenario's model, a training algorithm including the model's hyperparameters, loss function, and so on. After obtaining the set, the CPU can train the scenario's model with any algorithm and the set, constructing the scenario's recommendation model, which then predicts the corresponding recommendation metric. For example, the scenario's recommendation models include a click-through-rate prediction model, used to predict an advertisement's click-through rate, and an exposure-rate prediction model, used to predict an advertisement's exposure rate.
An exemplary application of the embodiments in a practical application scenario is described below.

Online real-time splicing in the related art is unsuitable for scenarios where sample labels cannot be determined in real time, and it does not support multi-period historical cache reuse, new scenarios, algorithm warm start, or historical data splicing. Offline splicing does support generating samples from historical data analysis, but modeling scenarios need many algorithms running in parallel with day-granularity model updates; each sample splice needs more than 90 historical periods, with samples and features at the hundreds-of-millions scale, which is extremely demanding on performance. Offline splicing cannot satisfy such large-scale needs, does not suit multi-algorithm parallelism, and cannot reuse caches during feature splicing.

To solve these problems, the embodiments of this application propose a high-performance splicing method (i.e., the method for constructing a neural network model) that adopts a multi-level cache mechanism: a first-level cache and a second-level cache. 1) The first-level cache is partitioned by algorithm application scenario and by the features used. Since similar algorithms in the same scenario share many features, an outer-layer pre-aggregation can generate an aggregated feature table; each algorithm, when executing its own splicing logic, preferentially uses the aggregated features in that table, so that not every algorithm has to join the samples with all the feature tables and waste computing resources. The aggregated features occupy a particularly large amount of storage and are slow to load in later computation stages, so the feature keys (feature identifiers) are also mapped to integers (int) for compression. 2) The second-level cache, inside a specific algorithm's splicing, caches the results of multi-period splicing (the cache features). For example, a daily model-update scenario often needs to splice more than 3 months of samples, and full splicing cannot achieve daily updates; the splice results of the historical sample periods are persisted as plaintext feature caches, and a new period's samples need only incremental splicing, effectively solving the performance problem. Finally, the spliced features must be encoded before the algorithm can be trained; hash encoding is applied to the multi-period spliced feature keys, and compared with statistical encoding over the full feature set, the performance of the embodiments improves significantly.

The embodiments can be applied to all kinds of machine learning modeling scenarios: offline configuration maps an algorithm to its samples, features, number of periods to splice, and so on, and completes sample and feature splicing and encoding to generate the algorithm's training data (training samples). Modeling is organized per application scenario: the algorithms of the same scenario use features with high overlap, so first-level caching can be done at the scenario dimension, and the scenario's set of algorithms can share the first-level cached aggregated table. A second-level cache is set inside each algorithm's splicing; for example, for a daily-updated model whose features have not changed, only the new period needs incremental splicing to achieve routinized splicing computation.
As shown in FIG. 7, a schematic flowchart of offline feature splicing, feature splicing has three elements: algorithm, sample, and feature. Based on the features an algorithm uses, the samples are spliced with the features (i.e., a database join): the sample is joined with all the feature tables, the splice result is generated, and it is fed into the algorithm for the training process. With many algorithms in parallel, the large number of joins performs poorly. First, joining several hundred-million-row tables is already very time-consuming, and every algorithm must compute it once. For instance, in scenario 1, algorithm 1 and algorithm 2 actually share most of their features, or have identical features and differ only in algorithm parameters; yet to validate effects, each performs the multi-table feature join. With k periods, such an operation repeats k times in every algorithm's splicing, and the required computing resources and time cost are unbearable.

To solve this, the embodiments pre-aggregate through the first level of the multi-level cache, avoiding every algorithm having to join the samples with all feature tables and wasting compute. As shown in FIG. 8, a schematic flowchart of the first-level cache, a modeling application scenario has these characteristics: many algorithms, and high overlap among the features they use, so some feature-processing logic can be placed in a shared front module. All features under an application scenario (e.g., scenario 1 is a product click-through-rate prediction scenario and scenario 2 a product exposure-rate prediction scenario; the features reside in feature tables, which are loaded to obtain them) are pre-aggregated, generating the aggregated features' metadata table (recording each period's feature information, e.g., the identifiers of already-aggregated features) and the aggregated feature table (the wide table containing the scenario's entire feature set). During splice computation, an algorithm first consults the metadata table for the aggregated-feature information, i.e., preferentially loads features from the aggregated table for splicing; if there are new features, a bypass path loads them from the raw feature tables and splices them with the aggregated features. Multiple algorithms thus share the pre-aggregated feature table instead of joining many large external wide tables.

Pre-aggregating features per application scenario undoubtedly boosts later splicing performance greatly: multiple algorithms can reuse the aggregated table, reducing the cost of multi-table joins. However, application scenarios frequently add, delete, and modify features. For a new aggregation period, new features can be added to the cache, but the features of historical periods cannot be recomputed, because online tasks are already using the cache and recomputation carries high risk. The embodiments therefore propose the following solutions:
1. The cache of already-aggregated historical periods stays unchanged, guaranteeing that the caches used by online splicing are unaffected by recomputation.

2. Deleted features are handled simply: if a raw feature goes offline, the historical periods in the aggregated table stay unchanged, and the deleted feature aggregated in new periods will not appear in the splice results. For added and modified features, since the historical caches' spliced features have already been generated and cannot be recomputed, a companion task is added, and the added and modified features are registered in the companion task's configuration metadata; the computation and storage of these features are isolated from the main pipeline, with the computation state maintained through the configuration metadata database. For example: the main task is A and the companion task is B; the metadata table of the main task is meta_A and that of the companion task meta_B; there are four features feature_1 to feature_4, with feature_1 and feature_2 routinized in the main task's aggregation result, feature_3 a newly added feature, and feature_4 a modified feature; task A's aggregated feature table is merge_A and task B's is merge_B. When a specific algorithm performs its splice computation, it loads meta_A and meta_B together: it first obtains the new feature feature_3 and the modified feature feature_4 from meta_B and loads them from merge_B, then takes features feature_1 and feature_2 from meta_A and merge_A, guaranteeing that historical-period features are fetched from the right place. The two sets of tasks are isolated in computation and storage and do not affect online task execution.
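A sketch of that main-task / companion-task lookup follows, reusing the names meta_A/meta_B and merge_A/merge_B from the example above; the dict-based storage is an illustrative assumption.

```python
meta_A = {"feature_1", "feature_2"}             # routinized features (main task A)
meta_B = {"feature_3", "feature_4"}             # added + modified features (companion task B)
merge_A = {"feature_1": 0.5, "feature_2": 1.0}  # aggregated feature table of task A
merge_B = {"feature_3": 7.0, "feature_4": 0.9}  # aggregated feature table of task B

def load_feature(feature_id: str):
    """Companion metadata wins, so a modified feature such as feature_4 is
    served from merge_B while untouched history stays in merge_A."""
    if feature_id in meta_B:
        return merge_B[feature_id]
    if feature_id in meta_A:
        return merge_A[feature_id]
    raise KeyError(feature_id)

print([load_feature(f) for f in ("feature_1", "feature_3", "feature_4")])
```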
3. Storage and computation are isolated, and two sets of mechanisms are maintained to handle feature additions, deletions, and modifications, as well as the data consistency of historical cache periods.

On the feature-aggregation optimization strategy: feature pre-aggregation is the pre-process of the later computation tasks, so if the model is to be updated daily, the aggregation task itself also needs high performance. The raw features of an application scenario usually divide into monthly, weekly, and daily features. With full aggregation only, the task must run routinely every day and aggregate all features; these features are distributed across many different wide tables, again requiring multi-table joins, with tables at the 1-billion-row-plus scale, more than 10,000 features per scenario dimension, dozens of external tables to splice, and storage reaching tens of terabytes. Computing the full volume every day wastes resources, and the time cost is unacceptable.

Hence, as shown in FIG. 9, a schematic flowchart of per-period feature aggregation, the embodiments partition the raw features by update period: day, week, and month each have a set of aggregation tasks. The monthly feature table only needs computing once a month and the weekly table once a week; the features of each period are aggregated separately, and then the day-level merge task is started. There is thus no need to join the features of all tables; only the aggregated features of the monthly, weekly, and daily result tables (the monthly, weekly, and daily aggregated feature tables) are merged, and computing this way greatly improves performance.
To validate the effect of per-period feature aggregation, experiments were run on monthly feature aggregation, weekly feature aggregation, daily feature aggregation, per-period aggregation, and full aggregation, with the results shown in Table 1.

Table 1 (experimental comparison of the aggregation schemes; provided as an image in the original)

Verified on data from real application scenarios, per-period aggregation with monthly, weekly, and daily period feature tables outperforms full aggregation by more than 15 times, effectively solving the feature-aggregation performance problem of practical application scenarios.
On storage optimization: pre-aggregation optimization has a significant effect on performance, but because the features used by all algorithms in a scenario are aggregated, the storage of the aggregated features becomes particularly large. The feature tables each have more than 1 billion records and more than 1,000 features, and the aggregated tables generated every day exceed 50 TB, so the loading time of later computation stages is extremely high and demands massive computer resources. Analysis in the embodiments found that the storage is dominated by the feature keys: to keep feature descriptions understandable, the named keys tend to be very long. The strings are therefore replaced through hierarchical integer encoding of the feature keys, identifying a feature as (pkey, skey, value): in the encoding, pkey can be assigned an encoding interval, skey corresponds to a pkey, and each skey's feature interval is encoded starting from 0 and need not be allocated uniformly. Experimentally, compared with plaintext feature storage, the hierarchical integer encoding of feature keys in the embodiments reduces storage by more than 70%.
On the multi-period cache: modeling in most scenarios needs samples of many historical periods to construct enough training data, so the algorithm learns more knowledge and becomes more robust. A daily-updated model, for example, needs 3 months (90 periods) of samples; besides daily training, the splicing stage alone could take several days. Two cases are considered: 1) a routinized algorithm whose features do not change, with daily routine splicing, training, and prediction; 2) features added gradually during model iteration.

For the first case, the embodiments cache the spliced historical features in plaintext, with a very marked efficiency gain. As shown in FIG. 10, a schematic flowchart of splicing historical features, the new period's features use incremental splicing, while the historical periods' features use the already-spliced cached data; the two are merged and then uniformly encoded to generate the feature format for algorithm training. The performance gain of this approach is obvious: the results of splicing the historical periods with the features (the splice cache features) are cached as plaintext pairs, and a new period loads the cache directly instead of re-executing every period's computation.

For the new features mentioned in the second case: a model's iteration features generally grow step by step; for example, 300 features have been selected as the baseline, and iterative optimization usually analyzes candidate features gradually before adding them to the model, so the historical cache's spliced features do not contain such a new feature either. Here the historical periods' cached data is reused: for the newly added feature, each period computes only the new feature's splice, merges it with the cached splice, and re-lands the cache. Compared with full re-splicing, this method is also several times faster.
On optimizing the performance of spliced-feature encoding: in the final stage of feature splicing, the spliced feature keys must be converted into the format used for algorithm training. A statistical method can be used: first count the full set of feature keys (e.g., categorical features such as reading interest -> entertainment, sports, gossip; discrete features such as age bracket; continuous features such as consumption count) and assign each feature key a globally unique index.

However, full statistical encoding is extremely time-consuming: the sample features of all historical periods must be merged and each sample's feature keys expanded. On this basis, a fast, general-purpose computing engine designed for large-scale data processing (e.g., Spark) implements the aggregation statistics with the groupByKey operator; large scenarios involve hundreds of millions of samples averaging hundreds of feature keys each, so the expanded keys reach the hundreds of billions, and problems such as uneven feature-key distribution and data skew can also arise.
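A toy sketch of the statistical encoding that the embodiments replace is given below; the names are illustrative, and at production scale this pass is a full O(n) sweep over hundreds of billions of expanded keys, which is exactly the cost the hash method avoids.

```python
def statistical_encode(all_samples: list[list[str]]) -> dict[str, int]:
    """Collect every feature key across all periods, then assign each key a
    globally unique dense index; this requires touching every key once."""
    index: dict[str, int] = {}
    for sample in all_samples:           # every sample of every period
        for key in sample:               # every feature key, expanded
            if key not in index:
                index[key] = len(index)  # globally unique index
    return index

samples = [["read_interest#ent", "age#25"], ["read_interest#sports", "age#25"]]
print(statistical_encode(samples))
# {'read_interest#ent': 0, 'age#25': 1, 'read_interest#sports': 2}
```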
The embodiments therefore replace statistical encoding with hashing. Considering that a hash function is in essence also a mapping, the O(n)-complexity full-encoding time can be optimized to O(1). For the feature-index encoding task, the embodiments use a uniformly distributed, low-collision-rate hash function to generate this mapping, which in essence also improves encoding computation performance while guaranteeing a low collision rate. The features include categorical features and continuous features: the discrete expression of a categorical feature is {feature dimension, category ID, feature value}, e.g., {reading interest, entertainment, 0.3}; the continuous expression of a continuous feature is {dimension, feature value}, e.g., {ctr, 0.1}. To fit both expression structures, the embodiments express a feature with the [primary key, secondary key, feature value] (i.e., [pkey, skey, value]) structure, each feature being expressed by a 2-level index: the categorical feature {reading interest, entertainment, 0.3} above can be expressed as [(pkey, reading interest), (skey, entertainment), (value: 0.3)], and the continuous feature as [(pkey, ctr), (skey, *), (value: 0.1)].

As shown in FIG. 11, a schematic flowchart of the 2-level index, pkey-encoding "gender" yields 121 and skey-encoding "male" yields 234, so the index code value of the raw feature "gender is male" is "234121". In the encoding stage, pkey and skey can thus be encoded hierarchically, greatly lowering the collision rate: on the basis that pkey is guaranteed collision-free, skey can tolerate a certain collision rate, since an skey usually belongs to a single class of features and the impact of a collision there is not as large.
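This 2-level hash encoding can be sketched as below, with interval sizes chosen per the sizing rule in the next paragraph; the specific hash function and the exact sizes are illustrative assumptions.

```python
import zlib

PKEY_SPACE = 1000  # assumed >= 5-10x the number of feature dimensions
SKEY_SPACE = 500   # assumed >= 2-5x the maximum secondary-key count

def encode(pkey: str, skey: str = "*") -> str:
    """2-level index: hash the primary and secondary keys separately, then
    concatenate (secondary code first, as in the FIG. 11 example)."""
    p = zlib.crc32(pkey.encode()) % PKEY_SPACE                   # level-1 code
    s = zlib.crc32((pkey + "#" + skey).encode()) % SKEY_SPACE    # level-2 code
    return f"{s}{p}"

# Categorical feature {reading interest, entertainment, 0.3}:
print(encode("reading_interest", "entertainment"))
# Continuous feature {ctr, 0.1} uses the wildcard secondary key:
print(encode("ctr"))
```

Hashing the two levels separately is what keeps collisions tolerable: a collision between secondary keys stays confined to one feature dimension instead of colliding across the whole key space.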
The embodiments were tested with feature data from real business. With the primary key's mapping interval at 5-10 times the number of primary-key dimensions and the secondary key's mapping interval at 2-5 times the maximum number of secondary keys, a low collision rate can be guaranteed. The spliced-feature encoding optimization scheme of the embodiments is highly efficient (a computational conversion), prevents this stage from becoming a system bottleneck (it can handle features of hundred-million dimensionality), and is simple to maintain (no Key-ID mapping data is needed; especially for online serving, it avoids work such as data export and state management).

In summary, the high-performance splicing method proposed by the embodiments can solve the splicing-performance problem of large-scale samples and features. Multi-level caching, incremental splicing, and similar techniques effectively reuse common modules, so that not every algorithm joins the samples with all feature tables; aggregated features are effectively compressed in storage through feature-key encoding, reducing the loading overhead of the computation stages after aggregation; and the spliced features are hashed to generate indexes, avoiding the high cost of full statistical encoding. Overall performance improves more than 5-fold, and the computing resources consumed by multi-algorithm experiments in the splicing stage drop sharply.
至此已经结合本申请实施例提供的服务器的示例性应用和实施,说明本申请实施例提供的构建推荐模型的方法。本申请实施例还提供构建推荐模型的装置,实际应用中,构建推荐模型的装置中的各功能模块可以由电子设备(如终端设备、服务器或服务器集群)的硬件资源,如处理器等计算资源、通信资源(如用于支持实现光缆、蜂窝等各种方式通信)、存储器协同实现。图2示出了存储在存储器550中的构建推荐模型的装置555,其可以是程序和插件等形式的软件,例如,软件C/C++、Java等编程语言设计的软件模块、C/C++、Java等编程语言设计的应用软件或大型软件系统中的专用软件模块、应用程序接口、插件、云服务等实现方式,下面对不同的实现方式举例说明。
Example 1: the apparatus for constructing a recommendation model is a mobile application and module
The apparatus 555 for constructing a recommendation model in the embodiments of this application may be provided as a software module designed in programming languages such as C/C++ or Java and embedded in various mobile applications based on systems such as Android or iOS (stored as executable instructions in the mobile terminal's storage medium and executed by the mobile terminal's processor), so that the relevant recommendation-model construction tasks are completed directly using the mobile terminal's own computing resources, with the processing results transferred to a remote server periodically or aperiodically via various network communication channels, or saved locally on the mobile terminal.
Example 2: the apparatus for constructing a recommendation model is a server application and platform
The apparatus 555 for constructing a recommendation model in the embodiments of this application may be provided as application software designed in programming languages such as C/C++ or Java, or as a dedicated software module in a large software system, running on the server side (stored as executable instructions in the server-side storage medium and run by the server-side processor), with the server using its own computing resources to complete the relevant information recommendation tasks.
The embodiments of this application may also be provided on a distributed, parallel computing platform composed of multiple servers, carrying a customized, easily interactive web interface or other user interfaces (UI, User Interface), forming an information recommendation platform for use by individuals, groups, or organizations.
Example 3: the apparatus for constructing a recommendation model is a server-side application program interface (API, Application Program Interface) and plug-in
The apparatus 555 for constructing a recommendation model in the embodiments of this application may be provided as a server-side API or plug-in for users to invoke, so as to perform the method for constructing a recommendation model of the embodiments of this application, and be embedded in various application programs.
Example 4: the apparatus for constructing a recommendation model is a mobile-device client API and plug-in
The apparatus 555 for constructing a recommendation model in the embodiments of this application may be provided as a mobile-device API or plug-in for users to invoke, so as to perform the method for constructing a recommendation model of the embodiments of this application.
Example 5: the apparatus for constructing a recommendation model is an open cloud service
The apparatus 555 for constructing a recommendation model in the embodiments of this application may be provided as a recommendation-model construction cloud service open to users, from which individuals, groups, or organizations obtain recommendation lists.
The apparatus 555 for constructing a recommendation model includes a series of modules, including a first aggregation module 5551, a first joining module 5552, and a first training module 5553. The scheme in which the modules of the apparatus 555 for constructing a recommendation model provided by the embodiments of this application cooperate to implement recommendation model construction is described below.
The first aggregation module 5551 is configured to perform aggregation processing on a plurality of feature tables corresponding to each application scenario in a recommendation project and send the obtained aggregated feature table to a cache space, wherein the recommendation project comprises a plurality of application scenarios in one-to-one correspondence with a plurality of recommendation indicators of an item to be recommended, and the recommendation model of each application scenario is used to predict the corresponding recommendation indicator. The first joining module 5552 is configured to read, based on a user identifier and an item identifier comprised in a sample data table, the corresponding user features and item features from the aggregated feature table in the cache space, and perform joining processing with the sample data table to form a training sample set. The first training module 5553 is configured to train the recommendation model of the application scenario based on the training sample set, wherein the trained recommendation model is capable of fitting the user features and the item features in the training sample set.
In some embodiments, the first aggregation module 5551 is further configured to perform the following processing for each application scenario in the recommendation project: performing aggregation processing and deduplication processing on at least some features of the plurality of feature tables corresponding to the application scenario to obtain the aggregated feature table of the application scenario; and performing combination processing on the feature identifiers in the aggregated feature table to obtain a feature metadata table of the application scenario.
In some embodiments, the first aggregation module 5551 is further configured to perform aggregation processing on all features of the plurality of feature tables corresponding to the application scenario.
In some embodiments, the first aggregation module 5551 is further configured to determine, from the plurality of feature tables corresponding to each application scenario in the recommendation project, the features shared by a plurality of training algorithms used for training the recommendation model of the application scenario, and perform aggregation processing on the shared features to obtain the aggregated feature table of the application scenario.
In some embodiments, the first joining module 5552 is further configured to, when a corresponding user feature or item feature is not read from the aggregated feature table in the cache space, read the corresponding user feature or item feature from the plurality of feature tables corresponding to the application scenario and perform joining processing with the sample data table to form the training sample set.
In some embodiments, when a newly added feature table exists for the application scenario, the aggregation module 5551 is further configured to perform joining processing on the newly added feature table and the aggregated feature table of the application scenario to obtain a new aggregated feature table, and incrementally update the cache space based on the new aggregated feature table.
In some embodiments, the aggregation module 5551 is further configured to, when each new period of each application scenario arrives, perform aggregation processing on the newly added feature table corresponding to the new period to obtain an aggregated feature table of the new period, and perform joining processing on the aggregated feature table of each new period to obtain the aggregated feature table of the application scenario.
In some embodiments, the aggregation module 5551 is further configured to incrementally update the cache space based on the aggregated feature table of the new period, and read, from the cache space, an aggregated feature table of a historical period corresponding to the new period as the aggregated feature table of the new period, wherein the historical period corresponding to the new period is a period before the new period.
In some embodiments, the joining module 5552 is further configured to, when each new period of each application scenario arrives, perform the following processing: reading, based on the user identifier and the item identifier comprised in the sample data table of the new period, the corresponding user features and item features from the aggregated feature table of the new period in the cache space; and performing joining processing on the user features, the item features, and the sample data table of the new period to obtain cache features of the new period.
In some embodiments, the joining module 5552 is further configured to incrementally update the cache space based on the cache features of the new period; read, from the cache space, a plurality of cache features of historical periods corresponding to the new period; and perform joining processing on the plurality of cache features of the historical periods and the cache features of the new period to obtain the training sample set of the application scenario, wherein the historical periods are periods before the new period.
In some embodiments, the aggregation module 5551 is further configured to perform mapping processing on the feature identifiers in the obtained aggregated feature table to obtain integer values of the feature identifiers; update the feature identifiers in the aggregated feature table to the integer values to obtain a compressed aggregated feature table; and send the compressed aggregated feature table to the cache space.
In some embodiments, the training module 5553 is further configured to perform primary-key encoding on the feature identifier of each training sample in the training sample set to obtain a primary-key code value of the feature identifier; perform secondary-key encoding on the feature identifier of each training sample to obtain a secondary-key code value of the feature identifier; concatenate the primary-key code value and the secondary-key code value to obtain an index code value of the feature identifier; and update the feature identifier of the training sample to the index code value to obtain an updated training sample, so as to train the recommendation model of the application scenario based on the updated training sample.
The apparatus for constructing a neural network model includes a series of modules, including a second aggregation module, a second joining module, and a second training module. The scheme in which the modules of the apparatus for constructing a neural network model provided by the embodiments of this application cooperate to implement neural network model construction is described below.
The second aggregation module is configured to perform aggregation processing on a plurality of feature tables corresponding to each application scenario in an application project and send the obtained aggregated feature table to a cache space, wherein the application project comprises a plurality of application scenarios in one-to-one correspondence with a plurality of application indicators, and the neural network model of each application scenario is used to predict the corresponding application indicator. The second joining module is configured to read, based on the feature identifiers comprised in a sample data table, the corresponding features from the aggregated feature table in the cache space and perform joining processing with the sample data table to form a training sample set. The second training module is configured to train the neural network model of the application scenario based on the training sample set, wherein the trained neural network model is capable of fitting the features in the training sample set.
The embodiments of this application provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for constructing a recommendation model or the method for constructing a neural network model described above in the embodiments of this application.
The embodiments of this application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method for constructing a recommendation model or the method for constructing a neural network model provided by the embodiments of this application, for example, the method for constructing a neural network model shown in FIGS. 3-4, or the method for constructing a recommendation model shown in FIG. 6.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM, or various devices including one or any combination of the foregoing memories.
In some embodiments, the executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to a file in a file system, and may be stored as part of a file holding other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subprograms, or code sections).
As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The foregoing is merely embodiments of this application and is not intended to limit the protection scope of this application. Any modification, equivalent replacement, improvement, and the like made within the spirit and scope of this application shall fall within the protection scope of this application.

Claims (17)

  1. A method for constructing a recommendation model, the method comprising:
    performing aggregation processing on a plurality of feature tables corresponding to each application scenario in a recommendation project, and sending an obtained aggregated feature table to a cache space;
    wherein the recommendation project comprises a plurality of application scenarios in one-to-one correspondence with a plurality of recommendation indicators of an item to be recommended, and a recommendation model of each application scenario is used to predict the corresponding recommendation indicator;
    reading, based on a user identifier and an item identifier comprised in a sample data table, corresponding user features and item features from the aggregated feature table in the cache space, and performing joining processing with the sample data table to form a training sample set;
    training the recommendation model of the application scenario based on the training sample set;
    wherein the trained recommendation model is capable of fitting the user features and the item features in the training sample set.
  2. The method according to claim 1, wherein
    the performing aggregation processing on a plurality of feature tables corresponding to each application scenario in a recommendation project comprises:
    performing the following processing for each application scenario in the recommendation project:
    performing aggregation processing and deduplication processing on at least some features of the plurality of feature tables corresponding to the application scenario to obtain the aggregated feature table of the application scenario;
    the method further comprising:
    performing combination processing on feature identifiers in the aggregated feature table to obtain a feature metadata table of the application scenario.
  3. The method according to claim 2, wherein the performing aggregation processing on at least some features of the plurality of feature tables corresponding to the application scenario comprises:
    performing aggregation processing on all features of the plurality of feature tables corresponding to the application scenario.
  4. The method according to claim 2, wherein the performing aggregation processing on at least some features of the plurality of feature tables corresponding to the application scenario comprises:
    determining, from the plurality of feature tables corresponding to each application scenario in the recommendation project, features shared by a plurality of training algorithms used for training the recommendation model of the application scenario;
    performing aggregation processing on the shared features to obtain the aggregated feature table of the application scenario.
  5. The method according to claim 4, wherein the method further comprises:
    when a corresponding user feature or item feature is not read from the aggregated feature table in the cache space, reading the corresponding user feature or item feature from the plurality of feature tables corresponding to the application scenario, and performing joining processing with the sample data table to form the training sample set.
  6. The method according to claim 1, wherein, when a newly added feature table exists for the application scenario, the method further comprises:
    performing joining processing on the newly added feature table and the aggregated feature table of the application scenario to obtain a new aggregated feature table;
    incrementally updating the cache space based on the new aggregated feature table.
  7. The method according to claim 1, wherein the performing aggregation processing on a plurality of feature tables corresponding to each application scenario in a recommendation project comprises:
    when each new period of each application scenario arrives, performing aggregation processing on a newly added feature table corresponding to the new period to obtain an aggregated feature table of the new period;
    performing joining processing on the aggregated feature table of each new period to obtain the aggregated feature table of the application scenario.
  8. The method according to claim 7, wherein the method further comprises:
    incrementally updating the cache space based on the aggregated feature table of the new period;
    when any new period of each application scenario has not arrived, before the performing joining processing on the aggregated feature table of each new period, the method further comprises:
    reading, from the cache space, an aggregated feature table of a historical period corresponding to the new period, and using it as the aggregated feature table of the new period;
    wherein the historical period corresponding to the new period is a period before the new period.
  9. The method according to claim 1, wherein the reading, based on a user identifier and an item identifier comprised in a sample data table, corresponding user features and item features from the aggregated feature table in the cache space, and performing joining processing with the sample data table comprises:
    when each new period of each application scenario arrives, performing the following processing:
    reading, based on the user identifier and the item identifier comprised in the sample data table of the new period, corresponding user features and item features from the aggregated feature table of the new period in the cache space;
    performing joining processing on the user features, the item features, and the sample data table of the new period to obtain cache features of the new period.
  10. The method according to claim 9, wherein
    the method further comprises:
    incrementally updating the cache space based on the cache features of the new period;
    the forming a training sample set comprises:
    reading, from the cache space, a plurality of cache features of historical periods corresponding to the new period;
    performing joining processing on the plurality of cache features of the historical periods and the cache features of the new period to obtain the training sample set of the application scenario;
    wherein the historical periods are periods before the new period.
  11. The method according to claim 1, wherein the sending the obtained aggregated feature table to a cache space comprises:
    performing mapping processing on feature identifiers in the obtained aggregated feature table to obtain integer values of the feature identifiers;
    updating the feature identifiers in the aggregated feature table to the integer values to obtain a compressed aggregated feature table;
    sending the compressed aggregated feature table to the cache space.
  12. The method according to claim 1, wherein, after the forming a training sample set, the method further comprises:
    performing primary-key encoding on a feature identifier of each training sample in the training sample set to obtain a primary-key code value of the feature identifier;
    performing secondary-key encoding on the feature identifier of each training sample to obtain a secondary-key code value of the feature identifier;
    concatenating the primary-key code value and the secondary-key code value to obtain an index code value of the feature identifier;
    updating the feature identifier of the training sample to the index code value to obtain an updated training sample;
    training the recommendation model of the application scenario based on the updated training sample.
  13. A method for constructing a neural network model, the method comprising:
    performing aggregation processing on a plurality of feature tables corresponding to each application scenario in an application project, and sending an obtained aggregated feature table to a cache space;
    wherein the application project comprises a plurality of application scenarios in one-to-one correspondence with a plurality of application indicators, and a neural network model of each application scenario is used to predict the corresponding application indicator;
    reading, based on feature identifiers comprised in a sample data table, corresponding features from the aggregated feature table in the cache space, and performing joining processing with the sample data table to form a training sample set;
    training the neural network model of the application scenario based on any one of a plurality of training algorithms and the training sample set;
    wherein the trained neural network model is capable of fitting the features in the training sample set.
  14. An apparatus for constructing a recommendation model, the apparatus comprising:
    a first aggregation module, configured to perform aggregation processing on a plurality of feature tables corresponding to each application scenario in a recommendation project and send an obtained aggregated feature table to a cache space, wherein the recommendation project comprises a plurality of application scenarios in one-to-one correspondence with a plurality of recommendation indicators of an item to be recommended, and a recommendation model of each application scenario is used to predict the corresponding recommendation indicator;
    a first joining module, configured to read, based on a user identifier and an item identifier comprised in a sample data table, corresponding user features and item features from the aggregated feature table in the cache space, and perform joining processing with the sample data table to form a training sample set;
    a first training module, configured to train the recommendation model of the application scenario based on any one of a plurality of training algorithms and the training sample set, wherein the trained recommendation model is capable of fitting the user features and the item features in the training sample set.
  15. An apparatus for constructing a neural network model, the apparatus comprising:
    a second aggregation module, configured to perform aggregation processing on a plurality of feature tables corresponding to each application scenario in an application project and send an obtained aggregated feature table to a cache space, wherein the application project comprises a plurality of application scenarios in one-to-one correspondence with a plurality of application indicators, and a neural network model of each application scenario is used to predict the corresponding application indicator;
    a second joining module, configured to read, based on feature identifiers comprised in a sample data table, corresponding features from the aggregated feature table in the cache space, and perform joining processing with the sample data table to form a training sample set;
    a second training module, configured to train the neural network model of the application scenario based on any one of a plurality of training algorithms and the training sample set, wherein the trained neural network model is capable of fitting the features in the training sample set.
  16. An electronic device, comprising:
    a memory, configured to store executable instructions;
    a processor, configured to, when executing the executable instructions stored in the memory, implement the method for constructing a recommendation model according to any one of claims 1 to 12, or the method for constructing a neural network model according to claim 13.
  17. A computer-readable storage medium, storing executable instructions which, when executed by a processor, implement the method for constructing a recommendation model according to any one of claims 1 to 12, or the method for constructing a neural network model according to claim 13.
PCT/CN2021/112762 2020-09-04 2021-08-16 Method for constructing recommendation model, method for constructing neural network model, apparatus, electronic device, and storage medium WO2022048432A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/072,622 US20230094293A1 (en) 2020-09-04 2022-11-30 Method and apparatus for constructing recommendation model and neural network model, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010919935.5 2020-09-04
CN202010919935.5A CN114154048A (zh) 2020-09-04 2020-09-04 Method and apparatus for constructing recommendation model, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/072,622 Continuation US20230094293A1 (en) 2020-09-04 2022-11-30 Method and apparatus for constructing recommendation model and neural network model, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022048432A1 (zh)

Family

ID=80460251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112762 WO2022048432A1 (zh) Method for constructing recommendation model, method for constructing neural network model, apparatus, electronic device, and storage medium

Country Status (3)

Country Link
US (1) US20230094293A1 (zh)
CN (1) CN114154048A (zh)
WO (1) WO2022048432A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579584A (zh) * 2022-05-06 2022-06-03 腾讯科技(深圳)有限公司 Data table processing method and apparatus, computer device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160371589A1 (en) * 2015-06-17 2016-12-22 Yahoo! Inc. Systems and methods for online content recommendation
US20180247362A1 (en) * 2017-02-24 2018-08-30 Sap Se Optimized recommendation engine
CN109241418A (zh) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Random-forest-based abnormal user identification method and apparatus, device, and medium
CN109241031A (zh) * 2018-08-14 2019-01-18 腾讯科技(深圳)有限公司 Model generation method, model use method, apparatus, system, and storage medium
CN111026971A (zh) * 2019-12-25 2020-04-17 腾讯科技(深圳)有限公司 Content pushing method and apparatus, and computer storage medium
CN111582596A (zh) * 2020-05-14 2020-08-25 公安部交通管理科学研究所 Pure-electric-vehicle driving-range risk early-warning method fusing traffic state information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160371589A1 (en) * 2015-06-17 2016-12-22 Yahoo! Inc. Systems and methods for online content recommendation
US20180247362A1 (en) * 2017-02-24 2018-08-30 Sap Se Optimized recommendation engine
CN109241031A (zh) * 2018-08-14 2019-01-18 腾讯科技(深圳)有限公司 Model generation method, model use method, apparatus, system, and storage medium
CN109241418A (zh) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Random-forest-based abnormal user identification method and apparatus, device, and medium
CN111026971A (zh) * 2019-12-25 2020-04-17 腾讯科技(深圳)有限公司 Content pushing method and apparatus, and computer storage medium
CN111582596A (zh) * 2020-05-14 2020-08-25 公安部交通管理科学研究所 Pure-electric-vehicle driving-range risk early-warning method fusing traffic state information

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579584A (zh) * 2022-05-06 2022-06-03 腾讯科技(深圳)有限公司 Data table processing method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
US20230094293A1 (en) 2023-03-30
CN114154048A (zh) 2022-03-08

Similar Documents

Publication Publication Date Title
WO2021213293A1 (zh) Ubiquitous operating system oriented to crowd sensing
US20190050756A1 (en) Machine learning service
US11645548B1 (en) Automated cloud data and technology solution delivery using machine learning and artificial intelligence modeling
US9684869B2 (en) Infrastructure and architecture for development and execution of predictive models
CN111339073A (zh) Real-time data processing method and apparatus, electronic device, and readable storage medium
CN111444181B (zh) Knowledge graph updating method and apparatus, and electronic device
CN114416855A (zh) Visualization platform and method based on electric power big data
CN102880683A (zh) Automatic network generation system for feasibility study reports and generation method thereof
CN113836131A (zh) Big data cleaning method and apparatus, computer device, and storage medium
CN111651524B (zh) Auxiliary implementation method and apparatus for online prediction using machine learning model
CN115564071A (zh) Method and system for generating data labels for electric power Internet of Things devices
CN106570153A (zh) Data extraction method and system for massive URLs
WO2022048432A1 (zh) Method for constructing recommendation model, method for constructing neural network model, apparatus, electronic device, and storage medium
CN116860856A (zh) Financial data processing method and apparatus, computer device, and storage medium
CN110020310A (zh) Resource loading method and apparatus, terminal, and storage medium
CN112214602B (zh) Humor-degree-based text classification method and apparatus, electronic device, and storage medium
Shi et al. Human resources balanced allocation method based on deep learning algorithm
WO2021115269A1 (zh) User cluster prediction method and apparatus, computer device, and storage medium
CN114611712B (zh) Prediction method based on heterogeneous federated learning, model generation method, and apparatus
WO2024139703A1 (zh) Object recognition model updating method and apparatus, electronic device, storage medium, and computer program product
WO2023045636A1 (zh) Pipeline-based machine learning method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN117992241B (zh) Big-data-based bank-enterprise matching service system and method for technology-oriented SMEs
US20240211619A1 (en) Determining collaboration recommendations from file path information
Shan Data management and sharing mechanism of e-commerce industry based on association rule mining in big data era
Tang Construction and Application of Big Data Marketing System of Tourism Economy Based on Hadoop Framework

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21863499

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21863499

Country of ref document: EP

Kind code of ref document: A1