WO2024109907A1

WO2024109907A1 - Quantization method and apparatus, and recommendation method and apparatus

Info

Publication number: WO2024109907A1
Application number: PCT/CN2023/133825
Authority: WO
Inventors: 郭慧丰; 李世伟; 侯璐; 章伟; 唐睿明
Original assignee: 华为技术有限公司
Priority date: 2022-11-25
Filing date: 2023-11-24
Publication date: 2024-05-30
Also published as: CN115983362A

Abstract

Provided in the present application are a quantization method and apparatus, and a recommendation method and apparatus, which are used for quantizing each feature in a full-precision embedding representation on the basis of an adaptive step size, so as to improve the quantization precision. The method comprises: first acquiring a full-precision embedding representation, the embedding representation comprising a plurality of features; determining an adaptive step size separately corresponding to each feature amongst the plurality of features, wherein the step sizes corresponding to the plurality of features may be the same or different; and then, according to the adaptive step size corresponding to each feature, respectively quantizing the plurality of features to obtain a low-precision embedding representation, wherein the precision of the features in the low-precision embedding representation is lower than the precision of the features in the full-precision embedding representation, thus reducing a storage space required for storing or transmitting the embedding representation.

Description

A quantification method, a recommendation method and a device

This application claims priority to the Chinese patent application filed with the China Patent Office on November 25, 2022, with application number 202211490535.2 and application name “A Quantification Method, Recommendation Method and Device”, the entire contents of which are incorporated by reference in this application.

Technical Field

The present application relates to the field of computers, and in particular to a quantification method, a recommendation method and a device.

Background technique

Machine learning systems, including personalized recommendation systems, train the parameters of machine learning models based on input data and labels through optimization methods such as gradient descent. After the model parameters converge, the model can be used to predict unknown data.

For example, taking the click rate prediction model in the recommendation system as an example, the model can usually include an embedding layer and a multilayer perceptron (MLP) layer. The embedding layer is usually used to map high-dimensional sparse data to low-dimensional dense vectors, and the MLP is usually used to fit the combination relationship between features, sequence information, or click rate, etc. However, for some large-scale data scenarios, the input data volume of the recommendation model is usually very large, so the scale of the embedding layer is very large, resulting in a large amount of storage space required during storage and training.

Summary of the invention

The present application provides a quantization method, a recommended method and a device for quantizing each feature in a full-precision embedded representation based on an adaptive step size, thereby improving the quantization accuracy.

In view of this, in a first aspect, the present application provides a quantization method, comprising: first, obtaining a full-precision embedded representation, the embedded representation including multiple features; determining an adaptive step size corresponding to each of the multiple features, the step sizes corresponding to the multiple features may be the same or different; then quantizing the multiple features according to the adaptive step size corresponding to each feature, to obtain a low-precision embedded representation, the accuracy of the features in the low-precision embedded representation is lower than the accuracy of the features in the full-precision embedded representation, so that the storage resources or transmission resources required to save or transmit the low-precision embedded representation are lower than the storage resources required to save or transmit the full-precision embedded representation, thereby reducing the storage space required to save or transmit the embedded representation.

In the implementation manner of the present application, in the process of quantizing the full-precision embedded representation, the adaptive step size corresponding to each feature can be calculated, and quantization can be performed based on the adaptive step size corresponding to each feature, thereby improving the quantization accuracy and avoiding the loss of accuracy caused by the fixed step size. For example, when a certain feature is updated less frequently, if a fixed step size is used, the quantization accuracy of the less updated part may be reduced due to the step size. However, through the quantization method provided by the present application, each feature has a corresponding adaptive step size, which matches the length of each feature or the amount of updated data, thereby avoiding data loss during quantization and improving quantization accuracy.

In one possible implementation, a low-precision embedding representation vocabulary is applied to a neural network, and the aforementioned acquisition of the full-precision embedding representation vocabulary may include: acquiring a representation corresponding to the input data of the current iteration from the low-precision embedding representation vocabulary to obtain a low-precision embedding representation of the current iteration; and dequantizing the low-precision embedding representation of the current iteration to obtain a full-precision embedding representation of the current iteration.

Therefore, the quantization method provided in the present application can be applied to quantization in the process of neural network training. In each iteration, a low-precision embedded representation is transmitted, and a full-precision embedded representation can be obtained by dequantizing it through the corresponding adaptive step size, thereby achieving full-precision restoration of the low-precision embedded representation and obtaining a lossless full-precision embedded representation, which can reduce the storage space occupied by the embedded representation during the neural network training process.

In a possible implementation, the aforementioned determination of the adaptive step size corresponding to each of the multiple features may include: using the full-precision embedding representation of the current iteration as the input of the neural network to obtain the full-precision gradient corresponding to the prediction result of the current iteration; obtaining an updated full-precision embedding representation based on the full-precision gradient to obtain an updated full-precision embedding representation; obtaining the adaptive step size corresponding to each feature in the updated full-precision embedding representation based on the full-precision gradient.

In the implementation mode of the present application, during the training process of the neural network, the adaptive step size corresponding to each feature can be determined based on the full-precision gradient, so that the step size can be adaptively updated to obtain an adaptive step size that matches each feature. This can avoid reducing the quantization accuracy due to the small update amount in the embedded representation, and can improve the quantization accuracy.

In a possible implementation, the aforementioned quantizing of multiple features according to the adaptive step size corresponding to each feature includes: quantizing multiple features in the full-precision low-dimensional representation of the current iteration according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation.

Therefore, in the implementation manner of the present application, an adaptive step size obtained based on full-precision gradient calculation can be used for quantization, so as to synchronously quantize the embedded representation during the training process.

In a possible implementation, the method provided in the present application may further include: updating a low-precision embedding representation vocabulary according to the low-precision embedding representation to obtain an updated low-precision embedding representation vocabulary.

After quantization is performed to obtain a new low-precision embedding representation, the new low-precision embedding representation can be written back into the low-precision embedding representation vocabulary to facilitate subsequent low-precision storage or transmission.

In a possible implementation, the aforementioned determination of the adaptive step length corresponding to each of the multiple features may include: calculating the adaptive step length corresponding to each feature by using a heuristic algorithm.

In the implementation manner of the present application, the adaptive step size can be calculated by a heuristic algorithm, which can be applicable to the scenario of storing a low-precision embedded representation vocabulary.

In a possible implementation, the aforementioned calculation of the adaptive step size corresponding to each feature by a heuristic algorithm may include: calculating the adaptive step size corresponding to each feature according to the absolute value of the weight in each feature. Therefore, the adaptive step size can be calculated based on the weight value of each feature itself without relying on external data.

In a possible implementation, the aforementioned quantizing of multiple features according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation vocabulary may also include: obtaining discrete features of each feature according to the adaptive step size corresponding to each feature; truncating the discrete features of each feature by a random truncation algorithm to obtain a low-precision embedded representation.

In the implementation manner of the present application, each feature can be truncated by a random truncation algorithm, so that effective features can be adaptively retained and quantization accuracy can be improved.

In a possible implementation, the low-precision embedding representation vocabulary is applied to a language model or a recommendation model, the language model is used to obtain the semantic information of the corpus, and the recommendation model is used to generate recommendation information based on the user's information. Therefore, the method provided in this application can be applied to natural language processing or recommendation scenarios, etc.

In a second aspect, the present application provides a recommendation method, comprising: obtaining input data, the input data including data generated by a user for at least one behavior of a terminal; obtaining a low-precision embedded representation corresponding to the input data from a low-precision embedded representation vocabulary, the low-precision embedded representation including multiple features; dequantizing the multiple features according to an adaptive step size corresponding to each of the multiple features to obtain a full-precision embedded representation, and the adaptive step size may be an adaptive step size obtained when quantizing the full-precision embedded representation; outputting recommendation information based on the full-precision embedded representation as input to a neural network, and the recommendation information is used to make recommendations for at least one behavior of the user.

In the implementation of the present application, during the reasoning process of the neural network, the low-precision embedded representation can be dequantized using an adaptive step size to obtain a full-precision embedded representation, so that the low-precision can be saved or transmitted during the reasoning process, and the full-precision embedded representation can be obtained by losslessly restoring the adaptive step size. This can reduce the storage space occupied by the embedded representation vocabulary and perform lossless restoration when used.

In a possible implementation, the neural network includes a language model or a recommendation model, the language model is used to obtain semantic information of the corpus, and the recommendation model is used to generate recommendation information based on user information.

In a third aspect, the present application provides a quantization device, comprising:

An acquisition module is used to acquire a full-precision embedded representation, where the embedded representation includes multiple features;

A determination module, used to determine the adaptive step size corresponding to each of the multiple features;

The quantization module is used to quantize multiple features according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation, where the accuracy of the features in the low-precision embedded representation is lower than the accuracy of the features in the full-precision embedded representation.

In one possible implementation, a low-precision embedding representation vocabulary is applied to a neural network.

The acquisition module is specifically used to obtain the representation corresponding to the input data of the current iteration from the low-precision embedding representation vocabulary to obtain the low-precision embedding representation of the current iteration; dequantize the low-precision embedding representation of the current iteration to obtain the full-precision embedding representation of the current iteration.

In one possible implementation, the determination module is specifically used to: use the full-precision embedding representation of the current iteration as the input of the neural network to obtain the full-precision gradient corresponding to the prediction result of the current iteration; obtain the updated full-precision embedding representation according to the full-precision gradient to obtain the updated full-precision embedding representation; obtain the adaptive step size corresponding to each feature in the updated full-precision embedding representation according to the full-precision gradient.

In a possible implementation, the quantization module is specifically configured to quantize the current iteration according to the adaptive step size corresponding to each feature. The multiple features in the full-precision low-dimensional representation of the previous generation are quantized to obtain a low-precision embedded representation.

In a possible implementation, the acquisition module is further configured to update the low-precision embedding representation vocabulary according to the low-precision embedding representation to obtain an updated low-precision embedding representation vocabulary.

In a possible implementation, the determination module is specifically configured to calculate the adaptive step size corresponding to each feature by using a heuristic algorithm.

In a possible implementation, the determination module is specifically configured to calculate the adaptive step size corresponding to each feature according to the absolute value of the weight in each feature.

In a possible implementation, the quantization module is specifically used to: obtain a discrete feature of each feature according to an adaptive step size corresponding to each feature; and truncate the discrete feature of each feature by a random truncation algorithm to obtain a low-precision embedded representation.

In a possible implementation, the low-precision embedding representation vocabulary is applied to a language model or a recommendation model. The language model is used to obtain semantic information of the corpus, and the recommendation model is used to generate recommendation information based on user information.

In a fourth aspect, the present application provides a recommendation device, comprising:

An input module, used to obtain input data, where the input data includes data generated by at least one behavior of a user on a terminal;

An acquisition module is used to acquire a low-precision embedding representation corresponding to the input data from a low-precision embedding representation vocabulary, where the low-precision embedding representation includes multiple features;

A dequantization module, used to dequantize multiple features according to the adaptive step size corresponding to each of the multiple features to obtain a full-precision embedded representation;

The recommendation module is used to output recommendation information based on the full-precision embedding representation as the input of the neural network, and the recommendation information is used to recommend at least one behavior of the user.

In a fifth aspect, the present application provides a quantification device, which includes: a processor, a memory, an input/output device, and a bus; the memory stores computer instructions; when the processor executes the computer instructions in the memory, the memory stores computer instructions; when the processor executes the computer instructions in the memory, it is used to implement any one of the implementation methods of the first aspect.

In a sixth aspect, the present application provides a recommendation device, comprising: a processor, a memory, an input/output device, and a bus; the memory stores computer instructions; when the processor executes the computer instructions in the memory, the memory stores computer instructions; when the processor executes the computer instructions in the memory, the device is used to implement any one of the implementation methods of the second aspect.

In the seventh aspect, an embodiment of the present application provides a chip system, which includes a processor and an input/output port, wherein the processor is used to implement the processing functions involved in the method described in the first aspect or the second aspect above, and the input/output port is used to implement the transceiver functions involved in the method described in the first aspect or the second aspect above.

In one possible design, the chip system also includes a memory, which is used to store program instructions and data for implementing the functions involved in the method described in the first aspect or the second aspect above.

The chip system may be composed of chips, or may include chips and other discrete devices.

In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions; when the computer instructions are executed on a computer, the computer executes the method described in any possible implementation of the first aspect or the second aspect.

In a ninth aspect, an embodiment of the present application provides a computer program product. The computer program product includes a computer program or instructions, and when the computer program or instructions are executed on a computer, the computer executes the method described in any possible implementation of the first aspect or the second aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG1 is a schematic diagram of an artificial intelligence main framework used in this application;

FIG2 is a schematic diagram of a system architecture provided by the present application;

FIG3 is a schematic diagram of another system architecture provided by the present application;

FIG4 is a schematic diagram of an application scenario provided by the present application;

FIG5A is a schematic diagram of another application scenario provided by the present application;

FIG5B is a schematic diagram of another application scenario provided by the present application;

FIG6 is a flowchart of a quantification method provided by the present application;

FIG7 is a flowchart of another quantification method provided by the present application;

FIG8 is a flowchart of another quantification method provided by the present application;

FIG9 is a schematic diagram of another application scenario provided by the present application;

FIG10 is a schematic diagram of another application scenario provided by the present application;

FIG11 is a schematic diagram of another application scenario provided by the present application;

FIG12 is a schematic diagram of a flow chart of a recommended method provided by the present application;

FIG13 is a schematic diagram of the structure of a quantization device provided by the present application;

FIG14 is a schematic diagram of the structure of a recommended device provided by the present application;

FIG15 is a schematic diagram of the structure of a quantization device provided by the present application;

FIG16 is a schematic diagram of the structure of a recommended device provided by the present application;

FIG17 is a schematic diagram of the structure of a chip provided in the present application.

Detailed ways

The following will describe the technical solutions in the embodiments of the present application in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work are within the scope of protection of the present application.

The recommendation method provided in this application can be applied to artificial intelligence (AI) scenarios. AI is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.

First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1. Figure 1 shows a structural diagram of the main framework of artificial intelligence. The following is an explanation of the above artificial intelligence theme framework from the two dimensions of "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis). Among them, the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensation process of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecology process of the system.

(1) Infrastructure

The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform includes distributed computing frameworks and networks and other related platform guarantees and support, which can include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.

(2) Data

The data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, text, and IoT data of traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.

Among them, machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.

Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.

Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.

(4) General capabilities

After the data has undergone the data processing mentioned above, some general capabilities can be further formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

(5) Smart products and industry applications

Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, smart cities, etc.

The embodiments of the present application involve related applications of neural networks. In order to better understand the solutions of the embodiments of the present application, the relevant terms and concepts of the neural networks that may be involved in the embodiments of the present application are first introduced below.

(1) Convolutional Neural Network

Convolutional neural network (CNN) is a deep neural network with a convolutional structure. Convolutional neural network contains a feature extractor consisting of a convolution layer and a subsampling layer, which can be regarded as a filter. Convolutional layer refers to the neuron layer in the convolutional neural network that performs convolution processing on the input signal. In the convolutional layer of the convolutional neural network, a neuron can only be connected to some neurons in the adjacent layers. A convolutional layer usually contains several feature planes, each of which can be composed of some rectangularly arranged neural units. The neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract features is independent of position. Convolution kernels can be formalized as matrices of random sizes, and convolution kernels can obtain reasonable weights through learning during the training process of convolutional neural networks. In addition, the direct benefit of shared weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.

(2) Graph Convolutional Network (GCN)

Graph neural network is a deep learning model that models and processes non-Euclidean spatial data (such as graph data). Its principle is to use pairwise message passing so that graph nodes iteratively update their corresponding representations by exchanging information with their neighbors.

GCN is similar to CNN, except that the input of CNN is usually two-dimensional structured data, while the input of GCN is usually graph structured data. GCN has cleverly designed a method to extract features from graph data, so that these features can be used to perform node classification, graph classification, link prediction, and graph embedding.

(3) Loss Function

In the process of training deep neural networks, because we hope that the output of the deep neural network is as close as possible to the value we really want to predict, we can compare the predicted value of the current network with the target value we really want, and then update the weight vector of each layer of the neural network according to the difference between the two (of course, there is usually a process of optimization before the first update, that is, pre-configuring parameters for each layer in the deep neural network). For example, if the predicted value of the network is high, adjust the weight vector to make it predict a lower value, and continue to adjust until the deep neural network can predict the target value we really want or a value very close to the target value we really want. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function or objective function, which are important equations used to measure the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so the training of the deep neural network becomes a process of minimizing this loss as much as possible. The loss function can usually include loss functions such as squared error, cross entropy, logarithm, exponential, etc. For example, the squared error can be used as a loss function, defined as The specific loss function can be selected according to the actual application scenario.

(4) Back propagation algorithm

An algorithm that calculates the gradient of model parameters based on the loss function and updates the model parameters. The neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward transmission of the input signal to the output will generate error loss, and the parameters in the initial neural network model are updated by back propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by error loss, which aims to obtain the optimal parameters of the neural network model, such as the weight matrix.

In the implementation mode of the present application, in the training stage or the inference stage, the BP algorithm can be used to train the model to obtain the training After the model.

(5) Gradient: The derivative vector of the loss function with respect to the parameters.

(6) Stochastic gradient: The number of samples in machine learning is very large, so the loss function is calculated each time based on data obtained by random sampling, and the corresponding gradient is called stochastic gradient.

(7)Embedding: refers to the feature representation of samples or word embedding representation.

(8) Recommendation system: The recommendation system uses machine learning algorithms to analyze and learn based on the user's historical click behavior data, then predicts the user's new requests and returns a personalized item recommendation list.

(9) Model quantization: This is a model compression method that converts high bits into low bits. For example, the model compression technology that converts conventional 32-bit floating-point operations into low-bit integer operations can be called model quantization. For example, when the low bit is quantized to 8 bits, it can be called int8 quantization, that is, a weight originally needs to be represented by float32, but after quantization, it only needs to be represented by int8. In theory, it can achieve 4 times network acceleration. At the same time, 8 bits can reduce 4 times the storage space compared to 32 bits, reducing storage space and computing time, thereby achieving the purpose of compressing the model and accelerating.

(10) Automatic machine learning (AutoML): refers to the design of a series of advanced control systems to operate machine learning models so that the models can automatically learn appropriate parameters and configurations without human intervention. In learning models based on deep neural networks, automatic computational learning mainly includes network architecture search and global parameter setting. Among them, network architecture search is used to allow computers to generate the neural network architecture that best suits the problem based on data, which has the characteristics of high training complexity and great performance improvement.

(11) Corpus: Also known as free text, it can be words, phrases, sentences, fragments, articles, or any combination thereof. For example, “Today’s weather is really nice” is a piece of corpus.

(12) Neural machine translation: Neural machine translation is a typical task in natural language processing. Given a sentence in a source language, the task is to output a corresponding sentence in a target language. In the commonly used neural machine translation model, the words in the sentences of the source language and the target language are encoded into vector representations, and the associations between words and sentences are calculated in the vector space to perform the translation task.

(13) Pre-trained language model (PLM): It is a natural language sequence encoder that encodes each word in a natural language sequence into a vector representation for prediction tasks. The training of PLM consists of two stages, namely the pre-training stage and the fine-tuning stage. In the pre-training stage, the model is trained on language model tasks on large-scale unsupervised text to learn word representation. In the fine-tuning stage, the model is initialized using the parameters learned in the pre-training stage and trained on downstream tasks such as text classification or sequence labeling with fewer steps, so that the semantic information obtained from pre-training can be successfully transferred to downstream tasks.

(14) Click Through Rate (CTR): refers to the probability that a user clicks on a displayed item in a specific environment.

(15) Post-click conversion rate (CVR): refers to the probability that a user converts a clicked item in a specific environment. For example, if a user clicks on the icon of an APP, conversion refers to downloading, installing, registering, etc.

(16)Epoch

Defines the number of times the learning algorithm works on the entire training set. An epoch can be considered the number of times the neural network is trained using the entire training set.

(17) batch;

Closely related to the definition of epoch, an epoch involves training the neural network using the entire dataset, and a batch represents one of the batches of data in an epoch, specifically expressed as batch_size*batchs=epoch. It can be understood that each epoch is divided into one or more batches, and each batch can use part of the data in the training set to train the neural network.

The recommendation method provided in the embodiment of the present application can be executed on a server or on a terminal device. The terminal device can be a mobile phone with image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a video camera, a smart watch, a wearable device (WD) or an autonomous driving vehicle, etc., and the embodiment of the present application does not limit this.

The following introduces the system architecture provided by the embodiments of the present application.

Referring to FIG. 2 , an embodiment of the present application provides a system architecture 200 . As shown in the system architecture 200 , a data acquisition device 260 can be used to collect training data. After the data acquisition device 260 collects the training data, the training data is stored in a database 230 , and the training device 220 trains the target model/rule 201 based on the training data maintained in the database 230 .

The following describes how the training device 220 obtains the target model/rule 201 based on the training data. Exemplarily, the training device 220 processes multiple frames of sample images, outputs corresponding predicted labels, and calculates the loss between the predicted labels and the original labels of the samples, and updates the classification network based on the loss until the predicted labels are close to the original labels of the samples or the difference between the predicted labels and the original labels is less than a threshold, thereby completing the training of the target model/rule 201. For a detailed description, please refer to the training method in the following text.

The target model/rule 201 in the embodiment of the present application can specifically be a neural network. It should be noted that in actual applications, the training data maintained in the database 230 does not necessarily all come from the collection of the data acquisition device 260, and may also be received from other devices. It should also be noted that the training device 220 does not necessarily train the target model/rule 201 entirely based on the training data maintained by the database 230, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation on the embodiments of the present application.

The target model/rule 201 obtained by training the training device 220 can be applied to different systems or devices, such as the execution device 210 shown in FIG. 2 . The execution device 210 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, augmented reality (AR)/virtual reality (VR), a vehicle terminal, a television, etc., or a server or a cloud. In FIG. 2 , the execution device 210 is configured with a transceiver 212, which can include an input/output (I/O) interface or other wireless or wired communication interfaces, etc., for data interaction with external devices. Taking the I/O interface as an example, a user can input data to the I/O interface through the client device 240.

When the execution device 210 preprocesses the input data, or when the computing module 212 of the execution device 210 performs calculations and other related processing, the execution device 210 can call the data, code, etc. in the data storage system 250 for corresponding processing, and can also store the data, instructions, etc. obtained from the corresponding processing into the data storage system 250.

Finally, the transceiver 212 returns the processing result to the client device 240 so as to provide it to the user.

It is worth noting that the training device 220 can generate corresponding target models/rules 201 based on different training data for different goals or different tasks. The corresponding target models/rules 201 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.

In the case shown in FIG. 2 , the user can manually give input data, and the manual giving can be operated through the interface provided by the transceiver 212. In another case, the client device 240 can automatically send input data to the transceiver 212. If the client device 240 is required to automatically send input data, the user can set the corresponding authority in the client device 240. The user can view the results output by the execution device 210 on the client device 240, and the specific presentation form can be a specific method such as display, sound, action, etc. The client device 240 can also be used as a data acquisition terminal to collect the input data of the input transceiver 212 and the output result of the output transceiver 212 as shown in the figure as new sample data, and store it in the database 230. Of course, it is also possible not to collect through the client device 240, but the transceiver 212 directly stores the input data of the input transceiver 212 and the output result of the output transceiver 212 as new sample data in the database 230.

It is worth noting that FIG2 is only a schematic diagram of a system architecture provided in an embodiment of the present application. The positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG2, the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 can also be placed in the execution device 210.

As shown in FIG. 2 , a target model/rule 201 is obtained through training by a training device 220 . In an embodiment of the present application, the target model/rule 201 may be a recommendation model in the present application.

Exemplarily, the system architecture of the application of the neural network training method provided by the present application can be shown in Figure 3. In the system architecture 300, the server cluster 310 is implemented by one or more servers, and optionally, cooperates with other computing devices, such as data storage, routers, load balancers, etc. The server cluster 310 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the steps of the neural network training method provided by the present application.

Users can operate their respective user devices (e.g., local device 301 and local device 302) to interact with server cluster 310. Each local device can represent any computing device, such as a personal computer, a computer workstation, a smart phone, a tablet computer, a smart camera, a smart car or other type of cellular phone, a media consumption device, a wearable device, a set-top box, a game console, etc.

The local device of each user can interact with the server cluster 310 through a communication network of any communication mechanism/communication standard, and the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof. Specifically, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, etc. The wireless network includes, but is not limited to: a fifth-generation mobile communication technology (5th-Generation, 5G) system, a long-term evolution (long term evolution, LTE) system, a global system for mobile communication (global system for mobile communication, GSM) or a code division multiple access (code division multiple access, CDMA) network, a wideband code division multiple access (wideband code division multiple access, WCDMA) network, wireless fidelity (wireless fidelity, WiFi), Bluetooth (bluetooth), Zigbee protocol (Zigbee), radio frequency identification technology (radio frequency identification, RFID), long-range (Lora) wireless communication, and near-field wireless communication (NFC) Any one or more combinations. The wired network may include an optical fiber communication network or a network composed of coaxial cables, etc.

In another implementation, one or more aspects of the execution device 210 may be implemented by each local device. For example, the local device 301 may provide local data or feedback calculation results to the execution device 210 .

It should be noted that all functions of the execution device 210 can also be implemented by the local device. For example, the local device 301 implements the functions of the execution device 210 and provides services to its own user, or provides services to the user of the local device 302.

Generally, a machine learning system can include a personalized recommendation system. Based on input data and labels, the parameters of the machine learning model can be trained through optimization methods such as gradient descent. After the model parameters converge, the model can be used to predict unknown data. Taking the click-through rate prediction in a personalized recommendation system as an example, its input data includes user features, item features, and context features. How to predict a personalized recommendation list based on user preferences has an important impact on improving the user experience of the recommendation system and the platform revenue.

For example, taking the click rate prediction model in the recommendation system as an example, as shown in FIG4 , it can generally include the Embedding and MLP layers, that is, the feature interaction layer, deep neural network layer and prediction layer shown in FIG4 . The Embedding is used to map high-dimensional sparse data to low-dimensional dense vectors, and the MLP layer is generally used to fit the combination relationship and sequence information between features to approximate the actual click rate distribution. Mainstream models are based on the representation of features based on the embedding parameters, and the explicit/implicit combination relationship of features is learned based on the representation. However, the recommendation model has many features, resulting in a large Embedding scale, such as TB level for Internet companies. The embedding representation vocabulary (Embedding table) is too large, and the video memory of a single GPU or NPU computing card is not enough to store all parameters, and multiple nodes are required for distributed storage. However, distributed storage brings new problems: more memory overhead is required; in the training/inference stage, the Embedding parameters need to be pulled through the network, which brings more communication overhead, increases the delay of model calculation, and ultimately affects the recommendation effect.

In order to reduce the memory usage of the Embedding table, the Embedding table can usually be quantized, thereby compressing the Embedding table by reducing the precision.

For example, pruning can be used for compression. Parameter thresholds can be set and parameters in the Embedding table that are below the threshold can be pruned. After pruning the Embedding parameters, retraining can be performed based on the pruned Embedding. However, only the memory in the inference phase is compressed, and the training memory is not compressed; retraining is required, which increases the training cost; and the generated Embedding table is unstructured data and requires special storage.

For another example, compression can be performed based on AutoML, such as adjusting the number of features and the size of different features in the embedding table end-to-end based on the reinforcement learning and differentiable architecture learning method (DARTS) method. After the model converges, the model is retrained. However, the search time is long and the practicality is poor.

For example, in the hash-based compression, high-frequency features are independently assigned embeddings, and low-frequency features are mapped using hash functions, thereby achieving the purpose of compressing the embedding parameters of low-frequency features. However, there may be feature conflicts, resulting in precision loss.

For example, in some low-precision training methods, all parameters in the training process are stored as low-precision parameters, and fp32 full-precision parameters are obtained through dequantization, and then forward and reverse calculations are performed to obtain full-precision gradients, and then the fp32 full-precision parameters are updated according to the learning rate step η to obtain updated parameters. However, when the weight is smaller and smaller, much smaller than the quantization step, deterministic rounding will erase the parameter update, causing the network to be unable to be trained, thereby affecting the training accuracy.

Therefore, the present application provides a quantization method for preserving more parameter information and improving quantization accuracy by setting an adaptive quantization step size.

First, to facilitate understanding, the application scenarios of the method provided in this application are introduced.

Generally, the quantification method provided in this application can be applied to a language model or a recommendation model. The language model may include a neural machine translation or PLM model. The recommendation model may include a click-through rate prediction model, a conversion rate prediction model, etc. For example, The embedding table is used to extract the representation of the input corpus, and then the semantics corresponding to the representation is obtained, followed by further translation or semantic recognition. The subsequent steps can be carried out according to the tasks that the model needs to perform.

Exemplarily, taking the recommendation scenario as an example, the application recommendation framework of the present application can be shown in FIG5A, which can be divided into a training part and an online reasoning part. Among them, in the training part, the training set includes input data and corresponding labels. For example, in the user product recommendation scenario, the training set can include products that the user clicks, collects or likes, and the products that are finally purchased. The training set is input into the initial model, and the parameters of the machine learning model are trained by optimization methods such as gradient descent to obtain a recommendation model. In the online reasoning part, the recommendation model can be deployed on the recommendation platform, such as deployed in a server or terminal. Here, taking the server as an example, the server can be used to output a recommendation list for the user. For example, in the product recommendation scenario, the information of the recommended products for the user can be displayed on the homepage of the user terminal, such as product icons or link titles, etc., or after the user clicks on a product, the icon or link title of the recommended product for the user can be displayed in the recommendation area.

In some application scenarios, the recommendation process can be shown in FIG5B, which may include display lists, logs, offline training, and online predictions. Users perform a series of actions in the front-end display list, such as browsing, clicking, commenting, downloading, etc., to generate behavioral data, which is stored in the log. The recommendation system uses data including user behavior logs to perform offline model training, generates a prediction model after the training converges, deploys the model in an online service environment, and gives recommendation results based on user request access, product features, and contextual information. Then the user generates feedback on the recommendation results to form user data.

In the offline training and online prediction parts, when the model's Embedding table becomes larger, it will lead to an increase in training memory and an increase in computing latency. In order to reduce the memory usage of the Embedding table in both the training and reasoning stages, this application proposes an end-to-end Adaptive Low-Precision Training framework, which can be used to compress the memory of the Embedding table in the recommendation model, including training memory and reasoning memory, thereby reducing the storage overhead of saving, using, and training models.

The process of the quantification method provided in this application is introduced below.

Referring to FIG6 , a flowchart of a quantification method provided by the present application is described as follows.

601. Obtain full-precision embedding representation.

The full-precision embedded representation may include multiple features, and each feature may be represented as one or more sets of feature vectors.

The full-precision embedding representation may include all or part of the features in the embedding table. If the full-precision embedding table is obtained, all or part of the data can be directly read from the full-precision embedding table to obtain the aforementioned full-precision embedding representation. If the low-precision embedding table is obtained, all or part of the features can be read from the low-precision embedding table, and the read features can be dequantized to obtain the full-precision embedding representation.

Generally, the embedding layer in a neural network can be used to map high-dimensional sparse data to low-dimensional dense vectors, specifically by querying the low-dimensional representation corresponding to the input data from the embedding table. It can be understood that the embedding table stores low-dimensional representations of various data. Usually, the input data is high-dimensional sparse data, and the high-dimensional sparse data can be mapped to low-dimensional representations through the embedding table, which is equivalent to splitting the semantics of multiple dimensions included in the input data.

Optionally, during the training process of the neural network, the representation corresponding to the input data of the current iteration can be obtained from the low-precision embedding representation vocabulary to obtain the low-precision embedding representation of the current iteration; the low-precision embedding representation of the current iteration is dequantized to obtain the full-precision embedding representation of the current iteration.

Optionally, the neural network may include a language model or a recommendation model. The language model may include models such as neural machine translation or PLM. The recommendation model may include a click-through rate prediction model, a conversion rate prediction model, etc. Therefore, the method provided in this application can be applied to language processing or recommendation scenarios.

602. Determine an adaptive step size corresponding to each of the multiple features.

Before quantizing the embedding, the adaptive step size corresponding to each feature can be determined.

Optionally, a heuristic algorithm may be used to calculate the adaptive step size corresponding to each feature, or the adaptive step size may be calculated by learning.

The heuristic algorithm may specifically include: calculating the adaptive step size corresponding to each feature according to the absolute value of the weight in each feature. For example, the adaptive quantization step size may be calculated according to the maximum absolute value of the weight in each embedding vector: Where e is the embedding parameter vector, |·| _∞ is the maximum value, the maximum value of the current vector is taken and divided into 2^(m-1) equal parts, and m is the number of bits.

The adaptive compensation is applied to the training process of the neural network by learning and calculating the quantization, such as calculating the adaptive step size in the current iteration according to the weights in the neural network after the current iteration update and the step size updated in the training process of the neural network in the previous iteration. Long, thus achieving higher training accuracy.

Generally, different methods can be used to calculate the adaptive step size in different scenarios. For example, in the scenario of training a neural network, you can choose a heuristic or learning method. For example, if the accuracy requirement is high and there are many training resources, you can choose a learning method to calculate the adaptive step size. If the computational efficiency requirement is high, you can choose a heuristic method for quantization. For example, when saving the Embedding table, you can use a heuristic algorithm to calculate the adaptive step size, so that the adaptive step size can be calculated efficiently without relying on the training-related parameters of the neural network.

In addition, after calculating the adaptive step size corresponding to each feature, the adaptive step size corresponding to each feature can be saved, so that when dequantization is performed later, the low-precision features can be losslessly dequantized based on the adaptive step size to obtain full-precision features.

Optionally, during the training of the neural network, the full-precision embedding representation of the current iteration can be used as the input of the neural network to obtain the full-precision gradient corresponding to the prediction result of the current iteration; the full-precision embedding representation is updated according to the full-precision gradient to obtain the updated full-precision embedding representation; the adaptive step size corresponding to each feature in the updated full-precision embedding representation is obtained according to the full-precision gradient. Therefore, during the training process, the adaptive step size adapted to the updated parameters can be updated in real time according to the updated parameters. Usually, if quantization is performed according to a fixed step size, for scenarios where the parameter update is less than the quantization step size, it may be directly truncated to cause data loss. In the method provided in the present application, when the parameter update is less, the calculation step size can be adaptively calculated based on the updated parameters, so that the parameters with less updates can be retained, which can reduce the loss of precision.

603. Quantize multiple features according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation.

After determining the adaptive step size corresponding to each feature in the full-precision embedded representation, each feature can be quantized based on the adaptive step size corresponding to each feature to obtain a low-precision embedded representation. Therefore, the storage resources or transmission resources of the computing device required to save or transmit the low-precision embedded representation are lower than the storage resources of the computing device required to save or transmit the full-precision embedded representation. The computing device may include a device that executes the quantization method or recommended method provided in this application.

In the implementation of the present application, for each feature in the full-precision embedding table, the corresponding adaptive step size is calculated and quantized according to the adaptive step size. Therefore, when quantizing, quantization can be performed based on the matching adaptive step size. For some features whose number does not match the quantization bit, the adaptive step size can be used for quantization. Compared with quantization using a fixed step size, the use of adaptive step size quantization can reduce precision loss and improve quantization accuracy.

In addition, if the aforementioned steps 601 to 603 are one of the iterative processes of updating the neural network, after the low-precision embedding representation is obtained by quantization, the low-precision embedding representation vocabulary is updated based on the low-precision embedding representation to obtain an updated low-precision embedding representation vocabulary, and the updated low-precision embedding representation is written back to the low-precision embedding table.

The method of the present application can be applied to various model preservation or model training processes. For example, when saving a model, a quantization method provided by the present application can be used to achieve lower precision quantization. Alternatively, in the process of training a model, the quantization method provided by the present application can be used to reduce the amount of data required to be transmitted during training and reduce the required cache space.

For the scenario of quantization before saving the model, you can refer to the steps in Figure 6 above. The following is an exemplary introduction using the process of quantization during model training as an example.

Taking the application in training scenarios as an example, during training, all or part of the features in the Embedding table can be quantized in each iterative training process. Taking one of the iterative training processes as an example, the process of the quantization method provided in this application can be shown in Figure 7.

It should be understood that, in the iterative training process, training can usually be performed in one or more epochs, and each epoch can be divided into multiple batches. In the embodiment of the present application, one of the batches is taken as an example for exemplary introduction.

701. Determine low-precision batch embedding from the low-precision embedding table.

Among them, in a batch, the input data of the current batch training neural network can be used as the input of the embedding layer, and the input data can be mapped into a low-precision, low-dimensional embedding representation through a low-precision embedding table, that is, a low-precision batch Embedding.

702. Dequantize the low-precision batch embedding to obtain the full-precision batch embedding.

After obtaining the low-precision batch embedding, the low-precision batch embedding can be dequantized, that is, the inverse operation of quantization, to obtain the full-precision batch embedding, so that the neural network can obtain the representation corresponding to the input sample based on the full-precision batch embedding.

703. Get the full-precision gradient corresponding to the current batch prediction results of the neural network through full-precision batch embedding.

After obtaining the full-precision batch embedding, during the neural network training process of the current batch, the training samples can be mapped to The full-precision batch embedding of is used as the input of the neural network and the prediction result is output. Then, based on the prediction result and the true label of the input training sample, the value of the loss function is calculated, and the full-precision gradient of the parameters of the neural network in the current batch is calculated based on the value of the loss function.

704. Update the weights of the neural network according to the full-precision gradient to obtain an updated neural network.

After obtaining the full-precision gradient, the weights of the neural network can be updated based on the full-precision gradient to obtain the updated neural network for the current batch.

For example, the parameters of the neural network can be updated through the back propagation algorithm. Usually, the forward transmission of the input signal to the output will generate error loss, and the parameters in the initial neural network model are updated by back propagating the error loss information, so that the error loss converges.

705. Update the full-precision batch Embedding according to the full-precision gradient to obtain a new full-precision batch Embedding and quantization step size.

After obtaining the full-precision gradient, the adaptive step size can be updated based on the full-precision gradient, and the full-precision batch Embedding can be quantized based on the adaptive step size to obtain a new low-precision batch Embedding, and the updated low-precision batch Embedding can be saved in the low-precision Embedding table to achieve low-precision storage and transmission of the Embedding table, thereby reducing the storage space required for saving and transmitting the Embedding table.

Specifically, the adaptive step size can be calculated through learning and combined with the post-weight updated in each iteration, so that the Embedding table can be quantized in real time based on the update process of the neural network, thereby reducing the storage space occupied during training and saving.

Of course, the adaptive step size can also be calculated through heuristic algorithms, such as calculating the adaptive step size corresponding to each feature in the full-precision batch embedding according to the absolute value of the updated full-precision batch embedding weight, so that the adaptive step size can be calculated efficiently and accurately.

706. Quantize the new batch embedding according to the adaptive quantization step size to obtain a new low-precision batch embedding.

After obtaining the adaptive quantization step size, the updated full-precision batch embedding can be quantized based on the adaptive quantization step size to obtain a new low-precision batch embedding.

Optionally, in a specific quantization process, the discrete value of each feature, or discrete feature, can be obtained according to the adaptive step size corresponding to each feature. Subsequently, the discrete feature of each feature can be truncated by a random truncation algorithm to obtain a low-precision Embedding table. When truncating by the random truncation algorithm, the value of each discrete feature determines the truncation value, so that the truncation length matches the update of the feature value. Even if the parameter update amplitude is small, the updated part can be quantized to achieve quantization accuracy.

707. Determine whether convergence has occurred. If so, terminate the iteration. If not, execute step 701.

After each batch of training, it can be determined whether the neural network has converged. If so, the iteration can be terminated, that is, the neural network after the current batch training is output. If not, the neural network has not converged, and the iterative training can continue.

Among them, determining whether the neural network converges can be to judge whether the number of iterations reaches a preset number, the change in the loss value is less than a preset value, or whether the iteration duration reaches a preset duration, etc., which can be determined according to your own application scenario, and this application does not limit this.

Therefore, in the implementation mode of the present application, in the process of training the neural network, the adaptive step size can be updated based on the calculated gradient, and quantization can be performed according to the adaptive step size adapted to each feature, so that the quantization accuracy of each feature can be guaranteed as much as possible, lower precision quantization can be achieved, and the loss of information during quantization can be reduced.

The foregoing describes the process of applying the quantization method provided in the present application to the neural network training process. For ease of understanding, the quantization method provided in the present application is described below in conjunction with more specific recommended scenarios.

Referring to FIG8 , a flow chart of another quantization method provided by the present application is described as follows.

In the forward stage, the recommendation model inputs a batch of high-dimensional sparse data, reads the feature id in the batch data, reads the corresponding batch embedding from the low-precision embedding table, and then obtains the low-precision batch embedding with full-precision representation that can be used for subsequent neural network calculations through inverse quantization; in the reverse stage, the gradient of the current batch embedding is obtained from the upper network, and the batch embedding is updated based on the gradient. Since the embedding table stores low-precision parameters, it is necessary to obtain the low-precision batch embedding through quantization, and then finally write it into the low-precision Embedding Table. The specific steps may include the following.

First, the user's log data 801 is read, and the log data can be used as a training set for the recommendation model.

The user's log data may include information generated when the user uses the client. Usually, when the user uses different clients, Different information can be generated. For example, when a user uses a music app, the information of the music played, clicked, collected or searched by the user can be saved in the user's log; when a user uses a shopping app, the information of the items browsed, collected or purchased by the user can be saved in the user's log; when a user uses an application market, the information of the apps clicked, downloaded, installed or collected by the user can be saved in the user's log, etc.

Subsequently, high-dimensional sparse batch data 802 of the current batch is read from the user log data.

In each batch, part of the user's log data can be extracted as the high-dimensional sparse data of the current batch and used as the training data for the current iteration.

Then, the corresponding low-precision batch embedding is read from the low-precision embedding table803.

Usually, the user's log data is high-dimensional sparse data, so the high-dimensional sparse data can be mapped to low-dimensional features through the embedding table so that the model can recognize each feature and process it. That is, after reading the high-dimensional sparse batch data of the current batch from the log data, the high-dimensional sparse batch data can be mapped to low-dimensional representations through the low-precision embedding table, such as expressed as low-precision batch embedding.

Subsequently, dequantization is performed to obtain the full-precision batch embedding804.

After obtaining the low-precision batch embedding, the low-precision batch embedding is dequantized through the dequantization algorithm to obtain the full-precision batch embedding.

For example, you can use the de-quantization function to get the fp32 full-precision parameters Where Δ is the adaptive step size corresponding to the batch embedding.

Subsequently, the full-precision batch embedding can be used as the input of the recommendation model 805 and the prediction result 806 can be output.

Then, the full-precision gradient of the current batch is calculated based on the prediction result 806, and the batch embedding and quantization step size 807 are updated based on the full-precision gradient of the current batch.

After obtaining the prediction result, the loss value between the prediction result and the true label of the input sample can be calculated, and back propagation is performed based on the loss value to calculate the full-precision gradient of each parameter in the current batch recommendation model. Update the fp32 full-precision parameters in batch embedding according to the learning rate step η Get ω and quantization step size Δ.

Specifically, the adaptive quantization step size may be calculated by a heuristic method or by a learning method.

For example, the steps for heuristically calculating the adaptive step size can be expressed as: Calculate the adaptive quantization step size based on the maximum absolute value of the weight in each embedding vector: Where e is the embedding parameter vector, |·| _∞ is the maximum value, and the physical meaning of this method is to take the maximum value of the current vector and divide it into 2^(m-1) equal parts, and m is the number of bits.

For example, the step of learning to calculate the adaptive quantization step size may include: after the weight is updated, the updated weight and the unupdated quantization step size are trained in a quantization-aware manner to update the quantization step size end-to-end. For example, it can be expressed as:

First update the weight parameters:

The adaptive step size is then updated as expressed as:

Then output the updated embedding parameters Updated adaptive step size And the parameters of the updated recommendation model

It is then quantized to obtain the low-precision batch embedding808 and written back to the Embedding table.

After obtaining the adaptive step size corresponding to each feature, the updated parameter ω can be quantized. The quantization process can be expressed as:

Where m is the number of bits, R() is the stage rounding function, which can usually include multiple types, such as deterministic truncation rounding or random truncation rounding. When the weight is smaller and much smaller than the quantization step, if deterministic truncation rounding is used, the parameter update will be erased, which may cause the network to be unable to be trained. Therefore, this application uses random truncation to perform truncation, which can be expressed as:

The clip function is used to return -2 m ^{-1 when ω/Δ is less than -2 m} ^-1 , and 2 m ^- ^{1 if ω/Δ is greater than 2 m-1} .

In the implementation manner of the present application, a quantization step size is better selected for the Embedding parameters of each feature to retain as much parameter information as possible, helping the model to still ensure convergence during low-precision training. Through lower-precision training, the memory usage and communication overhead of embedding during training and reasoning can be reduced, so that the same memory can accommodate more parameters. In addition, a randomly truncated Rounding function can be used to ensure that the gradient information in the low-precision training process will not be lost due to deterministic truncation. In addition, when updating the adaptive step size, heuristic adaptive quantization step sizes and learning-based adaptive quantization step sizes are provided to adapt to different application scenarios, so as to avoid the need for manual selection of quantization step sizes for different features, thereby improving model training and quantization efficiency.

For ease of understanding, the following exemplarily introduces the effects of the quantification method provided in this application by taking some specific application scenarios as examples.

In a large number of personalized service scenarios, there will be interaction records between users and products based on different types of behaviors. The recommendation model will model the user's multi-behavior interaction history, predict the products that the user may interact with based on the target behavior, and sort the products and display them to the user. Click-through rate prediction can be performed in the manner provided by this application, and the products can be sorted according to the predicted click-through rate and displayed in the recommended page in the sorted order; or the predicted click-through rate value can be sorted and displayed; or the top few click-through rates can be sorted; or each object to be recommended can be scored, and the items can be sorted and displayed according to the score value, etc.

For example, the method provided in the present application can be applied to an APP recommendation scenario. As shown in FIG. 9 , the icon of the recommended app can be displayed in the display interface of the user's terminal to facilitate the user to perform further operations such as clicking or downloading the recommended app, thereby allowing the user to quickly find the required app and improve the user experience.

For another example, the method provided in the present application can be applied to a product recommendation scenario. As shown in FIG10 , the icon of the recommended product can be displayed in the display interface of the user's terminal to facilitate the user to perform further operations such as clicking, adding to cart, or purchasing the recommended product, thereby allowing the user to view the required products and improving the user experience.

For example, the method provided in the present application can be applied to a music recommendation scenario. As shown in FIG11 , an icon of recommended music can be displayed in the display interface of the user's terminal to facilitate the user to perform further operations such as clicking, collecting or playing the recommended music, thereby allowing the user to view more preferred music and improving the user experience.

Taking the click-through rate prediction scenario in the app recommendation scenario as an example, the click-through rate prediction model can usually include two parts: embedding and MLP. The recommended data is high-dimensional and sparse, and the embedding table is large, which will cause problems such as increased memory usage and increased training latency. The commonly used pruning and AutoML methods cannot compress the training memory, and the accuracy of the hash-based method will be lost. The traditional low-precision training method can only use INT16, and does not consider how to use adaptive quantization step size. In the quantization method based on adaptive quantization step size provided by this application, when training the click-through rate prediction model offline, the continuous features are first normalized and then automatically discretized.

During offline training, in each batch, Batch Embedding is taken from the low-precision Embedding Table; low-precision parameters represented by full precision are obtained through inverse quantization calculation, which are used for MLP layer calculation and finally output predicted values; in the training phase, the predicted values are output and the loss function is calculated with the predicted values, and the full-precision gradient of Batch Embedding is obtained by reverse gradient calculation; the Batch Embedding module is updated based on the Batch full-precision gradient, and the quantization step size is adaptively updated; the Batch Embedding is quantized into low-precision parameters based on the adaptive quantization step size; and the low-precision Batch Embedding is then written back to the embedding table.

In the online inference stage, the embedding corresponding to the input data can be read from the low-precision embedding table, and dequantized to obtain the full-precision embedding. The full-precision embedding is used as the input of the click-through rate prediction model to output the prediction result.

For example, some public data sets are used as examples to compare some existing quantization methods with the quantization method provided by the present application, such as using the Avazu data set and the Criteo data set. The statistical information of the data sets can be shown in Table 1.

Table 1

The training set and test set in the dataset are divided according to users, with 90% of users as training sets and 10% of users as test sets. Discrete features are one-hot encoded and continuous features are discretized. Evaluation indicators include AUC (Area Under Curve).

Some existing quantization methods include: full precision method (Full Precision, FP), quantization-aware method (LSQ), quantization-aware method based on dynamic step size (PACT), INT8 low precision training method (LPT) and INT16 low precision training method (LPT-16), etc. The quantization method provided in this application can be based on different adaptive step size calculation methods, such as: heuristic adaptive step size INT8 low precision training method (ALPT_H) and learnable adaptive step size INT8 low precision training method (ALPT_L).

The comparison results can be shown in Table 2:

Table 2

Among them, the deterministic Rounding function is used in the above Table 2, and the random stage Rounding function achieves better results in low-precision training, as shown in Table 3.

table 3

By comparing Table 2 and Table 3 above, the existing low-precision training method adopts deterministic truncation and does not consider adaptive quantization step size. It can only perform low-precision parameter training based on INT16, which makes it difficult for the model to converge during lower precision training. For example, the Embedding parameters in the compression inference stage need to be retrained, which has low practicality. Although some quantization methods can compress parameters through the hash method, the accuracy is low due to the inevitable collision of the hash function. Although some quantization methods can train the model through INT16, lower precision training is often difficult to converge. In order to perform lower precision low-precision training end-to-end, this application proposes to use a randomly truncated Rounding function to ensure the parameter update of gradient information during training; and proposes to assign an adaptive quantization step size to each feature to better select the quantization step size to retain as much parameter information as possible.

In addition, based on the aforementioned quantification method, the present application also provides a recommendation method, as shown in FIG12 , which may specifically include:

1201. Get input data.

The input data may include data generated by at least one behavior of the user on the terminal.

For example, when a user clicks on or plays a piece of music, information about the user clicking on the music can be collected; or when a user downloads or installs an app, information about the user downloading or installing the app can be collected.

1202. Get the low-precision embedding corresponding to the input data from the low-precision embedding table.

After obtaining the input data, the input data can be converted into features that can be recognized by the neural network through the embedding table. The low-precision embedding table usually stores the mapping relationship between the original data and the representation. After obtaining the input data, the embedding table can be used to convert the input data into features that can be recognized by the neural network. relation, mapping the input data into low-precision embeddings.

1203. Dequantize multiple features according to the adaptive step size corresponding to each feature to obtain full-precision embedding.

After obtaining the low-precision embedding, each feature can be dequantized according to the adaptive step size corresponding to each feature to obtain the full-precision embedding.

The inverse quantization step may refer to step 702 in FIG. 7 or step 804 in FIG. 8 , which will not be described in detail here.

1204. Output recommendation information based on the full-precision embedding as the input of the neural network.

After obtaining the full-precision embedding, the obtained full-precision embedding can be used as the input of the recommendation network to output the corresponding recommendation information.

The above is an introduction to the method flow provided by the present application. Based on the above method flow, the following is an introduction to the device provided by the present application.

Referring to FIG. 13 , a schematic diagram of a quantization device provided by the present application includes:

An acquisition module 1301 is used to acquire a full-precision embedded representation, where the embedded representation includes multiple features;

A determination module 1302 is used to determine an adaptive step size corresponding to each of the multiple features;

The quantization module 1303 is used to quantize multiple features according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation, where the accuracy of the features in the low-precision embedded representation is lower than the accuracy of the features in the full-precision embedded representation.

The acquisition module 1301 is specifically used to obtain the representation corresponding to the input data of the current iteration from the low-precision embedding representation vocabulary to obtain the low-precision embedding representation of the current iteration; dequantize the low-precision embedding representation of the current iteration to obtain the full-precision embedding representation of the current iteration.

In one possible implementation, the determination module 1302 is specifically used to: use the full-precision embedding representation of the current iteration as the input of the neural network to obtain the full-precision gradient corresponding to the prediction result of the current iteration; obtain the updated full-precision embedding representation according to the full-precision gradient to obtain the updated full-precision embedding representation; obtain the adaptive step size corresponding to each feature in the updated full-precision embedding representation according to the full-precision gradient.

In a possible implementation, the quantization module 1303 is specifically configured to quantize multiple features in the full-precision low-dimensional representation of the current iteration according to the adaptive step size corresponding to each feature, so as to obtain a low-precision embedded representation.

In a possible implementation, the determination module 1302 is specifically configured to calculate the adaptive step size corresponding to each feature by using a heuristic algorithm.

In a possible implementation, the determination module 1302 is specifically configured to calculate the adaptive step size corresponding to each feature according to the absolute value of the weight in each feature.

In a possible implementation, the quantization module 1303 is specifically used to: obtain a discrete feature of each feature according to the adaptive step size corresponding to each feature; and truncate the discrete feature of each feature by a random truncation algorithm to obtain a low-precision embedded representation.

Referring to FIG. 14 , a schematic diagram of a recommended device provided by the present application includes:

An input module 1401 is used to obtain input data, where the input data includes data generated by at least one behavior of a user on a terminal;

An acquisition module 1402 is used to acquire a low-precision embedding representation corresponding to the input data from a low-precision embedding representation vocabulary, where the low-precision embedding representation includes multiple features;

A dequantization module 1403, configured to dequantize the multiple features according to an adaptive step size corresponding to each of the multiple features to obtain a full-precision embedded representation;

The recommendation module 1404 is used to output recommendation information based on the full-precision embedding representation as the input of the neural network, and the recommendation information is used to At least one behavior of the user is recommended.

Please refer to FIG. 15 , which is a schematic diagram of the structure of another quantization device provided in the present application, as described below.

The recommendation device may include a processor 1501 and a memory 1502. The processor 1501 and the memory 1502 are interconnected via a line. The memory 1502 stores program instructions and data.

The memory 1502 stores program instructions and data corresponding to the steps in the aforementioned FIGS. 6 to 8 .

The processor 1501 is used to execute the method steps performed by the quantization device shown in any of the embodiments in FIG. 6 to FIG. 8 .

Optionally, the recommendation device may further include a transceiver 1503 for receiving or sending data.

A computer-readable storage medium is also provided in an embodiment of the present application. The computer-readable storage medium stores a program, which, when executed on a computer, enables the computer to execute the steps of the method described in the embodiments shown in the aforementioned Figures 6 to 8.

Optionally, the recommended device shown in the aforementioned FIG. 15 is a chip.

Please refer to FIG. 16 , which is a schematic diagram of the structure of another recommended device provided by the present application, as described below.

The recommendation device may include a processor 1601 and a memory 1602. The processor 1601 and the memory 1602 are interconnected via a line. The memory 1602 stores program instructions and data.

The memory 1602 stores program instructions and data corresponding to the steps in FIG. 12 .

The processor 1601 is used to execute the method steps performed by the recommendation device shown in FIG. 12 .

Optionally, the recommendation device may further include a transceiver 1603 for receiving or sending data.

A computer-readable storage medium is also provided in an embodiment of the present application. The computer-readable storage medium stores a program, which, when executed on a computer, enables the computer to execute the steps of the method described in the embodiment shown in FIG. 12 above.

Optionally, the recommended device shown in the aforementioned FIG. 16 is a chip.

An embodiment of the present application also provides a recommendation device, which can also be called a digital processing chip or chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit. The processing unit is used to execute the method steps of the aforementioned Figure 11.

An embodiment of the present application also provides a recommendation device, which can also be called a digital processing chip or chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit. The processing unit is used to execute the method steps of the aforementioned Figure 12.

The embodiment of the present application also provides a digital processing chip. The digital processing chip integrates a circuit and one or more interfaces for implementing the functions of the above-mentioned processor 1501, processor 1601, or processor 1501, processor 1601. When the digital processing chip integrates a memory, the digital processing chip can complete the method steps of any one or more embodiments in the above-mentioned embodiments. When the digital processing chip does not integrate a memory, it can be connected to an external memory through a communication interface. The digital processing chip implements the recommendation device or the action performed by the recommendation device in the above-mentioned embodiment according to the program code stored in the external memory.

An embodiment of the present application also provides a computer program product, which, when executed on a computer, enables the computer to execute the steps of the method described in the embodiments shown in the aforementioned Figures 6 to 12.

The recommendation device or recommendation device provided in the embodiment of the present application may be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored in the storage unit so that the chip in the server executes the method steps described in the embodiments shown in the above-mentioned Figures 6 to 12. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc. The storage unit may also be a storage unit located outside the chip in the wireless access device end, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.

Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any Any conventional processor, etc.

For example, please refer to FIG. 17, which is a schematic diagram of a structure of a chip provided in an embodiment of the present application. The chip can be a neural network processor NPU 170, which is mounted on the host CPU (Host CPU) as a coprocessor and assigned tasks by the Host CPU. The core part of the NPU is the operation circuit 1703, which is controlled by the controller 1704 to extract matrix data from the memory and perform multiplication operations.

In some implementations, the operation circuit 1703 includes multiple processing units (process engines, PEs) inside. In some implementations, the operation circuit 1703 is a two-dimensional systolic array. The operation circuit 1703 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1703 is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit takes the corresponding data of matrix B from the weight memory 1702 and caches it on each PE in the operation circuit. The operation circuit takes the matrix A data from the input memory 1701 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 1708.

The unified memory 1706 is used to store input data and output data. The weight data is directly transferred to the weight memory 1702 through the direct memory access controller (DMAC) 1705. The input data is also transferred to the unified memory 1706 through the DMAC.

The bus interface unit (BIU) 1710 is used for the interaction between the AXI bus and the DMAC and instruction fetch buffer (IFB) 1709.

The bus interface unit 1710 (BIU) is used for the instruction fetch memory 1709 to obtain instructions from the external memory, and is also used for the storage unit access controller 1705 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1706 or to transfer weight data to the weight memory 1702 or to transfer input data to the input memory 1701.

The vector calculation unit 1707 includes multiple operation processing units, which further process the output of the operation circuit when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as batch normalization, pixel-level summation, upsampling of feature planes, etc.

In some implementations, the vector calculation unit 1707 can store the processed output vector to the unified memory 1706. For example, the vector calculation unit 1707 can apply a linear function and/or a nonlinear function to the output of the operation circuit 1703, such as linear interpolation of the feature plane extracted by the convolution layer, and then, for example, a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 1707 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1703, for example, for use in a subsequent layer in a neural network.

An instruction fetch buffer 1709 connected to the controller 1704, for storing instructions used by the controller 1704;

Unified memory 1706, input memory 1701, weight memory 1702 and instruction fetch memory 1709 are all on-chip memories. External memories are private to the NPU hardware architecture.

Among them, the operations of each layer in the recurrent neural network can be performed by the operation circuit 1703 or the vector calculation unit 1707.

The processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the methods of Figures 6 to 12 above.

It should also be noted that the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. In addition, in the drawings of the device embodiments provided by the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

Through the description of the above implementation methods, technicians in the relevant field can clearly understand that the present application can be implemented by means of software plus necessary general-purpose hardware. Of course, it can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, etc. In general, all functions performed by computer programs can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for the present application, software program implementation is a better implementation method in most cases. Based on this understanding, the technical solution of the present application is essentially In other words, the part that contributes to the prior art can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, etc., and includes a number of instructions for enabling a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present application.

In the above embodiments, all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented by software, all or part of the embodiments may be implemented in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the process or function described in the embodiment of the present application is generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can store or a data storage device such as a server or data center that includes one or more available media integrated. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive (SSD)), etc.

The terms "first", "second", "third", "fourth", etc. (if any) in the specification and claims of the present application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way can be interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions, for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to those steps or units that are clearly listed, but may include other steps or units that are not clearly listed or inherent to these processes, methods, products or devices.

Claims

A quantification method, characterized by comprising:

Obtaining a full-precision embedding representation, wherein the embedding representation includes multiple features;

Determine an adaptive step size corresponding to each of the multiple features;

The multiple features are quantized respectively according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation, and the accuracy of the features in the low-precision embedded representation is lower than the accuracy of the features in the full-precision embedded representation.
The method according to claim 1, characterized in that the low-precision embedding representation vocabulary is applied to a neural network,

The step of obtaining a full-precision embedding representation vocabulary includes:

Obtaining a representation corresponding to the input data of the current iteration from the low-precision embedding representation vocabulary to obtain a low-precision embedding representation of the current iteration;

The low-precision embedded representation of the current iteration is dequantized to obtain the full-precision embedded representation of the current iteration.
The method according to claim 2, characterized in that the step of determining the adaptive step size corresponding to each of the multiple features comprises:

Using the full-precision embedding representation of the current iteration as the input of the neural network, and obtaining the full-precision gradient corresponding to the prediction result of the current iteration;

Acquire and update the full-precision embedding representation according to the full-precision gradient to obtain an updated full-precision embedding representation;

The adaptive step size corresponding to each feature in the updated full-precision embedding representation is obtained according to the full-precision gradient.
The method according to claim 3, characterized in that the quantizing of the multiple features respectively according to the adaptive step size corresponding to each feature comprises:

According to the adaptive step size corresponding to each feature, multiple features in the full-precision low-dimensional representation of the current iteration are quantized to obtain the low-precision embedded representation.
The method according to any one of claims 2 to 4, characterized in that the method further comprises:

The low-precision embedding representation vocabulary is updated according to the low-precision embedding representation to obtain an updated low-precision embedding representation vocabulary.
The method according to claim 1, characterized in that the step of determining the adaptive step size corresponding to each of the multiple features comprises:

The adaptive step size corresponding to each feature is calculated by a heuristic algorithm.
The method according to claim 6, characterized in that the step of calculating the adaptive step size corresponding to each feature by a heuristic algorithm further comprises:

The adaptive step size corresponding to each feature is calculated according to the absolute value of the weight in each feature.
The method according to claims 1-7 is characterized in that the multiple features are quantized respectively according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation vocabulary, and further includes:

Obtaining discrete features of each feature according to the adaptive step size corresponding to each feature;

The discrete features of each feature are truncated by a random truncation algorithm to obtain the low-precision embedding representation.
The method according to claims 1-8 is characterized in that the low-precision embedded representation vocabulary is applied to a language model or a recommendation model, the language model is used to obtain semantic information of the corpus, and the recommendation model is used to generate recommendation information based on user information.
A recommendation method, characterized by comprising:

Acquiring input data, where the input data includes data generated by at least one behavior of a user on a terminal;

Acquire a low-precision embedding representation corresponding to the input data from a low-precision embedding representation vocabulary, wherein the low-precision embedding representation includes multiple features;

Dequantizing the multiple features respectively according to the adaptive step size corresponding to each of the multiple features to obtain a full-precision embedded representation;

Output recommendation information based on the full-precision embedded representation as an input of a neural network, where the recommendation information is used to make a recommendation for the at least one behavior of the user.
The method according to claim 10 is characterized in that the neural network includes a language model or a recommendation model, the language model is used to obtain semantic information of the corpus, and the recommendation model is used to generate recommendation information based on user information.
A quantization device, characterized by comprising:

An acquisition module, configured to acquire a full-precision embedded representation, wherein the embedded representation includes multiple features;

A determination module, used to determine the adaptive step size corresponding to each of the multiple features;

A quantization module is used to quantize the multiple features respectively according to the adaptive step size corresponding to each feature to obtain a low-precision embedded representation, wherein the accuracy of the features in the low-precision embedded representation is lower than the accuracy of the features in the full-precision embedded representation.
The device according to claim 12, characterized in that the low-precision embedding representation vocabulary is applied to a neural network,

The acquisition module is specifically used for:

Obtaining a representation corresponding to the input data of the current iteration from the low-precision embedding representation vocabulary to obtain a low-precision embedding representation of the current iteration;

The low-precision embedded representation of the current iteration is dequantized to obtain the full-precision embedded representation of the current iteration.
The device according to claim 13, characterized in that the determining module is specifically used to:

Using the full-precision embedding representation of the current iteration as the input of the neural network, and obtaining the full-precision gradient corresponding to the prediction result of the current iteration;

Acquire and update the full-precision embedding representation according to the full-precision gradient to obtain an updated full-precision embedding representation;

The adaptive step size corresponding to each feature in the updated full-precision embedding representation is obtained according to the full-precision gradient.
The device according to claim 14, characterized in that

The quantization module is specifically used to quantize multiple features in the full-precision low-dimensional representation of the current iteration according to the adaptive step size corresponding to each feature, so as to obtain the low-precision embedded representation.
The device according to any one of claims 13-15 is characterized in that the acquisition module is also used to update the low-precision embedding representation vocabulary according to the low-precision embedding representation to obtain an updated low-precision embedding representation vocabulary.
The device according to claim 12, characterized in that

The determination module is specifically used to calculate the adaptive step size corresponding to each feature through a heuristic algorithm.
The device according to claim 17, characterized in that

The determination module is specifically used to calculate the adaptive step size corresponding to each feature according to the absolute value of the weight in each feature.
The device according to claims 12-18, characterized in that the quantization module is specifically used to:

Obtaining discrete features of each feature according to the adaptive step size corresponding to each feature;

The discrete features of each feature are truncated by a random truncation algorithm to obtain the low-precision embedding representation.
The device according to claims 12-19, characterized in that the low-precision embedding representation vocabulary is applied to a language model or The recommendation model is used to obtain the semantic information of the corpus, and the recommendation model is used to generate recommendation information based on the user's information.
A recommendation device, characterized by comprising:

An input module, used to obtain input data, wherein the input data includes data generated by at least one behavior of a user on a terminal;

An acquisition module, configured to acquire a low-precision embedding representation corresponding to the input data from a low-precision embedding representation vocabulary, wherein the low-precision embedding representation includes multiple features;

A dequantization module, configured to dequantize the multiple features according to an adaptive step size corresponding to each of the multiple features to obtain a full-precision embedded representation;

A recommendation module is used to output recommendation information based on the full-precision embedding representation as an input of a neural network, wherein the recommendation information is used to recommend the at least one behavior of the user.
The device according to claim 21 is characterized in that the neural network includes a language model or a recommendation model, the language model is used to obtain semantic information of the corpus, and the recommendation model is used to generate recommendation information based on user information.
A quantization device, characterized in that the communication processing device comprises: a processor, the processor is coupled to a memory;

The memory is used to store computer programs;

The processor is used to execute the computer program stored in the memory so that the scheduling device performs the recommendation method according to any one of claims 1 to 9.
A recommendation device, characterized in that the communication processing device comprises: a processor, the processor is coupled to a memory;

The memory is used to store computer programs;

The processor is used to execute the computer program stored in the memory so that the scheduling device performs the recommendation method according to any one of claims 10 to 11.
A computer program product comprising instructions, characterized in that when the computer program product is run on a computer, the computer is caused to execute the method according to any one of claims 1 to 11.
A computer-readable storage medium, characterized in that it comprises instructions, and when the instructions are executed on a computer, the computer is caused to execute the method according to any one of claims 1 to 11.