CN114064840A - Data processing method, medium, device and computing equipment - Google Patents

Data processing method, medium, device and computing equipment

Info

Publication number
CN114064840A
Authority
CN
China
Prior art keywords: modulus, value, target, results, hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111372458.6A
Other languages
Chinese (zh)
Inventor
张娟
许盛辉
潘照明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Media Technology Beijing Co Ltd
Original Assignee
Netease Media Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Media Technology Beijing Co Ltd
Priority to CN202111372458.6A
Publication of CN114064840A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present disclosure provide a data processing method. The data processing method comprises the following steps: determining a hash value corresponding to a target feature value by a hash mapping method; performing modulo processing on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results; and determining a target numerical value corresponding to the target feature value according to the plurality of modulo results. By combining the hash mapping method with multiple modulo operations, the method of the present disclosure, on one hand, retains the hash mapping method's advantages in value conversion of controllable resource overhead and strong extensibility, and, on the other hand, reduces the probability that the numerical values obtained by converting different feature values collide. Furthermore, embodiments of the present disclosure provide a medium, an apparatus, and a computing device.

Description

Data processing method, medium, device and computing equipment
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a data processing method, a medium, an apparatus, and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Categorical features mainly refer to features expressed as character strings, such as gender, region, platform, user ID, and item ID. A specific value of a categorical feature is called a feature value; for example, the feature values of the gender feature are male and female. The feature values of categorical features are thus essentially text-type data. Computers, however, do not handle text data well, and the text data needs to be converted into numerical data. How to convert the text data of categorical features into numerical data is one of the main problems in handling categorical features.
In the feature engineering of deep learning and machine learning, the feature values of categorical features are first converted from text-type data into numerical data; feature embedding is then performed on the numerical data based on an embedding table to obtain low-dimensional dense embedding vectors; and finally the feature values of the categorical features are represented by these embedding vectors. Currently, in converting the feature values of categorical features from text data into numerical data, a hash algorithm is used to convert the n feature values of a categorical feature into values of fixed length, in order to reduce resource overhead and improve extensibility.
However, the hash algorithm is a compressing mapping: its output space is smaller than its input space, different feature values may be converted into the same value, and value collisions therefore occur to a certain extent.
Disclosure of Invention
The present disclosure provides a data processing method, medium, device, and computing apparatus, to address the problem that value collisions occur to a certain extent when a hash algorithm is used to convert feature values.
In a first aspect of embodiments of the present disclosure, there is provided a data processing method, comprising: determining a hash value corresponding to a target feature value by a hash mapping method; performing modulo processing on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results; and determining a target numerical value corresponding to the target feature value based on the plurality of modulo results.
In an embodiment of the present disclosure, determining the target numerical value corresponding to the target feature value according to the plurality of modulo results comprises: performing feature embedding processing on the plurality of modulo results to obtain a plurality of embedding vectors; and combining the plurality of embedding vectors to obtain the target numerical value.
In another embodiment of the present disclosure, a plurality of embedding tables are pre-constructed, different moduli correspond to different embedding tables, and each embedding table stores a mapping between values and embedding vectors. Performing feature embedding processing on the plurality of modulo results to obtain the plurality of embedding vectors comprises: looking up, in the embedding table corresponding to the nth modulus among the plurality of moduli of different sizes, the embedding vector corresponding to the nth modulo result among the plurality of modulo results, to obtain the plurality of embedding vectors, where n ranges from 1 to K and K is the total number of the moduli of different sizes.
In yet another embodiment of the present disclosure, the embedding vectors in the embedding tables corresponding to different moduli have the same dimension but different counts.
In yet another embodiment of the present disclosure, combining the plurality of embedding vectors to obtain the target numerical value comprises: horizontally concatenating the plurality of embedding vectors to obtain the target numerical value.
In another embodiment of the present disclosure, a count table is pre-constructed, and after performing modulo processing on the hash value based on the plurality of moduli of different sizes to obtain the plurality of modulo results, the method further comprises: during model training, recording the occurrence counts of the plurality of modulo results in the count table.
In another embodiment of the present disclosure, there are a plurality of count tables, different moduli correspond to different count tables, and recording the occurrence counts of the plurality of modulo results in the count table during model training comprises: during model training, incrementing, in the count table corresponding to the nth modulus, the occurrence count of the nth modulo result among the plurality of modulo results, where n ranges from 1 to K and K is the total number of the moduli of different sizes.
In another embodiment of the present disclosure, a count table is constructed in advance, where the count table is used to record the occurrence counts of modulo results obtained by modulo processing based on the plurality of moduli of different sizes during model training, and after determining the target numerical value corresponding to the target feature value according to the plurality of modulo results, the method further comprises: during model application, filtering the target feature value based on the count table.
In another embodiment of the present disclosure, there are a plurality of count tables, different moduli correspond to different count tables, each count table being used to record the occurrence counts of modulo results obtained, during model training, by modulo processing based on the modulus corresponding to that count table, and filtering the target feature value based on the count tables during model application comprises: during model application, obtaining the occurrence counts of the plurality of modulo results from the count tables respectively corresponding to the plurality of moduli of different sizes; determining the occurrence count of the target feature value during model training as the minimum among the occurrence counts of the plurality of modulo results; and if the occurrence count of the target feature value during model training is smaller than a count threshold, determining the target numerical value corresponding to the target feature value to be zero.
In yet another embodiment of the present disclosure, the feature to which the target feature value belongs is a categorical feature, and the target feature value is text-type data.
In a second aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the data processing method according to the first aspect or any embodiment thereof is implemented.
In a third aspect of embodiments of the present disclosure, there is provided a data processing apparatus comprising: a hash unit, configured to determine a hash value corresponding to a target feature value by a hash mapping method; a modulo unit, configured to perform modulo processing on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results; and a determining unit, configured to determine a target numerical value corresponding to the target feature value according to the plurality of modulo results.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising: at least one processor and a memory; the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the data processing method according to the first aspect or any embodiment thereof.
Embodiments of the present disclosure use the idea of multi-way hashing: in converting a feature value into a numerical value, the hash value corresponding to the feature value is determined by a hash mapping method, and a plurality of moduli are used to perform modulo processing on that hash value to obtain a plurality of modulo results, which reduces the probability that the modulo results of different feature values collide. The numerical value corresponding to the feature value is then determined based on the plurality of modulo results, thereby effectively reducing the probability that the numerical values corresponding to different feature values collide. Meanwhile, the advantages of the hash mapping method, namely controllable resource overhead and strong extensibility, are retained.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates an application scenario provided according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flowchart of a data processing method provided according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of a data processing method provided according to another embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of a data processing method provided according to another embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of a data processing method provided according to another embodiment of the present disclosure;
FIG. 6 schematically shows a structural diagram of a storage medium provided according to an embodiment of the present disclosure;
FIG. 7 schematically shows a structural diagram of a data processing apparatus provided according to an embodiment of the present disclosure;
FIG. 8 schematically shows a structural diagram of a computing device provided according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a data processing method, a medium, a device and a computing device are provided.
The terms referred to herein are to be understood as follows:
(1) Categorical features
Categorical features mainly refer to features expressed as character strings, such as gender, region, platform, user ID, and item ID. A specific value of a categorical feature is called a feature value; for example, the feature values of the "region" feature are "place A", "place B", or "place C".
The feature values of categorical features are thus essentially text-type data. Computers, however, cannot directly process text data well, and in the feature engineering of machine learning or deep learning the text data often needs to be converted into numerical data. For example, for the "region" feature, the feature values "place A", "place B", and "place C" are converted into 0, 1, and 2, respectively. There are currently various methods for converting the text data of categorical features into numerical data.
(2) Hash algorithm
A hash algorithm (also called a hashing or digest algorithm) can convert information of any length into a value of fixed length, and is a common means in deep learning of converting text data into numerical data. However, the hash algorithm is a compressing mapping; in other words, its output space is much smaller than its input space, so different input values may yield the same hash value, producing value collisions.
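To make the term concrete, the following is a minimal Python sketch of ours (not code from the patent) that hashes string feature values to fixed-length integers; `hashlib.md5` is an arbitrary illustrative choice, and the `hash_value` helper is reused in later sketches.

```python
# A minimal sketch (ours, not the patent's): hash text features to
# fixed-length integers. hashlib.md5 is an arbitrary illustrative choice.
import hashlib

def hash_value(feature_value: str) -> int:
    """Map a string of any length to a fixed-length 64-bit integer."""
    digest = hashlib.md5(feature_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Different inputs can collide once the hash is compressed, e.g. by
# taking it modulo a small embedding-table size:
print(hash_value("user_12345") % 100)
print(hash_value("user_67890") % 100)  # may coincide with the line above
```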
(3) Characteristic embedding (embedding)
Feature embedding represents a discrete value by a low-dimensional dense vector called an embedding vector. In a deep neural network, neurons cannot handle high-dimensional discrete values well, so feature values must be converted from high-dimensional discrete values into embedding vectors before being fed into the network.
In feature embedding, an m x dim embedding table is first constructed, where m is the number of vectors (reflecting the range of the discrete values) and dim is the vector dimension; then, for a given discrete value, the corresponding embedding vector is obtained by looking it up in the embedding table.
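As a minimal illustration of the lookup just described (all names and sizes below are our assumptions, not the patent's):

```python
# An illustrative m x dim embedding table with a lookup; sizes and
# initialization are example assumptions.
import numpy as np

m, dim = 1000, 16  # m discrete values, dim-dimensional vectors
embedding_table = np.random.normal(size=(m, dim)).astype(np.float32)

def embed(discrete_value: int) -> np.ndarray:
    """Look up the embedding vector of a discrete value in [0, m)."""
    return embedding_table[discrete_value]

vec = embed(42)  # a dense 16-dimensional vector
```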
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of the Invention
In processing categorical features, the feature values of the categorical features are first mapped from text data into numerical data; feature embedding is performed on the numerical data to obtain low-dimensional dense embedding vectors; and the feature values are finally represented by the embedding vectors. The number of distinct numerical values determines the size of the embedding table used in feature embedding.
Currently, there are two main methods for converting feature values from text data into numerical data.
Method 1: vocabulary mapping.
In this method, all feature values appearing in the samples are counted, and a unique numerical value is assigned to each feature value. For example, P feature values are mapped to P values ranging from 0 to P-1.
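A minimal sketch of vocabulary mapping under these assumptions (all names and example values are ours):

```python
# Vocabulary mapping sketch: each distinct feature value gets a unique
# consecutive id in order of first appearance.
vocab: dict[str, int] = {}

def vocab_map(feature_value: str) -> int:
    """Assign ids 0..P-1 in order of first appearance."""
    if feature_value not in vocab:
        vocab[feature_value] = len(vocab)  # table grows with each new value
    return vocab[feature_value]

ids = [vocab_map(v) for v in ["place_A", "place_B", "place_A", "place_C"]]
# ids == [0, 1, 0, 2]: one-to-one, hence no collisions, but the vocabulary
# (and the embedding table behind it) grows with the number of distinct values.
```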
The inventors find that vocabulary mapping realizes a one-to-one mapping of feature values from text to numerical values, and the numerical values also map one-to-one to embedding vectors in the embedding table, so vocabulary mapping has no value-collision problem. It has, however, the following drawbacks:
(1) As the number of feature values grows, the size of the embedding table grows with it, resulting in very large resource overhead. For example, when dealing with high-dimensional sparse features such as user IDs, the number of feature values is on the order of tens of millions or even billions, and vocabulary mapping produces a very large embedding table.
(2) All feature values need to be counted up front. In the online learning of a deep model, however, new samples are generated continuously after training begins, so all feature values cannot be counted before training. If a new feature value appearing during training is to be handled, there are two approaches. One is to map the new feature value to a default value, such as 0, but this harms the learning effect of the model. The other is to reserve enough space in the embedding table for feature values that may appear in the future; when a new feature value appears, it is mapped to a new value through dynamic management and reclamation. However, once the number of feature values exceeds the reserved space, the embedding table must be resized and the model retrained. The vocabulary mapping method therefore scales poorly to new feature values.
Method 2: hash mapping.
In this method, a hash algorithm is used: a categorical feature value serves as the input of the hash algorithm, a hash value is generated by hash computation, and the hash value is then taken modulo a modulus to obtain the final numerical value. The size of the embedding table is thus controlled by the modulus applied to the hash value.
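For illustration, this baseline can be sketched as follows, reusing the `hash_value` helper from the earlier sketch; the modulus `T` is an assumed example size:

```python
# Plain hash-mapping baseline: hash the feature value, then take a
# single modulus T to bound the embedding-table size.
T = 10_000  # assumed embedding-table size, fixed regardless of Q

def hash_map(feature_value: str) -> int:
    return hash_value(feature_value) % T  # numerical value in [0, T)
```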
The inventors find that the hash mapping method has the following advantages:
(1) Compared with vocabulary mapping, hash mapping reduces resource overhead. Assuming the number of feature values is Q and the modulus is T, hash mapping ultimately maps the Q feature values into values ranging from 0 to T-1, and the size of the embedding table is fixed at T and does not grow as feature values are added; since T < Q, the embedding table is smaller than the number of feature values.
(2) The hash algorithm can convert input data of any length, so all feature values need not be counted in advance, and hash mapping has no extensibility shortfall.
The inventors also find that the hash mapping method suffers from value collisions to a certain extent: different input data may yield the same numerical value. For example, when Q feature values are mapped to T values with T < Q, collisions between values are inevitable. Decreasing T shrinks the embedding table but raises the collision probability; conversely, increasing T lowers the collision probability but enlarges the embedding table and increases resource overhead.
To reduce the probability of collisions when feature values are converted into numerical values while avoiding excessive resource overhead, the method of the present disclosure uses the idea of multi-way hashing: in converting a feature value into a numerical value, a plurality of moduli are used to perform modulo processing on the hash value corresponding to the feature value, yielding a plurality of modulo results, and the numerical value corresponding to the feature value is then determined based on those results. On one hand, this retains the advantages of the hash mapping method, namely controllable resource overhead and strong extensibility; on the other hand, multi-way hashing effectively reduces the probability of value collisions.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Application Scenario Overview
Referring first to FIG. 1, FIG. 1 schematically illustrates an application scenario provided according to an embodiment of the present disclosure. The application scenario is a feature-value processing scenario, for example an application scenario of a deep learning model (i.e., a prediction scenario of the deep learning model) or a training scenario of a deep learning model. As shown in FIG. 1, the devices involved in the application scenario include a data processing device 101. The data processing device 101 is an electronic device, such as a server or a terminal; FIG. 1 shows the data processing device 101 by way of example.
In processing a feature value, the data processing device 101 performs operations such as hash mapping and modulo processing on the feature value of a sample, thereby converting the feature value of the sample into numerical data. The data processing device 101 may also perform model application or model training based on the converted feature values.
For example, in an application scenario of a deep learning model, the sample is a test sample; in training a deep learning model, the samples are training samples. A feature value of a sample is, for example, the value of a feature such as the sample's gender or region.
Optionally, as shown in FIG. 1, the devices involved in the application scenario to which embodiments of the present disclosure apply may further include a data input/output device 102. The data input/output device 102 is, for example, a terminal that communicates with the data processing device 101 via a network. In processing feature values, the data input/output device 102 may transmit sample information, which includes the sample's feature values, to the data processing device 101; the data input/output device 102 may also receive model output data transmitted by the data processing device 101 and output it.
The terminal may be a personal digital assistant (PDA) device, a handheld device with wireless communication capability (e.g., a smartphone or tablet), a computing device (e.g., a personal computer (PC)), an in-vehicle device, a wearable device (e.g., a smart watch or smart band), a smart home device (e.g., a smart display device), and the like.
Exemplary method
A data processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 5 in conjunction with the application scenario of fig. 1. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Illustratively, the method embodiments of the present disclosure may be applied to an electronic device, such as a terminal or a server.
Referring to FIG. 2, FIG. 2 schematically shows a flowchart of a data processing method provided according to an embodiment of the present disclosure. As shown in FIG. 2, the data processing method includes:
S201, determining a hash value corresponding to the target feature value by a hash mapping method.
The target feature value is the feature value currently to be processed.
Taking model training as an example, the target feature value is a feature value of one or more features of a training sample. Taking model application as an example, the target feature value is a feature value of one or more features of a prediction sample.
For example, in a user-behavior prediction task, whether during model training or model application, user features such as the user's gender, the user's region, and the user's occupation are required. Therefore, in model training or model application, the feature values of these user features need to be processed, and the processed feature values are then input into the user-behavior prediction model. The feature values of user features such as gender, region, and occupation can each be regarded as a target feature value during this processing.
In this embodiment, the target feature value may be input into the hash function of the hash mapping algorithm to obtain the hash value output by the hash function, so that the target feature value is mapped to the hash value. The hash function can convert a feature value of any length into a hash value of fixed length, and is highly extensible to newly appearing feature values. The specific form of the hash function is not limited here.
S202, performing modulo processing on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results.
A modulus is the divisor in a modulo operation; moduli of different sizes are divisors of different values.
In this embodiment, after the hash value corresponding to the target feature value is obtained, modulo processing is performed on the hash value based on a plurality of preset moduli of different sizes to obtain a plurality of modulo results. In each modulo operation, a modulus serves as the divisor and the hash value as the dividend; the division is performed and the remainder is taken as the modulo result. For example, the hash value is taken modulo a first modulus to obtain a first modulo result, and modulo a second modulus to obtain a second modulo result.
As an example, assume that the target feature value is V_f. The conversion of the target feature value into a hash value and then into a plurality of modulo results can be expressed as:
V_hash = hash(V_f)
id_1 = V_hash mod m_1
id_2 = V_hash mod m_2
......
id_n = V_hash mod m_n
......
id_K = V_hash mod m_K
where V_hash denotes the hash value corresponding to the target feature value, hash() denotes the hash function, K is the total number of moduli, m_n denotes the nth modulus, id_n denotes the nth modulo result, and 1 ≤ n ≤ K.
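For illustration only, the multi-way modulo step of these formulas can be sketched as follows, reusing the `hash_value` helper from the earlier sketch; the K = 4 moduli are arbitrary assumed values, since the patent only requires several moduli of different sizes (pairwise coprime choices further reduce the chance of simultaneous collisions):

```python
# Multi-way modulo: take the hash value modulo each of K moduli.
MODULI = [99_991, 99_989, 99_971, 99_961]  # assumed example moduli, K = 4

def multi_mod_ids(feature_value: str) -> list[int]:
    v_hash = hash_value(feature_value)       # helper sketched earlier
    return [v_hash % m for m in MODULI]      # [id_1, ..., id_K]
```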
Compared with the single modulo operation of the plain hash mapping method, this embodiment performs multiple modulo operations based on a plurality of moduli of different sizes, which effectively reduces the probability that the modulo results of different target feature values collide. The analysis is as follows: two different target feature values yield two different hash values. In a single modulo operation, both hash values are compressed into a small range, so the probability that they compress to the same modulo result is relatively high. When K modulo operations are performed on the two hash values based on K different moduli, the probability that all K modulo results of one hash value collide with (i.e., equal) the corresponding K modulo results of the other is low. In other words, when two different hash values are taken modulo K times and the K moduli are chosen appropriately, the two hash values may collide in some one modulo operation, but the modulo results of the remaining operations are unlikely to collide as well.
S203, determining a target numerical value corresponding to the target feature value according to the plurality of modulo results.
In this embodiment, after the plurality of modulo results are obtained, the target numerical value corresponding to the target feature value can be determined by combining them. The purpose of converting the target feature value into a numerical value is thus achieved.
In one example, the plurality of modulo results are concatenated to obtain the target numerical value.
In another example, the plurality of modulo results are weighted and summed to obtain the target numerical value.
This embodiment uses the idea of multi-way hashing: a plurality of moduli are used to perform modulo processing on the hash value corresponding to the target feature value, yielding a plurality of modulo results, and the numerical value corresponding to the target feature value is then determined based on those results. The advantages of the hash mapping method, controllable resource overhead and strong extensibility, are retained, while the probability of value collisions is effectively reduced.
Referring to FIG. 3, FIG. 3 schematically shows a flowchart of a data processing method provided according to another embodiment of the present disclosure. As shown in FIG. 3, the data processing method includes:
S301, determining a hash value corresponding to the target feature value by a hash mapping method.
S302, performing modulo processing on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results.
For the implementation principles and technical effects of S301 and S302, refer to the foregoing embodiments; they are not repeated here.
S303, performing feature embedding processing on the plurality of modulo results to obtain a plurality of embedding vectors.
In this embodiment, considering that neurons in a neural network cannot handle high-dimensional discrete values well and that the plurality of modulo results corresponding to the target feature value are such high-dimensional discrete values, after the plurality of modulo results are obtained, feature embedding may be performed on each of them to obtain the corresponding low-dimensional dense embedding vectors, yielding a plurality of embedding vectors.
In this embodiment, during feature embedding, the embedding vectors corresponding to the plurality of modulo results may be looked up in pre-constructed embedding tables. Since the collision probability of the modulo results of different target feature values is reduced, the collision probability of the corresponding sets of embedding vectors is reduced accordingly.
In some embodiments, a plurality of embedding tables are pre-constructed, with different moduli corresponding to different embedding tables.
In this case, one possible implementation of S303 includes: looking up, in the embedding table corresponding to the nth modulus among the plurality of moduli of different sizes, the embedding vector corresponding to the nth modulo result among the plurality of modulo results, to obtain the plurality of embedding vectors, where n ranges from 1 to K and K is the total number of the moduli of different sizes.
The nth modulo result is obtained by performing modulo processing on the hash value corresponding to the target feature value based on the nth modulus among the plurality of moduli of different sizes; that is, it is the modulo result corresponding to the nth modulus among the target feature value's plurality of modulo results.
The embedding table corresponding to the nth modulus stores the embedding vectors for the different values determined by the size of the nth modulus. For example, when the size of a modulus is 10, the modulo results obtained with that modulus are integers from 0 to 9, and the embedding table corresponding to that modulus stores the embedding vectors for those integers.
In this embodiment, the embedding vector corresponding to the first modulo result may be looked up in the embedding table corresponding to the first modulus, the embedding vector corresponding to the second modulo result in the embedding table corresponding to the second modulus, and so on, so as to obtain the plurality of embedding vectors corresponding to the target feature value. By pre-constructing different embedding tables for different moduli, the size of each embedding table is effectively controlled.
Optionally, where different moduli correspond to different embedding tables, the embedding vectors in the tables corresponding to different moduli have the same dimension but different counts.
The number of embedding vectors in the embedding table corresponding to a modulus depends on the modulus's size: when the modulus is m, the hash mapping algorithm maps the target feature value into the value range 0 to m-1, so the table contains m embedding vectors. Since the moduli used in this embodiment have different sizes, the numbers of embedding vectors in their embedding tables differ.
To control the sizes of the embedding tables, the dimension required for the neural network's input data can be divided equally by the total number of moduli of different sizes, giving the dimension of the embedding vectors in the table corresponding to each modulus. The sizes of the embedding tables are thus effectively controlled, avoiding excessive resource overhead.
For example, when the neural network requires input data of dimension dim and the total number of moduli of different sizes is K, the dimension of the embedding vectors in the table corresponding to each modulus is dim/K. The sizes of the embedding tables corresponding to the moduli can then be expressed as [n_1, dim/K], [n_2, dim/K], ..., [n_n, dim/K], ..., [n_K, dim/K], where n_n denotes the number of embedding vectors in the embedding table corresponding to the nth modulus.
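Under these assumptions, the per-modulus embedding tables and the lookup of S303 can be sketched as follows, continuing the earlier `MODULI` sketch; `dim` is an assumed example size:

```python
# One embedding table per modulus: table n has MODULI[n] rows of dim/K
# columns, mirroring the [n_n, dim/K] sizes described above.
import numpy as np

dim, K = 32, len(MODULI)  # dim is an assumed example value
tables = [np.random.normal(size=(m, dim // K)).astype(np.float32)
          for m in MODULI]

def lookup_vectors(ids: list[int]) -> list[np.ndarray]:
    """Fetch the n-th modulo result's vector from the n-th table."""
    return [tables[n][ids[n]] for n in range(K)]
```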
S304, combining the plurality of embedding vectors to obtain the target numerical value.
In this embodiment, after the plurality of embedding vectors corresponding to the target feature value are obtained, they are combined to obtain the target numerical value corresponding to the target feature value. The purpose of converting the target feature value into a numerical value is thus achieved, while the probability that different feature values are converted into the same numerical value is reduced.
In some embodiments, one possible implementation of S304 includes: horizontally concatenating the plurality of embedding vectors to obtain the target numerical value.
The sum of the dimensions of the plurality of embedding vectors equals the input-data dimension the neural network requires for the target feature value.
In this embodiment, the plurality of embedding vectors corresponding to the target feature value may be horizontally concatenated in a preset order, and the concatenated vector is the target numerical value. For example, the second embedding vector is concatenated after the first, the third after the second, and so on. Assuming each embedding vector has dimension dim/K, the target numerical value obtained after concatenation has dimension dim, where dim is the required input-data dimension and K is the total number of moduli of different sizes.
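A sketch of the concatenation step, continuing the example above:

```python
# Horizontal concatenation of the K sub-vectors into one dim-sized
# target numerical value.
import numpy as np

def target_value(ids: list[int]) -> np.ndarray:
    return np.concatenate(lookup_vectors(ids))  # shape: (dim,)
```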
In some embodiments, besides horizontal concatenation, the target numerical value may also be obtained by weighted summation, weighted averaging, or similar operations over the plurality of embedding vectors.
In this embodiment, the hash value corresponding to the target feature value is determined by a hash mapping method; modulo processing is performed on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results; feature embedding is performed on each modulo result to obtain a plurality of embedding vectors; and the target numerical value is finally obtained from the embedding vectors. The purpose of converting feature values into corresponding numerical values is thus achieved, making the feature values convenient for models in fields such as deep learning and neural networks to process. During the conversion, on one hand, reducing the collision probability of the modulo results of different feature values effectively reduces the probability that the final numerical values of different feature values collide; on the other hand, the advantages of the hash algorithm in value conversion, controllable resource overhead and strong extensibility, are retained.
Processing feature values involves two scenarios: processing them in model training, and processing them in model application (also called model prediction). In both cases, any of the foregoing embodiments may be used.
The inventors find that, in model application, new feature values that did not appear during training, or feature values that were insufficiently trained, occasionally introduce large deviations if input into the model, making the model's performance unstable in application. However, directly using the hash mapping method cannot distinguish whether a feature value appeared during training and was sufficiently trained.
To solve this problem, the present disclosure proposes using count tables to record the occurrence counts of feature values during model training, and using the count tables to filter insufficiently trained feature values during model application, thereby improving the stability of the model's performance in application. This is described through the following embodiments. The specific type and structure of the model are not limited here.
Referring to FIG. 4, FIG. 4 schematically shows a flowchart of a data processing method provided according to another embodiment of the present disclosure. As shown in FIG. 4, during model training the data processing method includes:
S401, determining a hash value corresponding to the target feature value by a hash mapping method.
During model training, the target feature value is a feature value of a training sample.
In this embodiment, in the model training stage, a target feature value is obtained from the feature values of the training samples. Before the target feature value is input into the model, it is converted into the corresponding hash value by a hash mapping method; for the specific implementation principles and technical effects, refer to the foregoing embodiments.
S402, performing modulo processing on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results.
For the implementation principles and technical effects of S402, refer to the foregoing embodiments.
S403, recording the occurrence counts of the plurality of modulo results in the count table.
The count table is used to record the occurrence counts of the modulo results obtained by modulo processing based on the plurality of moduli of different sizes during model training. Before model training starts, the occurrence count of every value in the count table may be initialized to 0.
In this embodiment, after the plurality of modulo results corresponding to the target feature value are obtained, their occurrence counts may be recorded in the count table, realizing a record of how often feature values occur during the model training stage. On one hand, compared with directly recording the occurrence counts of the feature values themselves, recording the occurrence counts of their modulo results requires neither counting all feature values in advance nor reserving space in the count table for new feature values; it is more extensible toward new feature values, and the size of the count table is more controllable (in this approach, the size of the count table depends on the sizes of the moduli used for modulo processing, so with fixed moduli the count table's size is also fixed). On the other hand, using multi-way hashing (i.e., taking the feature value's hash modulo each of a plurality of moduli of different sizes) and recording the occurrence counts of the plurality of modulo results, rather than of a single modulo result, effectively reduces the error rate of the count table, improves the accuracy of the counting, and thereby improves the filtering of feature values in model application.
In this embodiment, after a target feature value is hash-mapped and taken modulo the plurality of moduli of different sizes to obtain the plurality of modulo results, the occurrence count of each of those modulo results in the count table may be incremented by one. For example, if after hash mapping and the multiple modulo operations the modulo results corresponding to target feature value a1 are 1, 2, 3, and 4, the occurrence counts of 1, 2, 3, and 4 in the count table are each incremented by one.
In some embodiments, there are a plurality of count tables, and different moduli correspond to different count tables. Each count table is used to record the occurrence counts of the modulo results obtained, during model training, by modulo processing based on the modulus corresponding to that table. In this case, one possible implementation of S403 includes: in the count table corresponding to the nth modulus, incrementing by one the occurrence count of the nth modulo result among the plurality of modulo results, where n ranges from 1 to K and K is the total number of moduli of different sizes.
In this embodiment, if the occurrence counts of the modulo results obtained under different moduli were all accumulated in the same count table, the modulo results from different moduli would interfere with one another's counts, the counts could not truly reflect how often a feature value occurred, and the counting accuracy would be low.
For example, assume the target feature value a2 is the first target feature value and its modulo results are 1, 1, 1, and 3; the occurrence counts of these modulo results are recorded. If different moduli correspond to different count tables, the occurrence count of each recorded result in its own table is 1, accurately reflecting that the target feature value a2 occurred once. If different moduli share the same count table, the occurrence count of 1 in that table becomes 3 and that of 3 becomes 1; there is a deviation from a2's actual occurrence count, and the deviation grows as the number of target feature values increases, so the error rate of the count table becomes higher and higher.
Therefore, in this embodiment different moduli correspond to different count tables, and the nth count table records the occurrence counts of the modulo results obtained with the nth modulus. In one possible implementation, after the hash value of a target feature value is taken modulo K moduli of different sizes, K modulo results are obtained; for these K modulo results, the occurrence count of the nth modulo result is incremented by one in the count table corresponding to the nth modulus.
For example, assume K is 4. After hash mapping and the multiple modulo operations, the 4 modulo results corresponding to target feature value a1 are 1, 2, 3, and 4: the occurrence count of 1 is incremented in count table b1, of 2 in count table b2, of 3 in count table b3, and of 4 in count table b4. Then, after hash mapping and the multiple modulo operations, the 4 modulo results corresponding to target feature value a2 are 1, 1, 1, and 3: the occurrence count of 1 is incremented in count table b1, of 1 in count table b2, of 1 in count table b3, and of 3 in count table b4. And so on; counting accuracy is improved by using a count table for each of the plurality of moduli.
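A sketch of these per-modulus count tables, continuing the earlier `MODULI` / `multi_mod_ids` example (names are ours):

```python
# Count table n records how often each modulo result of modulus n
# occurred during training; all counts start at zero.
counts = [[0] * m for m in MODULI]

def record(feature_value: str) -> None:
    for n, idx in enumerate(multi_mod_ids(feature_value)):
        counts[n][idx] += 1  # increment in the n-th modulus's own table
```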
S404, determining a target numerical value corresponding to the target feature value according to the plurality of modulo results.
In this embodiment, the plurality of modulo results of the target feature value are combined to obtain the target numerical value corresponding to the target feature value, so that the model can process the target feature value by processing the target numerical value.
For the implementation principles and technical effects of combining the plurality of modulo results into the target numerical value, refer to the foregoing embodiments.
S403 and S404 may be executed in parallel, or in either order (S403 before S404, or S404 before S403).
In this embodiment, in the model training stage, the hash value corresponding to the target feature value is determined by a hash mapping method; modulo processing is performed on the hash value based on a plurality of moduli of different sizes to obtain a plurality of modulo results; the occurrence counts of the modulo results are recorded in the count tables; and the target numerical value is obtained based on the modulo results. The feature value is thus converted into a corresponding numerical value for processing by models in fields such as deep learning and neural networks. On one hand, the probability that the final numerical values of different feature values collide is effectively reduced; on the other hand, the advantages of the hash algorithm in value conversion, controllable resource overhead and strong extensibility, are retained; in yet another aspect, the count tables reflect how often feature values occurred during training, which facilitates improving the stability of the model's performance in model application.
Referring to FIG. 5, FIG. 5 schematically shows a flowchart of a data processing method provided according to another embodiment of the present disclosure. As shown in FIG. 5, in model application the data processing method includes:
S501, determining a hash value corresponding to the target feature value by a hash mapping method.
In model application, the target feature value is a feature value of a prediction sample.
In this embodiment, in model application, a target feature value is obtained from the feature values of the prediction samples. Before the target feature value is input into the model, it is converted into the corresponding hash value by a hash mapping method; for the specific implementation principles and technical effects, refer to the foregoing embodiments.
S502, carrying out modulus processing on the hash value based on a plurality of moduli with different sizes to obtain a plurality of modulus results.
S503, determining a target numerical value corresponding to the target characteristic value according to the plurality of modulus results.
The implementation principle and the technical effect of S502 to S503 may refer to the foregoing embodiments, and are not described again.
S504, filtering the target characteristic value based on the frequency statistics table.
The frequency statistics table records the occurrence counts of the modulus results obtained by performing modulus processing based on the plurality of moduli with different sizes during model training; the recording process may refer to the foregoing embodiments and is not described again.
In this embodiment, after the plurality of modulus results corresponding to the target characteristic value are obtained, the occurrence counts of those modulus results during model training may first be looked up in the frequency statistics table. The occurrence count of the target characteristic value in model training may then be determined from the occurrence counts of the plurality of modulus results; this count reflects whether the target characteristic value was sufficiently trained. The target characteristic value can therefore be filtered according to its occurrence count in model training, which avoids the problem that an insufficiently trained target characteristic value degrades the stability of the model effect in the model application stage.
The filtered target characteristic value is then input into the model for feature processing.
In one example, the occurrence count of the target characteristic value may be determined as the average, median, or mode of the occurrence counts of the plurality of modulus results corresponding to the target characteristic value.
In another example, each time the target characteristic value appears during model training, the occurrence counts of all of its corresponding modulus results in the frequency statistics tables are incremented by one, while collisions with other characteristic values can only push those counts higher. The true occurrence count of the target characteristic value is therefore less than or equal to the minimum of the occurrence counts of its modulus results, so the occurrence count of the target characteristic value can be determined as that minimum, improving the accuracy of the estimate.
When filtering the target characteristic value according to its occurrence count in model training, the occurrence count can be compared with a count threshold; if the occurrence count of the target characteristic value is smaller than the count threshold, the target characteristic value is discarded.
In one example, the target characteristic value can be filtered by setting its corresponding target numerical value to zero, which avoids the adverse effect of an insufficiently trained target characteristic value on the stability of the model effect in the model application stage.
Optionally, when the plurality of modulus results corresponding to the target characteristic value are to be feature-embedded into a plurality of embedded vectors that are then combined into the target numerical value, the target numerical value corresponding to the target characteristic value (i.e., the combined embedded vector) may be set to a zero vector, thereby filtering the target characteristic value.
In some embodiments, there are a plurality of frequency statistics tables, and different moduli correspond to different frequency statistics tables. In this case, one possible implementation of S504 includes: in the model application stage, obtaining the occurrence counts of the plurality of modulus results corresponding to the target characteristic value from the frequency statistics tables respectively corresponding to the plurality of moduli with different sizes; determining the occurrence count of the target characteristic value in model training as the minimum of the occurrence counts of the plurality of modulus results; and if the occurrence count of the target characteristic value in model training is smaller than the count threshold, setting the target numerical value corresponding to the target characteristic value to zero.
Each of the plurality of frequency statistics tables counts the occurrences of the modulus results obtained by performing modulus processing based on the modulus corresponding to that table during model training; the recording process may refer to the foregoing embodiments and is not described again.
In this embodiment, in the model application stage, after the plurality of modulus results corresponding to the target characteristic value are obtained, for the nth modulus result, its occurrence count during model training may be obtained from the frequency statistics table corresponding to the nth modulus, where n ranges from 1 to K and K is the total number of the plurality of moduli with different sizes. The occurrence counts of all the modulus results corresponding to the target characteristic value are thereby obtained. The occurrence count of the target characteristic value in model training is then determined as the minimum of the occurrence counts of the plurality of modulus results. If this count is smaller than the count threshold, the target numerical value corresponding to the target characteristic value is set to zero, and the target characteristic value is filtered out. Maintaining a separate frequency statistics table per modulus thus improves the accuracy of the occurrence counts of characteristic values, and in turn the accuracy of filtering the target characteristic value.
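The following is a minimal sketch of this filtering step, assuming the dict-based count tables from the counting sketch above and per-modulus embedding tables stored as numpy arrays of shape (modulus, dim); the function name and the way the threshold is passed are illustrative, not from the disclosure.

```python
import numpy as np

def lookup_with_filtering(remainders, count_tables, embed_tables, threshold):
    # Estimate the feature value's training-time count as the minimum of its
    # K counters: every occurrence bumps all K, and collisions only inflate them.
    counts = [table.get(r, 0) for table, r in zip(count_tables, remainders)]
    if min(counts) < threshold:
        # Under-trained: filter by returning a zero vector of the right shape.
        dim = embed_tables[0].shape[1]
        return np.zeros(len(remainders) * dim, dtype=embed_tables[0].dtype)
    # Otherwise look up one row per modulus result and splice them horizontally.
    return np.concatenate([embed_tables[k][r] for k, r in enumerate(remainders)])
```

Returning a zero vector rather than dropping the sample keeps the model's input shape fixed, which matches the zero-vector filtering described above.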
In this embodiment, in the model application stage, a hash value corresponding to the target characteristic value is determined by a hash mapping method, the hash value is subjected to modulus processing based on a plurality of moduli with different sizes to obtain a plurality of modulus results, and the target numerical value corresponding to the target characteristic value is determined based on the plurality of modulus results, so that the target characteristic value is converted into a numerical value. The target characteristic value is further filtered by using the frequency statistics table, which avoids the problem that a target characteristic value insufficiently trained in model training degrades the stability of the model effect in the model application stage.
In some embodiments, based on any of the preceding embodiments, the target characteristic value is text-type data. The target characteristic value is thereby converted from text-type data into numerical data, and the problem of numerical collisions in the conversion process is alleviated, while the resource overhead remains controllable and the scheme remains strongly extensible.
In some embodiments, the feature to which the target feature value belongs is a categorical feature. The problems of numerical collisions, high resource overhead and the like in categorical feature processing are thereby alleviated.
Exemplary Medium
Having described the method of the exemplary embodiment of the present disclosure, next, a storage medium of the exemplary embodiment of the present disclosure will be described with reference to fig. 6.
Referring to fig. 6, a storage medium 60 stores therein a program product for implementing the above method according to an embodiment of the present disclosure, which may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN).
Exemplary devices
After introducing the media of the exemplary embodiment of the present disclosure, a data processing apparatus of the exemplary embodiment of the present disclosure is described next with reference to fig. 7, which is used for implementing the data processing method in any of the method embodiments described above, and the implementation principle and the technical effect are similar, and are not described again here.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the data processing apparatus includes:
a hash unit 701, configured to determine, by using a hash mapping method, a hash value corresponding to the target feature value;
a modulus unit 702, configured to perform modulus processing on the hash value based on a plurality of different moduli to obtain a plurality of modulus results;
the determining unit 703 is configured to determine, according to the multiple modulo results, a target value corresponding to the target characteristic value.
In some embodiments, the determining unit 703 is specifically configured to: performing characteristic embedding processing on the plurality of modulus taking results to obtain a plurality of embedded vectors; and combining the plurality of embedded vectors to obtain a target numerical value.
In some embodiments, a plurality of embedding tables are pre-constructed, different moduli correspond to different embedding tables, and each embedding table stores mapping relationships between numerical values and embedding vectors. The determining unit 703 is specifically configured to: look up, in the embedding table corresponding to the nth modulus among the plurality of moduli with different sizes, the embedding vector corresponding to the nth modulus result among the plurality of modulus results, so as to obtain the plurality of embedding vectors, where n ranges from 1 to K and K is the total number of the plurality of moduli with different sizes.
In some embodiments, the embedding vectors in the embedding tables corresponding to different moduli have the same dimension, but the tables contain different numbers of embedding vectors (each table needs one entry per possible remainder of its modulus, so its size equals that modulus).
In some embodiments, the determining unit 703 is specifically configured to: horizontally concatenate (splice) the plurality of embedding vectors to obtain the target numerical value.
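As an illustration of per-modulus embedding tables that share a dimension but differ in row count, here is a small numpy sketch of the lookup-and-splice performed by the determining unit; the moduli and dimension are deliberately tiny and hypothetical (a real system would use far larger moduli).

```python
import numpy as np

rng = np.random.default_rng(0)
moduli = [97, 101, 103]  # hypothetical toy moduli
dim = 8                  # shared embedding dimension

# One table per modulus: same vector dimension, different row counts --
# each table holds exactly one row per possible remainder of its modulus.
embed_tables = [rng.normal(size=(m, dim)).astype(np.float32) for m in moduli]

def target_value(remainders: list[int]) -> np.ndarray:
    """Look up one embedding per modulus result and splice them horizontally."""
    return np.concatenate([embed_tables[k][r] for k, r in enumerate(remainders)])

vec = target_value([7, 42, 9])  # shape: (len(moduli) * dim,) == (24,)
```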
In some embodiments, a frequency statistics table is pre-constructed. The data processing apparatus further includes: a statistical unit 704, configured to, in model training, record the occurrence counts of the plurality of modulus results in the frequency statistics table after the hash value is subjected to modulus processing based on the plurality of moduli with different sizes to obtain the plurality of modulus results.
In some embodiments, there are a plurality of frequency statistics tables, and different moduli correspond to different frequency statistics tables. The statistical unit 704 is specifically configured to: in model training, increment by one, in the frequency statistics table corresponding to the nth modulus, the occurrence count of the nth modulus result among the plurality of modulus results, where n ranges from 1 to K and K is the total number of the plurality of moduli with different sizes.
In some embodiments, a frequency statistics table is constructed in advance and records the occurrence counts of the modulus results obtained by performing modulus processing based on the plurality of moduli with different sizes during model training. The data processing apparatus further includes: a filtering unit 705, configured to filter the target characteristic value based on the frequency statistics table in the model application stage.
In some embodiments, there are a plurality of frequency statistics tables, different moduli correspond to different frequency statistics tables, and each frequency statistics table records the occurrence counts of the modulus results obtained by performing modulus processing based on the modulus corresponding to that table during model training. The filtering unit 705 is specifically configured to: in the model application stage, obtain the occurrence counts of the plurality of modulus results from the frequency statistics tables respectively corresponding to the plurality of moduli with different sizes; determine the occurrence count of the target characteristic value in model training as the minimum of the occurrence counts of the plurality of modulus results; and if the occurrence count of the target characteristic value in model training is smaller than the count threshold, set the target numerical value corresponding to the target characteristic value to zero.
In some embodiments, the feature to which the target feature value belongs is a categorical feature, and the target feature value is text-type data.
Exemplary computing device
Having described the methods, media, and apparatus of the exemplary embodiments of the present disclosure, a computing device of the exemplary embodiments of the present disclosure is described next with reference to fig. 8.
The computing device 80 shown in fig. 8 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the disclosure.
As shown in fig. 8, computing device 80 is embodied in the form of a general purpose computing device. Components of computing device 80 may include, but are not limited to: the at least one processing unit 801 and the at least one memory unit 802, and a bus 803 connecting the various system components (including the processing unit 801 and the memory unit 802).
The bus 803 includes a data bus, a control bus, and an address bus.
The storage unit 802 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 8021 and/or cache memory 8022, and may further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 8023.
Storage unit 802 can also include a program/utility 8025 having a set (at least one) of program modules 8024, such program modules 8024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 80 may also communicate with one or more external devices 804 (e.g., keyboard, pointing device, etc.). Such communication may be through input/output (I/O) interfaces 805. Moreover, computing device 80 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 806. As shown in fig. 8, a network adapter 806 communicates with the other modules of the computing device 80 via the bus 803. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 80, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the data processing apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of data processing, comprising:
determining a hash value corresponding to the target characteristic value by a hash mapping method;
performing modulus processing on the hash value based on a plurality of moduli with different sizes to obtain a plurality of modulus results;
and determining a target numerical value corresponding to the target characteristic value according to the plurality of modulus results.
2. The data processing method according to claim 1, wherein the determining a target numerical value corresponding to the target characteristic value according to the plurality of modulus results comprises:
performing feature embedding processing on the multiple modulus results to obtain multiple embedding vectors;
and combining the plurality of embedded vectors to obtain the target numerical value.
3. The data processing method according to claim 2, wherein a plurality of embedding tables are pre-constructed, different moduli correspond to different embedding tables, mapping relationships between numerical values and embedding vectors are stored in the embedding tables, and the performing feature embedding processing on the plurality of modulus results to obtain a plurality of embedding vectors comprises:
and searching an embedding vector corresponding to the nth modulus in the modulus results in the embedding table corresponding to the nth modulus in the modulus with different sizes to obtain a plurality of embedding vectors, wherein n is changed from 1 to K, and K is the total number of the modulus with different sizes.
4. The data processing method of claim 3, wherein the embedding vectors in the embedding tables corresponding to different moduli have the same dimension, and the embedding tables contain different numbers of embedding vectors.
5. The data processing method of claim 2, wherein said combining the plurality of embedded vectors to obtain the target value comprises:
and performing transverse splicing processing on the plurality of embedded vectors to obtain the target numerical value.
6. The data processing method according to any one of claims 1 to 5, wherein a frequency statistics table is pre-constructed, and after performing modulus processing on the hash value based on a plurality of moduli with different sizes to obtain a plurality of modulus results, the method further comprises:
in model training, recording the occurrence times of the plurality of modulus results in the times statistical table.
7. The data processing method according to claim 6, wherein there are a plurality of frequency statistics tables, different moduli correspond to different frequency statistics tables, and the recording the occurrence counts of the plurality of modulus results in the frequency statistics table in model training comprises:
in the model training, in a frequency statistical table corresponding to an nth module, adding one to the occurrence frequency of the nth module result in the plurality of module results, wherein the value of n is changed from 1 to K, and K is the total number of the plurality of modules with different sizes.
8. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement a data processing method as claimed in any one of claims 1 to 7.
9. A data processing apparatus comprising:
the hash unit is used for determining a hash value corresponding to the target characteristic value through a hash mapping method;
the modulus unit is used for performing modulus processing on the hash value based on a plurality of moduli with different sizes to obtain a plurality of modulus results;
and the determining unit is used for determining a target numerical value corresponding to the target characteristic value according to the plurality of modulus taking results.
10. A computing device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the data processing method of any of claims 1 to 7.