CN106651408B - Data analysis method and device - Google Patents


Info

Publication number
CN106651408B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510713052.8A
Other languages
Chinese (zh)
Other versions
CN106651408A (en)
Inventor
张研
杨冠军
蒋程诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING SUNING ELECTRONIC INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
Suning Cloud Computing Co Ltd
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201510713052.8A
Publication of CN106651408A
Application granted
Publication of CN106651408B

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a data analysis method and device, relates to the technical field of internet, and can correct the estimation error and improve the accuracy of the pushed data. The method of the invention comprises the following steps: extracting log information of a push server, and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content; updating the weight value of each dimension by using two mutually exclusive sample sets through a logistic regression online learning algorithm; and determining a pushing result according to the updated weight value of each dimension. The method and the device are suitable for improving the accuracy of the pushed content.

Description

Data analysis method and device
Technical Field
The invention relates to the technical field of internet, in particular to a data analysis method and device.
Background
With the development of internet technology, and of online search technology in particular, every major e-commerce advertising platform has introduced its own advertisement delivery scheme. Because business information is complex, a search service often has to account for factors such as region, culture, and user group, and to improve the accuracy of advertisement push results the click rate of users needs to be analyzed and estimated.
The click rate estimation systems in current use mainly compute the text similarity of the search words entered by the user, determine a score for each candidate advertisement according to a preset scoring rule, and determine the push priority from the scores. In practice, however, an e-commerce advertising platform processes a large amount of user retrieval data every day, and users' retrieval needs shift with the market, so a preset scoring rule can hardly match users' retrieval goals at all times, and the advertisements finally pushed deviate considerably from what users expect. For example, when a user searches for "cell phone", candidate advertisements A and B are shown; according to text similarity and the preset scoring rule, advertisement A scores higher than advertisement B, so advertisement A is necessarily displayed ahead of advertisement B. However, because of a temporary sales promotion or a fast-moving marketing channel such as WeChat marketing, advertisement B better meets users' retrieval needs and more users choose to click on it.
Therefore, the prior-art scheme of pushing advertisements through text similarity calculation and preset scoring rules has a large estimation error, and the accuracy of the pushed advertisements is correspondingly low.
Disclosure of Invention
The embodiment of the invention provides a data analysis method and device, which can correct the estimated error and improve the accuracy of the pushed data.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a data analysis method, including:
extracting log information of a push server, and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;
updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;
and determining a pushing result according to the updated weight value of each dimension.
With reference to the first aspect, in a first possible implementation manner of the first aspect, one sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search word, and an advertisement auction word, and the user click information indicates whether the user clicked the displayed advertisement.
With reference to the first aspect, in a second possible implementation manner of the first aspect, the updating, by using the two mutually exclusive sample sets, the weight value of each dimension through a logistic regression online learning algorithm includes:
acquiring, according to the two mutually exclusive sample sets, the click value of the first sample set
Figure BDA0000832181120000021
and the click value of the second sample set
Figure BDA0000832181120000022
wherein the two mutually exclusive sample sets are denoted $(I_{click}, I_{noclick})$, $x$ represents the identification value of one dimension, and $w$ represents the influence coefficient of that dimension on the overall click;
obtaining a loss function $l_t(w_t) = y_t \log p_t + (1 - y_t)\log(1 - p_t)$ according to the click value of the first sample set and the click value of the second sample set, and obtaining a gradient function $grad = p_t - y_t$ according to the loss function, wherein $y_t$ represents the actual click value, $t$ represents the sample index, and $1 - y_t$ represents the actual non-click value;
and updating the weight value of each dimension according to the gradient function.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the updating the weight values of the respective dimensions according to the gradient function includes:
obtaining the gradient value $g_t$ of each sample element according to the gradient function, the click value of the first sample set, and the click value of the second sample set;
updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$, and updating the learning rate of each dimension according to the learning rate formula
Figure BDA0000832181120000031
wherein $n_{t-1}$ represents the sum of the Euclidean distances of the gradients from the 1st sample to the (t-1)-th sample;
updating the weight value of each dimension according to the updated learning rate and Euclidean distance
Figure BDA0000832181120000032
wherein $\omega$ represents the weight set formed by the weight values of the dimensions, and $\alpha$ and $\beta$ are manually adjusted parameters.
With reference to the third possible implementation manner of the first aspect, the method further includes:
obtaining, according to the weight set, the accumulated sum of the weights of the dimensions, and applying the logistic regression formula
Figure BDA0000832181120000033
to obtain the click rate value of each sample element.
In a second aspect, an embodiment of the present invention provides a data analysis device, including: a reading module, used for extracting log information of the push server and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;
the weight updating module is used for updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;
and the pushing module is used for determining a pushing result according to the updated weight value of each dimension.
With reference to the second aspect, in a first possible implementation manner of the second aspect, one sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search word, and an advertisement auction word, and the user click information indicates whether the user clicked the displayed advertisement.
The data analysis method and the data analysis device provided by the embodiment of the invention can analyze the log information of data interaction between the user and the push server, update the weight values of all dimensions of the pushed data in real time, and re-determine the push result according to the updated weight values of all dimensions. Compared with the prior art, the method and the device can update the weighted value in real time, so that the estimated error is corrected, and the accuracy of the pushed data is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1a and 1b are schematic diagrams of specific application scenarios provided in an embodiment of the present invention;
FIG. 2 is a flow chart of a data analysis method provided by an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The method flow in the embodiment of the present invention may be executed by a server that undertakes a data push function, referred to in this embodiment as a push server. For example, fig. 1a shows a push server according to an embodiment of the invention. The push server comprises an input unit, a processor unit, an output unit, a communication unit, a storage unit, a peripheral unit and other components. These components communicate over one or more buses. It will be appreciated by those skilled in the art that the push server configuration shown in the figure does not limit the invention: the configuration may be a bus or star configuration, and may include more or fewer components than those shown, combine some components, or arrange the components differently.
The input unit is used for realizing interaction between an operator or technician and the push server and/or for inputting information into the push server. For example, the input unit may receive numerical or character information entered by an operator or technician and generate signal inputs related to the operator's or technician's settings or to function control. In the embodiment of the present invention, the input unit may be a touch panel, another human-computer interaction interface, or another external information capturing device.
The processor unit is a control center of the push server, connects each part of the whole push server by using various interfaces and lines, and executes various functions and/or processes data of the push server by running or executing software programs and/or modules stored in the storage unit and calling data stored in the storage unit. The processor unit may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, the Processor Unit may include only a Central Processing Unit (CPU), or may be a combination of a GPU, a Digital Signal Processor (DSP), and a control chip (e.g., a baseband chip) in the communication Unit. In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.
The communication unit is used for establishing a communication channel through which the push server connects to other server devices or communicates with the user terminal over a wired or wireless network. For example, the push server accesses the mobile wireless network through an interface and sends the advertisement content, or the push information (URL) of the advertisement to be pushed, to the user terminal through the mobile wireless network. In different embodiments of the present invention, the communication modules in the communication unit generally take the form of integrated circuit chips and may be selectively combined; not all communication modules and corresponding antenna groups need be included. For example, the communication unit may include only a baseband chip, a radio frequency chip, and a corresponding antenna to provide communication functionality in a cellular communication system. Through a wireless communication connection established by the communication unit, such as wireless local area network access or WCDMA access, the push server may be connected to a cellular network or the Internet. In some alternative embodiments of the present invention, a communication module in the communication unit, such as the baseband module, may be integrated into the processor unit, as is typical of the APQ+MDM family of platforms provided by Qualcomm.
The output unit includes, but is not limited to, an image output unit and a sound output unit. The image output unit is used for outputting characters, pictures and/or videos. The image output unit may include a display panel.
The storage unit can be used for storing software programs and modules, and the processor unit executes the various functional applications of the push server and performs data processing by running the software programs and modules stored in the storage unit. The storage unit mainly includes a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function, such as a data analysis program for executing the method flow of this embodiment and a monitoring program for correcting the weight values of the dimensions in real time; for example, the reading module, the weight updating module, and the pushing module shown in fig. 3 may be stored in the storage unit in the form of program code and run by the processor unit.
The data storage area may store data created through use of the push server; in this embodiment, the information recorded in the storage unit about the interaction state between the user and the push server is referred to as log information. In an embodiment of the invention, the storage unit may include random access memory, such as nonvolatile random access memory (NVRAM), phase-change random access memory (PRAM), or magnetoresistive random access memory (MRAM), and may further include other nonvolatile memory, such as at least one magnetic disk storage device, electrically erasable programmable read-only memory (EEPROM), or a flash memory device such as NOR flash or NAND flash.
The power supply is used for supplying power to different components of the push server to maintain the operation of the push server. As a general understanding, the power source may be a built-in battery, such as a common lithium ion battery, a nickel metal hydride battery, and the like, and also include an external power source that directly supplies power to the push server, such as an AC adapter, and the like. In some embodiments of the present invention, the power supply may be more broadly defined and may include, for example, a power management system, a charging system, a power failure detection circuit, a power converter or inverter, a power status indicator (e.g., a light emitting diode), and any other components associated with the generation, management, and distribution of power to a push server.
It should be noted that, in a similar scheme of this embodiment, the method flow in the embodiment of the present invention may also be executed by a server group that undertakes a data pushing function, for example: as shown in fig. 1b, the server a is configured to execute the method flow of the present embodiment and determine a push result according to the updated weight values of the respective dimensions, the server b is configured to send the push result to the terminal device of the user through the internet, and a device group consisting of the server a and the server b undertakes a data push function.
The data analysis method provided by the embodiment of the invention, as shown in fig. 2, includes:
101, extracting the log information of the push server, and acquiring two mutually exclusive sample sets according to the log information.
In a specific application of this embodiment, such as an e-commerce platform, one sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search word, and an advertisement auction word, and the user click information indicates whether the user clicked the displayed advertisement.
The sample elements of each sample set include information of at least two dimensions and mutual exclusion information, and the sample elements within one sample set carry mutual exclusion information with the same content. For example, the two sample sets are $I_{click}$ and $I_{noclick}$, where $I_{click}$ represents the set of advertisements clicked by users and its sample elements may be expressed in the form {user code; commodity code; user search word; advertisement content; click}. The "user code" may be information that identifies the user, such as the MAC address of the user's terminal device or the user account; the "commodity code" may be the code of the commodity type that the user's search word points to; the "user search word" may be the search word entered by the user on the e-commerce platform interface (that is, the user enters the search word on the terminal device, and the terminal device sends it to the push server or a search server); the "advertisement content" may be the text information of the advertisement pushed to the user; and "click" indicates that the user clicked the pushed advertisement, whereas "noclick" indicates that the user did not click it. "click" and "noclick" serve as the mutual exclusion information: the sample elements are screened according to this information and divided into the two sample sets $I_{click}$ and $I_{noclick}$.
It should be noted that the information of each dimension and the mutual exclusion information may be extracted by the push server from log information on the storage device, where the log information records the interaction state between the user and the push server. For example, a log collection function is deployed in the existing retrieval system to collect the original display log information from the online system; the collected logs are parsed and spliced, information of multiple dimensions such as the user code, commodity code, user search word, and advertisement auction word is extracted, and whether each advertisement was clicked/triggered is recorded, forming a plaintext sample of the advertisement that serves as a sample element. Specifically: the push server obtains the query word of each user search together with the corresponding displayed advertisements and clicked advertisements; splits the search words and displayed advertisements of one session so that each search word is paired with its advertisements; splices in the click information of the advertisements to form (displayed, clicked) or (displayed, not clicked) records; and then, according to the interaction history between the push server and the users' terminal devices, selects the dimensions that influence (displayed, clicked) outcomes to obtain the sample elements. The interaction history may cover a large number of users, or a user group or a single user within a certain period of time, and the specific sampling rule may vary with the application scenario.
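The splitting of log records into the two mutually exclusive sample sets can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tab-separated record layout, the `SampleElement` class, and the function names are all hypothetical, and a real display log would need the parsing and splicing steps described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleElement:
    """One sample element; the field layout is an assumption for illustration."""
    user_code: str
    item_code: str
    search_term: str
    ad_content: str
    clicked: bool  # mutual exclusion information: True = "click", False = "noclick"


def parse_log_line(line: str) -> SampleElement:
    """Parse one hypothetical tab-separated display-log record."""
    user_code, item_code, search_term, ad_content, flag = line.rstrip("\n").split("\t")
    return SampleElement(user_code, item_code, search_term, ad_content, flag == "click")


def split_samples(lines):
    """Partition parsed records into the two sets (I_click, I_noclick)."""
    i_click, i_noclick = [], []
    for line in lines:
        sample = parse_log_line(line)
        (i_click if sample.clicked else i_noclick).append(sample)
    return i_click, i_noclick


# Made-up records mirroring the "cell phone" example in the background section.
log_lines = [
    "u1\tc100\tcell phone\tad A\tnoclick",
    "u1\tc100\tcell phone\tad B\tclick",
    "u2\tc200\tlaptop\tad C\tnoclick",
]
i_click, i_noclick = split_samples(log_lines)
```

Every record lands in exactly one of the two lists, which is what makes the sets mutually exclusive.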
102, updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm.
103, determining a pushing result according to the updated weight values of the dimensions.
In this embodiment, the weight values are the weight values corresponding to the respective dimensions of a sample element. One sample element represents one pushed advertisement, which is described by a user code, a commodity code, a user search word, advertisement content, or further dimensions. When the push priority of the advertisement represented by a sample element is determined, a score is calculated from the dimensions and their weight values, and the push priority order among different advertisements is determined from the scores. This embodiment updates the weight values of the dimensions through the logistic regression online learning algorithm; the specific weight calculation and scoring rule are not limited.
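Since the text leaves the scoring rule open, the following sketch assumes one simple possibility: each dimension value is encoded as a string key, the score of an advertisement is the sum of the weights of its keys, and advertisements are pushed in descending score order. The keys, weight values, and function names are hypothetical.

```python
def score(features, weights):
    """Sum the weights of the dimension values present in one candidate ad."""
    return sum(weights.get(f, 0.0) for f in features)


def rank_ads(candidates, weights):
    """Return ad names ordered by descending score (highest push priority first)."""
    return sorted(candidates, key=lambda name: score(candidates[name], weights), reverse=True)


# Hypothetical weights after online updates: ad B has been clicked more often.
weights = {"search_term=cell phone": 0.4, "ad=A": 0.1, "ad=B": 0.9}
candidates = {
    "A": ["search_term=cell phone", "ad=A"],
    "B": ["search_term=cell phone", "ad=B"],
}
ranking = rank_ads(candidates, weights)
```

With these made-up weights advertisement B outranks advertisement A, which is the behavior the background section says a fixed scoring rule cannot deliver.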
In the prior art, the prediction means such as text similarity calculation and scoring rule setting are difficult to effectively adapt to the actual operation of a user, and a scheme for further accurate prediction according to the operation feedback of the user is lacked. The data analysis method provided by the embodiment of the invention can analyze the log information of data interaction between the user and the push server, update the weight value of each dimensionality of the pushed data in real time, and re-determine the push result according to the updated weight value of each dimensionality. Compared with the prior art, the method and the device can update the weighted value in real time, so that the estimated error is corrected, and the accuracy of the pushed data is improved.
In this embodiment, the process of updating the weight value of each dimension by using the two mutually exclusive sample sets and using a logistic regression online learning algorithm specifically includes:
acquiring, according to the two mutually exclusive sample sets, the click value of the first sample set
Figure BDA0000832181120000091
and the click value of the second sample set
Figure BDA0000832181120000092
wherein the two mutually exclusive sample sets are denoted $(I_{click}, I_{noclick})$, $x$ represents the identification value of one dimension, and $w$ represents the influence coefficient of that dimension on the overall click. For example, the samples are divided into two sets, $I_{click}$ and $I_{noclick}$, according to whether the user clicked the advertisement. Because many dimensions affect the advertisement click rate in each set, and they may change dynamically without limitation, both sets may be represented in the form $I_{click} = \sum wx$, where $x$ represents the identification value of one dimension that affects advertisement clicks and $w$ represents the influence coefficient of that dimension on the overall click. Assume that the actual click value of a sample element is $y_t$, so that the actual non-click value is $1 - y_t$; then, according to the logistic regression function, the estimated click value $p_t$ of the sample element is
Figure BDA0000832181120000101
and the estimated non-click value $1 - p_t$ is
Figure BDA0000832181120000102
A loss function $l_t(w_t) = y_t \log p_t + (1 - y_t)\log(1 - p_t)$ is obtained according to the click value of the first sample set and the click value of the second sample set, and a gradient function $grad = p_t - y_t$ is obtained by differentiating the loss function, wherein $y_t$ represents the actual click value, $t$ represents the sample index, and $1 - y_t$ represents the actual non-click value.
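The estimate, loss, and gradient for a single sample can be sketched as below. Two assumptions are made and should be read as such: the formula images for the estimated click value are not reproduced in this text, so the standard logistic function of $\sum wx$ is used; and the loss is written in its negated form (the usual log loss), because $p_t - y_t$ is the derivative of that negated expression with respect to $\sum wx$. Function and feature names are hypothetical.

```python
import math


def predict(weights, x):
    """Estimated click value p = 1 / (1 + exp(-sum(w_i * x_i))) (assumed standard form)."""
    z = sum(weights.get(k, 0.0) * v for k, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(p, y):
    """Negative log-likelihood of one sample with actual click value y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def gradient(p, y, x):
    """Per-dimension gradient of log_loss with respect to each weight: (p - y) * x_i."""
    return {k: (p - y) * v for k, v in x.items()}


w = {"user=u1": 0.0, "term=cell phone": 0.0}
x = {"user=u1": 1.0, "term=cell phone": 1.0}
p = predict(w, x)      # all-zero weights give p = 0.5
g = gradient(p, 1, x)  # clicked sample: every component is 0.5 - 1 = -0.5
```

A negative gradient component for a clicked sample means a descent step raises the corresponding weights, pushing the estimate toward the observed click.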
And updating the weight value of each dimension according to the gradient function.
The push server can obtain a gradient value according to the gradient function, accumulate the square of the gradient to obtain the sum of variances of the estimation deviations, $n_t = n_{t-1} + grad^2$, and from it obtain the iterative learning rate of each dimension, where $\alpha$ and $\beta$ are manually adjusted parameters that can be set by a technician. According to the iterative learning rate and the sum of variances of the estimation deviations, the weight value of each dimension is updated by
Figure BDA0000832181120000104
wherein sgn is the sign function: $sgn(x) = 1$ if $x > 0$, $sgn(x) = 0$ if $x = 0$, and $sgn(x) = -1$ if $x < 0$.
In this embodiment, the updating the weight values of each dimension according to the gradient function specifically includes:
obtaining the gradient value $g_t$ of each sample element according to the gradient function, the click value of the first sample set, and the click value of the second sample set;
updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$, and updating the learning rate of each dimension according to the learning rate formula
Figure BDA0000832181120000111
wherein $n_{t-1}$ represents the sum of the Euclidean distances of the gradients from the 1st sample to the (t-1)-th sample, and $\alpha$ and $\beta$ are manually adjusted parameters;
updating the weight value of each dimension according to the updated learning rate and Euclidean distance
Figure BDA0000832181120000112
where $\omega$ represents the weight set formed by the weight values of the dimensions.
In this embodiment, a locally optimal solution of the logistic regression may be obtained by stochastic gradient descent; that is, the gradient value $g_t$ of each sample element is obtained from the gradient function, the actual click value of the sample element, and the estimated click value of the sample element. For example, the Euclidean distance of each dimension is updated according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$;
the iterative learning rate of each dimension is updated according to the learning rate formula
Figure BDA0000832181120000113
and the updated actual weight value is obtained according to the iterative learning rate and the Euclidean distance
Figure BDA0000832181120000114
where $\omega$ represents the weight set formed by the weight values of the dimensions.
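The per-dimension update loop can be sketched as follows. The learning rate and weight update formulas are present only as images in this text, so the sketch substitutes commonly used forms and labels them as assumptions: a per-dimension rate $\alpha / (\beta + \sqrt{n_t})$ and a plain descent step $w \leftarrow w - \eta\, g_t$. The update in the patent may differ (the sign function mentioned earlier suggests it might). The class name and default parameter values are hypothetical.

```python
import math
from collections import defaultdict


class OnlineLogisticRegression:
    """Online logistic regression with one accumulator n per dimension."""

    def __init__(self, alpha=0.1, beta=1.0):
        self.alpha = alpha           # manually adjusted parameter
        self.beta = beta             # manually adjusted parameter
        self.w = defaultdict(float)  # weight value per dimension
        self.n = defaultdict(float)  # accumulated squared gradient per dimension

    def predict(self, x):
        z = sum(self.w[k] * v for k, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Process one sample: x maps dimension -> value, y is 1 (click) or 0 (noclick)."""
        p = self.predict(x)
        for k, v in x.items():
            g = (p - y) * v                                        # g_t for this dimension
            self.n[k] += g * g                                     # n_t = n_{t-1} + g_t^2
            eta = self.alpha / (self.beta + math.sqrt(self.n[k]))  # ASSUMED learning-rate form
            self.w[k] -= eta * g                                   # ASSUMED plain descent step
        return p


model = OnlineLogisticRegression(alpha=0.1, beta=1.0)
for _ in range(50):
    model.update({"ad=B": 1.0}, 1)  # ad B keeps getting clicked
    model.update({"ad=A": 1.0}, 0)  # ad A keeps being ignored
p_b = model.predict({"ad=B": 1.0})
p_a = model.predict({"ad=A": 1.0})
```

Because each dimension keeps its own accumulator, frequently seen dimensions take smaller steps over time while rarely seen ones can still move quickly.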
In this embodiment, after the weight set is acquired, the method further includes: obtaining, according to the weight set, the accumulated sum of the weights of the dimensions, and applying the logistic regression formula
Figure BDA0000832181120000121
to obtain the click rate value of each sample element. In the specific application scenario of this embodiment, the log information of users' advertisement clicks is mined and analyzed and the weight values of the dimensions are updated in real time; after the logistic regression online gradient algorithm was adopted, the estimation accuracy measured by AUC (a ranking evaluation metric) rose from the original 0.65 to 0.79. The estimation error is thereby reduced, and the accuracy of advertisement pushing is improved.
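For readers unfamiliar with AUC, it can be read as the fraction of (clicked, not clicked) sample pairs in which the clicked sample receives the higher estimated click rate. The sketch below computes it by direct pair counting; the labels and scores are made up for illustration and are unrelated to the 0.65 and 0.79 figures reported above.

```python
def auc(labels, scores):
    """Fraction of (clicked, not-clicked) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one clicked and one non-clicked sample")
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


labels = [1, 0, 1, 0]          # actual click values
scores = [0.9, 0.3, 0.4, 0.6]  # estimated click rate values
result = auc(labels, scores)   # 3 of the 4 pairs are ordered correctly
```

Pair counting is quadratic in the number of samples, so it suits small checks; a sort-based computation would be the usual choice for large logs.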
The present embodiment also provides a data analysis apparatus, as shown in fig. 3, including:
the reading module is used for extracting log information of the push server and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;
the weight updating module is used for updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;
and the pushing module is used for determining a pushing result according to the updated weight value of each dimension.
Each sample set includes business information of at least two dimensions and user click information. The types of the business information include at least: a user code, a commodity code, a user search term, and an advertisement bid term. The user click information indicates whether the user clicked the displayed advertisement.
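The work of the reading module can be sketched as splitting log records into the two mutually exclusive sample sets. The record layout used here (the keys `user_code`, `commodity_code`, `search_word`, `bid_word`, and `clicked`) is hypothetical; the source does not specify a log format, and the one-hot encoding of each dimension is likewise an assumption.

```python
# Hypothetical names for the four business dimensions listed above.
DIMENSIONS = ("user_code", "commodity_code", "search_word", "bid_word")


def build_sample_sets(log_records):
    """Split log records into (clicked, not_clicked) sample sets.

    Each sample element is a pair (features, label), where features maps a
    "dimension=value" string to 1.0 and label is the click information.
    """
    clicked, not_clicked = [], []
    for rec in log_records:
        features = {f"{name}={rec[name]}": 1.0 for name in DIMENSIONS}
        label = 1 if rec["clicked"] else 0
        target = clicked if label == 1 else not_clicked
        target.append((features, label))
    return clicked, not_clicked


records = [
    {"user_code": "u1", "commodity_code": "c9", "search_word": "phone",
     "bid_word": "smartphone", "clicked": True},
    {"user_code": "u2", "commodity_code": "c9", "search_word": "phone",
     "bid_word": "smartphone", "clicked": False},
]
clicked_set, not_clicked_set = build_sample_sets(records)
```

Every element of one set carries the same click label, so the two sets are mutually exclusive by construction.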
The data analysis device provided by the embodiment of the invention can analyze the log information of data interaction between the user and the push server, update the weight value of each dimension of the pushed data in real time, and re-determine the push result according to the updated weight values. Compared with the prior art, the device updates the weight values in real time, so that the estimation error is corrected and the accuracy of the pushed data is improved.
The embodiments in the present specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively briefly because it is substantially similar to the method embodiment, and reference may be made to the description of the method embodiment for the relevant points.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (2)

1. A method of data analysis, comprising:
extracting log information of a push server, and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;
updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;
determining a pushing result according to the updated weight value of each dimension;
wherein the push server obtains the query term of each search by the user together with the corresponding displayed advertisements and clicked advertisements; the search terms and displayed advertisements of one session are split so that each search term is associated with its corresponding advertisements; the click information of the advertisements is joined in to form a set of (shown, clicked) advertisements or a set of (shown, not clicked) advertisements; and the dimensions that influence the (shown, clicked) advertisements are then selected according to the interaction history between the push server and the terminal device of the user, to obtain the sample elements;
each sample set includes business information of at least two dimensions and user click information, the types of the business information including at least: a user code, a commodity code, a user search term, and an advertisement bid term, wherein the user click information indicates whether the user clicked the displayed advertisement;
the updating of the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm comprises:
acquiring click values of a first sample set according to the two mutually exclusive sample sets
Figure FDA0002662337260000011
And click value of the second sample set
Figure FDA0002662337260000012
wherein the two mutually exclusive sample sets are represented as $(I_{click}, I_{noclick})$, x represents an identification value of one dimension, and w represents the influence coefficient of that dimension on the overall click;
obtaining a loss function according to the click value of the first sample set and the click value of the second sample set
$l_t(w_t) = y_t \log p_t + (1 - y_t) \log(1 - p_t)$, and obtaining a gradient function $grad = p_t - y_t$ according to the loss function, wherein $y_t$ represents the actual click value, t represents the sample number, and $1 - y_t$ represents the actual non-click value;
updating the weight value of each dimension according to the gradient function;
the updating the weight values of the dimensions according to the gradient function includes:
obtaining the gradient value $g_t$ of each sample element according to the gradient function, the click value of the first sample set, and the click value of the second sample set;
updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$, and, according to the learning rate formula
Figure FDA0002662337260000021
updating the learning rate of each dimension, wherein $n_{t-1}$ represents the sum of the Euclidean distances of the gradients from the 1st sample to the (t-1)th sample;
updating the weight value of each dimensionality according to the updated learning rate and Euclidean distance
Figure FDA0002662337260000022
wherein ω represents the weight set composed of the weight values of all dimensions, α and β respectively represent manually adjusted parameters, and $\sigma_t$ represents the updated learning rate;
further comprising:
obtaining the accumulated weighted sum of the dimensions according to the weight set, and applying the logistic regression formula
Figure FDA0002662337260000031
to obtain the click-through rate value of each sample element.
2. A data analysis apparatus, comprising:
the reading module is used for extracting log information of the push server and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;
the weight updating module is used for updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;
the pushing module is used for determining a pushing result according to the updated weight value of each dimension;
wherein the push server obtains the query term of each search by the user together with the corresponding displayed advertisements and clicked advertisements; the search terms and displayed advertisements of one session are split so that each search term is associated with its corresponding advertisements; the click information of the advertisements is joined in to form a set of (shown, clicked) advertisements or a set of (shown, not clicked) advertisements; and the dimensions that influence the (shown, clicked) advertisements are then selected according to the interaction history between the push server and the terminal device of the user, to obtain the sample elements;
each sample set includes business information of at least two dimensions and user click information, the types of the business information including at least: a user code, a commodity code, a user search term, and an advertisement bid term, wherein the user click information indicates whether the user clicked the displayed advertisement;
the updating of the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm comprises:
acquiring click values of a first sample set according to the two mutually exclusive sample sets
Figure FDA0002662337260000032
And click value of the second sample set
Figure FDA0002662337260000033
wherein the two mutually exclusive sample sets are represented as $(I_{click}, I_{noclick})$, x represents an identification value of one dimension, and w represents the influence coefficient of that dimension on the overall click;
obtaining a loss function according to the click value of the first sample set and the click value of the second sample set
$l_t(w_t) = y_t \log p_t + (1 - y_t) \log(1 - p_t)$, and obtaining a gradient function $grad = p_t - y_t$ according to the loss function, wherein $y_t$ represents the actual click value, t represents the sample number, and $1 - y_t$ represents the actual non-click value;
updating the weight value of each dimension according to the gradient function;
the updating the weight values of the dimensions according to the gradient function includes:
obtaining the gradient value $g_t$ of each sample element according to the gradient function, the click value of the first sample set, and the click value of the second sample set;
updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$, and, according to the learning rate formula
Figure FDA0002662337260000041
updating the learning rate of each dimension, wherein $n_{t-1}$ represents the sum of the Euclidean distances of the gradients from the 1st sample to the (t-1)th sample;
updating the weight value of each dimensionality according to the updated learning rate and Euclidean distance
Figure FDA0002662337260000042
wherein ω represents the weight set composed of the weight values of all dimensions, and α and β respectively represent manually adjusted parameters;
further comprising:
obtaining the accumulated weighted sum of the dimensions according to the weight set, and applying the logistic regression formula
Figure FDA0002662337260000051
to obtain the click-through rate value of each sample element.
CN201510713052.8A 2015-10-28 2015-10-28 Data analysis method and device Active CN106651408B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510713052.8A CN106651408B (en) 2015-10-28 2015-10-28 Data analysis method and device

Publications (2)

Publication Number Publication Date
CN106651408A CN106651408A (en) 2017-05-10
CN106651408B true CN106651408B (en) 2020-12-25

Family

ID=58816198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510713052.8A Active CN106651408B (en) 2015-10-28 2015-10-28 Data analysis method and device

Country Status (1)

Country Link
CN (1) CN106651408B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110634006B (en) * 2018-06-22 2024-03-19 阿里巴巴(中国)有限公司 Advertisement click rate prediction method, device, equipment and readable storage medium
CN112613904A (en) * 2020-12-16 2021-04-06 中国建设银行股份有限公司 Tail pasting information pushing method and device
CN112836967B (en) * 2021-02-03 2022-07-08 武汉理工大学 New energy automobile battery safety risk assessment system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054199A (en) * 2010-12-31 2011-05-11 中国人民解放军63983部队 BP (Back Propagation) neural network algorithm based method for analyzing coating aging
CN104536983A (en) * 2014-12-08 2015-04-22 北京掌阔技术有限公司 Method and device for predicting advertisement click rate
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于内容广告平台的点击率预估系统的设计与实现";祁全昌;《中国优秀硕士学位论文全文数据库 信息科技辑(月刊 )》;20150315;I138-1314, 正文第1-38页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201202

Address after: 210042 No. 1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing City, Jiangsu Province

Applicant after: Suning Cloud Computing Co.,Ltd.

Address before: 210042 Nanjing Province, Xuanwu District, Jiangsu Suning Avenue, Suning headquarters, No. 1

Applicant before: SUNING COMMERCE GROUP Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210508

Address after: 210042 no.1-9 Suning Avenue, Xuanwu District, Nanjing City, Jiangsu Province (Jiangsu Province)

Patentee after: NANJING SUNING ELECTRONIC INFORMATION TECHNOLOGY Co.,Ltd.

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210042

Patentee before: Suning Cloud Computing Co.,Ltd.