CN106651408B

CN106651408B - Data analysis method and device

Info

Publication number: CN106651408B
Application number: CN201510713052.8A
Authority: CN
Inventors: 张研; 杨冠军; 蒋程诚
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: NANJING SUNING ELECTRONIC INFORMATION TECHNOLOGY Co.,Ltd.
Priority date: 2015-10-28
Filing date: 2015-10-28
Publication date: 2020-12-25
Anticipated expiration: 2035-10-28
Also published as: CN106651408A

Abstract

The embodiment of the invention discloses a data analysis method and device, relates to the technical field of internet, and can correct the estimation error and improve the accuracy of the pushed data. The method of the invention comprises the following steps: extracting log information of a push server, and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content; updating the weight value of each dimension by using two mutually exclusive sample sets through a logistic regression online learning algorithm; and determining a pushing result according to the updated weight value of each dimension. The method and the device are suitable for improving the accuracy of the pushed content.

Description

Data analysis method and device

Technical Field

The invention relates to the technical field of internet, in particular to a data analysis method and device.

Background

With the development of internet technology, especially online search technology, each major e-commerce advertising platform has introduced its own advertisement delivery scheme. Because business information is complex, a search service often has to take into account factors such as region, culture, and user group, and the click rate of users needs to be analyzed and estimated in order to improve the accuracy of the advertisement push result.

In the click rate estimation systems adopted at present, text similarity is calculated mainly on the search terms input by the user, the score of each candidate advertisement is determined according to a preset scoring rule, and the pushing priority is determined according to the scores. In practical applications, however, an e-commerce advertising platform has to process a large amount of user retrieval data every day, and users' retrieval needs are often influenced by dynamic changes in the market. A preset scoring rule can hardly match users' retrieval targets at all times, so the advertisements finally pushed to users deviate considerably from their expectations. For example, when a user searches for 'cell phone', candidate advertisements A and B are shown; because advertisement A scores higher than advertisement B under the text similarity and the preset scoring rule, advertisement A is necessarily ranked above advertisement B. However, owing to a temporary sales promotion or a fast-moving marketing channel such as WeChat marketing, advertisement B may better meet users' retrieval needs, and more users choose to click advertisement B.

Therefore, the scheme of pushing the advertisement through text similarity calculation and scoring rule setting in the prior art has lower accuracy of the pushed advertisement due to larger estimation error.

Disclosure of Invention

The embodiment of the invention provides a data analysis method and device, which can correct the estimated error and improve the accuracy of the pushed data.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

in a first aspect, an embodiment of the present invention provides a data analysis method, including:

extracting log information of a push server, and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;

updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;

and determining a pushing result according to the updated weight value of each dimension.

With reference to the first aspect, in a first possible implementation manner of the first aspect, a sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search term, and an advertisement auction word, and the user click information is used for indicating whether the user clicked the displayed advertisement.

With reference to the first aspect, in a second possible implementation manner of the first aspect, the updating, by using the two mutually exclusive sample sets, the weight value of each dimension through a logistic regression online learning algorithm includes:

acquiring, according to the two mutually exclusive sample sets, the click value of the first sample set and the click value of the second sample set, each expressed as a weighted sum $\sum wx$,

wherein the two mutually exclusive sample sets are represented as $(I_{click}, I_{noclick})$, $x$ represents an identification value of one dimension, and $w$ represents the influence coefficient of that dimension on the overall click;

obtaining a loss function according to the click value of the first sample set and the click value of the second sample set,

$l_t(w_t) = y_t \log p_t + (1 - y_t) \log(1 - p_t)$, and obtaining a gradient function $grad = p_t - y_t$ according to the loss function, wherein $y_t$ represents the actual click value, $t$ represents the sample number, and $1 - y_t$ represents the actual non-click value;

and updating the weight value of each dimension according to the gradient function.

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner, the updating the weight values of the respective dimensions according to the gradient function includes:

obtaining the gradient value $g_t$ of each sample element according to the gradient function, the click value of the first sample set and the click value of the second sample set;

updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$, and updating the learning rate of each dimension according to the learning rate formula, wherein $n_{t-1}$ represents the sum of the Euclidean distances of the gradients from the 1st sample to the (t-1)-th sample;

updating the weight value of each dimension according to the updated learning rate and Euclidean distance,

where $\omega$ represents the weight set formed by the weight values of the respective dimensions, and $\alpha$ and $\beta$ respectively represent manually adjusted parameters.

With reference to the third possible implementation manner of the first aspect, the method further includes:

acquiring the accumulated weight sum of the dimensions according to the weight set, and obtaining the click rate value of each sample element through the logistic regression formula.

In a second aspect, an embodiment of the present invention provides a data analysis device, including:

the reading module is used for extracting log information of the push server and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;

the weight updating module is used for updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;

and the pushing module is used for determining a pushing result according to the updated weight value of each dimension.

With reference to the second aspect, in a first possible implementation manner of the second aspect, a sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search term, and an advertisement auction word, and the user click information is used for indicating whether the user clicked the displayed advertisement.

The data analysis method and the data analysis device provided by the embodiment of the invention can analyze the log information of data interaction between the user and the push server, update the weight values of all dimensions of the pushed data in real time, and re-determine the push result according to the updated weight values of all dimensions. Compared with the prior art, the method and the device can update the weighted value in real time, so that the estimated error is corrected, and the accuracy of the pushed data is improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1a and FIG. 1b are schematic diagrams of specific application scenarios provided in an embodiment of the present invention;

FIG. 2 is a flow chart of a data analysis method provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a data analysis apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The method flow in the embodiment of the present invention may be executed by a server that undertakes a data push function, which in this embodiment may be referred to as a push server. For example, FIG. 1a shows a push server according to an embodiment of the invention. The push server comprises an input unit, a processor unit, an output unit, a communication unit, a storage unit, a peripheral unit and other components, which communicate over one or more buses. It will be appreciated by those skilled in the art that the push server configuration shown in the figure does not limit the invention: it may be a bus or star configuration, and it may include more or fewer components than those shown, combine some components, or arrange the components differently.

The input unit is used for realizing interaction between an operator or technician and the push server, and/or for inputting information into the push server. For example, the input unit may receive numerical or character information input by an operator or technician to generate signal inputs related to operator or technician settings or function control. In the embodiment of the present invention, the input unit may be a touch panel, another human-computer interaction interface, or another external information capturing device.

The processor unit is a control center of the push server, connects each part of the whole push server by using various interfaces and lines, and executes various functions and/or processes data of the push server by running or executing software programs and/or modules stored in the storage unit and calling data stored in the storage unit. The processor unit may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, the Processor Unit may include only a Central Processing Unit (CPU), or may be a combination of a GPU, a Digital Signal Processor (DSP), and a control chip (e.g., a baseband chip) in the communication Unit. In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.

The communication unit is used for establishing a communication channel through which the push server can connect to other server devices, or communicate with the user terminal over a wired or wireless network. For example, the push server accesses a mobile wireless network through an interface and sends the advertisement content, or the push information (URL) required for the advertisement, to the user terminal through the mobile wireless network. In different embodiments of the present invention, the communication modules in the communication unit generally take the form of integrated circuit chips and may be selectively combined; it is not necessary to include all communication modules and corresponding antenna groups. For example, the communication unit may comprise only a baseband chip, a radio frequency chip and a corresponding antenna to provide communication functionality in a cellular communication system. Through a wireless communication connection established by the communication unit, such as wireless local area network access or WCDMA access, the push server may be connected to a cellular network or the Internet. In some alternative embodiments of the present invention, a communication module in the communication unit, such as the baseband module, may be integrated into the processor unit; a typical example is the APQ+MDM series of platforms provided by Qualcomm.

The output unit includes, but is not limited to, an image output unit and a sound output unit. The image output unit is used for outputting characters, pictures and/or videos. The image output unit may include a display panel.

The storage unit can be used for storing software programs and modules, and the processing unit executes various functional applications of the push server and realizes data processing by running the software programs and modules stored in the storage unit. The storage unit mainly includes a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, such as a data analysis program for executing the method flow of the present embodiment, a monitoring program for correcting the weight values of the respective dimensions in real time, and the like, and for example, the reading module, the weight updating module, and the pushing module shown in fig. 3 may be stored in the storage unit in the form of program codes and run through the processor unit.

The data storage area may store data created through use of the push server. For example, in this embodiment the information stored in the storage unit that records the interaction state between the user and the push server may be referred to as log information. In an embodiment of the invention, the storage unit may include random access memory, such as Nonvolatile Random Access Memory (NVRAM), Phase Change Random Access Memory (PRAM), and Magnetoresistive Random Access Memory (MRAM), and may further include nonvolatile storage, such as at least one magnetic disk storage device, an Electrically Erasable Programmable Read-Only Memory (EEPROM), or a flash memory device such as NOR flash memory or NAND flash memory.

The power supply is used for supplying power to different components of the push server to maintain the operation of the push server. As a general understanding, the power source may be a built-in battery, such as a common lithium ion battery, a nickel metal hydride battery, and the like, and also include an external power source that directly supplies power to the push server, such as an AC adapter, and the like. In some embodiments of the present invention, the power supply may be more broadly defined and may include, for example, a power management system, a charging system, a power failure detection circuit, a power converter or inverter, a power status indicator (e.g., a light emitting diode), and any other components associated with the generation, management, and distribution of power to a push server.

It should be noted that, in a similar scheme of this embodiment, the method flow in the embodiment of the present invention may also be executed by a server group that undertakes a data pushing function, for example: as shown in fig. 1b, the server a is configured to execute the method flow of the present embodiment and determine a push result according to the updated weight values of the respective dimensions, the server b is configured to send the push result to the terminal device of the user through the internet, and a device group consisting of the server a and the server b undertakes a data push function.

The data analysis method provided by the embodiment of the invention, as shown in fig. 2, includes:

101, extracting the log information of the push server, and acquiring two mutually exclusive sample sets according to the log information.

In a specific application of this embodiment, such as an e-commerce platform, a sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search term, and an advertisement auction word, and the user click information is used for indicating whether the user clicked the displayed advertisement.

The sample elements of each sample set comprise information of at least two dimensions and mutual exclusion information, and the sample elements in the same sample set have mutual exclusion information with the same content. For example, the two sample sets are $I_{click}$ and $I_{noclick}$, where $I_{click}$ represents the set of advertisements clicked by users, and its sample elements may be represented in the form {user code; commodity code; user search term; advertisement content; click}. The "user code" may be information used for identifying the user, such as the MAC address of the user's terminal device or the user account. The "commodity code" may be the code of the commodity type pointed to by the search term input by the user. The "user search term" may be the search term that the user enters on the e-commerce platform interface (that is, the user inputs the search term on the terminal device, and the terminal device sends it to the push server or the search server). The "advertisement content" may be the text information of the advertisement pushed to the user. "click" indicates that the user has clicked the pushed advertisement, whereas "noclick" indicates that the user has not clicked the pushed advertisement. "click" and "noclick" serve as the mutual exclusion information: the sample elements are screened according to it and divided into the two sample sets $I_{click}$ and $I_{noclick}$.
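The division into two mutually exclusive sets can be illustrated with a minimal Python sketch. The field names (`user_code`, `commodity_code`, `search_term`, `ad_content`, `label`) and the function name `split_by_click` are hypothetical and chosen only for illustration; the source does not prescribe a record layout.

```python
from typing import Dict, List, Tuple

# Hypothetical flat record for one sample element; the keys are illustrative.
Sample = Dict[str, str]


def split_by_click(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Partition sample elements into (I_click, I_noclick) by the mutual exclusion field."""
    i_click: List[Sample] = []
    i_noclick: List[Sample] = []
    for s in samples:
        if s["label"] == "click":
            i_click.append(s)
        elif s["label"] == "noclick":
            i_noclick.append(s)
        else:
            # A sample element must carry exactly one of the two mutually exclusive values.
            raise ValueError(f"unexpected label: {s['label']!r}")
    return i_click, i_noclick


# Toy data, invented for the example.
samples = [
    {"user_code": "u1", "commodity_code": "c9", "search_term": "cell phone",
     "ad_content": "ad A", "label": "noclick"},
    {"user_code": "u1", "commodity_code": "c9", "search_term": "cell phone",
     "ad_content": "ad B", "label": "click"},
]
i_click, i_noclick = split_by_click(samples)
```

Every element of `i_click` carries the same mutual exclusion value ("click"), and every element of `i_noclick` carries "noclick", which matches the requirement that elements of one set share mutual exclusion information with the same content.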

It should be noted that the information of each dimension and the mutual exclusion information may be extracted by the push server from log information on the storage device, where the log information records the interaction state between the user and the push server. For example, a log acquisition function is deployed on the existing retrieval system to collect the original display log information online; the collected original display log information is parsed and spliced; information of multiple dimensions, such as the user code, commodity code, user search term, and advertisement auction word, is extracted; and whether the advertisement was clicked/triggered is recorded, forming a plaintext sample of the advertisement as a sample element. Specifically: the push server acquires the query term of each user search together with the corresponding displayed advertisements and clicked advertisements; splits the search terms and displayed advertisements of one session so that each search term is paired with its corresponding advertisements; splices in the click information of the advertisements to form (shown, clicked) or (shown, not clicked) advertisement sets; and then, according to the interaction history between the push server and the users' terminal devices, selects the dimensions that influence (shown, clicked) advertisements to obtain the sample elements. The interaction history between the push server and the users' terminal devices may be based on the interaction history of a large number of users, or may be that of a user group or a single user within a certain period of time; the specific sampling rule for the interaction history may change according to the application scenario.
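The splitting and splicing steps can be sketched as follows. This is only an illustration under assumed data shapes: the session structure (`queries`, `shown_ads`, `clicked_ads`) and the function name `build_samples` are hypothetical, and a real display log would need its own parsing.

```python
from typing import Dict, List


def build_samples(sessions: List[dict]) -> List[Dict[str, str]]:
    """Turn per-session display logs into one labelled record per (search term, shown ad)."""
    samples: List[Dict[str, str]] = []
    for session in sessions:
        # Split the session: handle each search term with its own displayed ads.
        for query in session["queries"]:
            clicked = set(query.get("clicked_ads", []))
            for ad in query["shown_ads"]:
                samples.append({
                    "user_code": session["user_code"],
                    "search_term": query["term"],
                    "ad_content": ad,
                    # Splice in click information: (shown, clicked) or (shown, not clicked).
                    "label": "click" if ad in clicked else "noclick",
                })
    return samples


# Toy session log, invented for the example.
sessions = [
    {
        "user_code": "u1",
        "queries": [
            {"term": "cell phone", "shown_ads": ["ad A", "ad B"], "clicked_ads": ["ad B"]},
        ],
    }
]
built = build_samples(sessions)
```

The sketch keeps only three dimensions per record; selecting which dimensions to keep based on interaction history is left out because the source does not specify the selection rule.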

102, updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm.

103, determining a pushing result according to the updated weight values of the dimensions.

In this embodiment, the weight values are the weight values corresponding to the respective dimensions of a sample element. One sample element actually represents one pushed advertisement, which is composed of a user code, a commodity code, a user search term, advertisement content, or further dimensions. When the pushing priority of the advertisement represented by a sample element is determined, the advertisement is scored by a weighted calculation over its dimensions and their weight values, and the pushing priority order among different advertisements is determined according to the scoring results. This embodiment updates the weight values of the dimensions through a logistic regression online learning algorithm; the specific manner of the weighted calculation and the scoring rule are not limited.
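Since the source leaves the weighted calculation and scoring rule open, the following Python sketch shows just one possible choice: each candidate advertisement is scored by the sum of the weights of its active dimension values and candidates are sorted by that score. The feature strings, the weight values, and the names `score` and `rank_ads` are assumptions made for the example.

```python
from typing import Dict, List


def score(features: List[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the active dimension values (unknown values count as 0)."""
    return sum(weights.get(f, 0.0) for f in features)


def rank_ads(candidates: Dict[str, List[str]], weights: Dict[str, float]) -> List[str]:
    """Return advertisement names ordered from highest to lowest score."""
    return sorted(candidates, key=lambda ad: score(candidates[ad], weights), reverse=True)


# Toy weights, invented for the example.
weights = {
    "search_term=cell phone": 0.2,
    "ad_content=ad A": -0.4,
    "ad_content=ad B": 0.9,
}
candidates = {
    "ad A": ["search_term=cell phone", "ad_content=ad A"],
    "ad B": ["search_term=cell phone", "ad_content=ad B"],
}
ranking = rank_ads(candidates, weights)
```

With these toy weights advertisement B ranks first, which mirrors the background example in which users prefer advertisement B even though a static rule favors advertisement A.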

In the prior art, the prediction means such as text similarity calculation and scoring rule setting are difficult to effectively adapt to the actual operation of a user, and a scheme for further accurate prediction according to the operation feedback of the user is lacked. The data analysis method provided by the embodiment of the invention can analyze the log information of data interaction between the user and the push server, update the weight value of each dimensionality of the pushed data in real time, and re-determine the push result according to the updated weight value of each dimensionality. Compared with the prior art, the method and the device can update the weighted value in real time, so that the estimated error is corrected, and the accuracy of the pushed data is improved.

In this embodiment, the process of updating the weight value of each dimension by using the two mutually exclusive sample sets and using a logistic regression online learning algorithm specifically includes:

acquiring, according to the two mutually exclusive sample sets, the click value of the first sample set and the click value of the second sample set,

wherein the two mutually exclusive sample sets are represented as $(I_{click}, I_{noclick})$, $x$ represents an identification value of one dimension, and $w$ represents the influence coefficient of that dimension on the overall click. For example, two sample sets are divided according to whether the user clicked the advertisement: $I_{click}$ and $I_{noclick}$. Because many dimensions affect the advertisement click rate in each set, and these dimensions may vary dynamically without limitation, each set may be represented as a weighted sum $\sum wx$, where $x$ denotes an identification value of one dimension that affects advertisement clicks and $w$ denotes the influence coefficient of that dimension on the overall click. Assume that the actual click value of a sample element is $y_t$; the actual non-click value is then $1 - y_t$. According to the logistic regression function, the estimated click value $p_t$ of the sample element can be obtained,

and the estimated non-click value is $1 - p_t$.
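The estimate can be sketched in Python under the assumption that the logistic regression function is the standard sigmoid applied to the weighted sum $\sum wx$; the source text does not reproduce the exact formula. The feature strings, the `weights` dictionary, and the name `predict_click` are illustrative, and each active dimension value is treated as $x = 1$.

```python
import math
from typing import Dict, List


def predict_click(features: List[str], weights: Dict[str, float]) -> float:
    """Estimated click value p = 1 / (1 + exp(-z)), with z the sum of active weights."""
    z = sum(weights.get(f, 0.0) for f in features)  # sum of w * x with x = 1
    # Numerically stable evaluation of the sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


p_click = predict_click(["ad_content=ad B"], {"ad_content=ad B": 2.0})
p_noclick = 1.0 - p_click  # estimated non-click value
```

An empty feature list gives an estimate of 0.5, that is, no evidence either way.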

obtaining a loss function $l_t(w_t) = y_t \log p_t + (1 - y_t) \log(1 - p_t)$ according to the click value of the first sample set and the click value of the second sample set, and obtaining a gradient function $grad = p_t - y_t$ according to the loss function (by taking the derivative of the loss function), wherein $y_t$ represents the actual click value, $t$ represents the sample number, and $1 - y_t$ represents the actual non-click value.
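The step from the loss to the gradient can be written out as a short derivation. It assumes the standard sigmoid $p_t = 1/(1 + e^{-z_t})$ with $z_t = \sum_i w_i x_i$, and it differentiates the negative of the expression above (the usual minimization convention, written $L_t$ here), since that is the sign under which the result equals $p_t - y_t$. The symbols $z_t$ and $L_t$ are introduced only for this derivation.

```latex
\begin{aligned}
z_t &= \sum_i w_i x_i, \qquad
p_t = \frac{1}{1 + e^{-z_t}}, \qquad
\frac{\partial p_t}{\partial z_t} = p_t (1 - p_t), \\
L_t &= -\bigl[\, y_t \log p_t + (1 - y_t) \log (1 - p_t) \,\bigr], \\
\frac{\partial L_t}{\partial z_t}
  &= -\Bigl[ \frac{y_t}{p_t} - \frac{1 - y_t}{1 - p_t} \Bigr] p_t (1 - p_t)
   = -\bigl[ y_t (1 - p_t) - (1 - y_t)\, p_t \bigr]
   = p_t - y_t, \\
\frac{\partial L_t}{\partial w_i} &= (p_t - y_t)\, x_i .
\end{aligned}
```

For a dimension whose identification value is $x_i = 1$, the per-dimension gradient is therefore exactly $p_t - y_t$.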

The push server can obtain a gradient value according to the gradient function and, by accumulating the square of the gradient, obtain the sum of variances of the estimated deviations, $n_t = n_{t-1} + grad^2$, and from it the iterative learning rate of each dimension, where $\alpha$ and $\beta$ are manually adjusted parameters that can be set by technicians. The weight value of each dimension is then updated according to the iterative learning rate and the sum of variances of the estimated deviations. Here sgn is the sign function: $\mathrm{sgn}(x) = 1$ if $x > 0$, $\mathrm{sgn}(x) = 0$ if $x = 0$, and $\mathrm{sgn}(x) = -1$ if $x < 0$.
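The two pieces that are fully specified in the text, the sign function and the accumulation of squared gradients, can be sketched directly in Python. The function names `sgn` and `accumulate` are chosen for illustration.

```python
def sgn(x: float) -> int:
    """Sign function: 1 for x > 0, 0 for x == 0, -1 for x < 0."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return 0


def accumulate(n_prev: float, grad: float) -> float:
    """One step of n_t = n_{t-1} + grad^2 for a single dimension."""
    return n_prev + grad * grad


# Example: a sample with estimate p = 0.5 and actual click y = 1 gives grad = -0.5.
n_1 = accumulate(0.0, 0.5 - 1.0)
n_2 = accumulate(n_1, 0.5 - 0.0)
```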

In this embodiment, the updating the weight values of each dimension according to the gradient function specifically includes:

obtaining the gradient value $g_t$ of each sample element according to the gradient function, the click value of the first sample set and the click value of the second sample set;

updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$, and updating the learning rate of each dimension according to the learning rate formula, wherein $n_{t-1}$ represents the sum of the Euclidean distances of the gradients from the 1st sample to the (t-1)-th sample, and $\alpha$ and $\beta$ respectively represent manually adjusted parameters;

updating the weight value of each dimension according to the updated learning rate and Euclidean distance, where $\omega$ represents the weight set composed of the weight values of the respective dimensions.

In this embodiment, a locally optimal solution of the logistic regression may be obtained by stochastic gradient descent; that is, the gradient value $g_t$ of each sample element is obtained from the gradient function, the actual click value of the sample element, and the estimated click value of the sample element. For example:

updating the Euclidean distance of each dimension according to the Euclidean distance formula $n_t = n_{t-1} + g_t^2$;

updating the iterative learning rate of each dimension according to the learning rate formula;

obtaining the updated actual weight value according to the iterative learning rate and the Euclidean distance,

where $\omega$ represents the weight set composed of the weight values of the respective dimensions.
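The learning rate and weight update formulas are not reproduced in the text, so the following Python sketch substitutes a published algorithm that uses the same ingredients (the accumulated squared gradient $n$, a per-dimension rate $\sigma$, the parameters $\alpha$ and $\beta$, and the sign function): per-coordinate FTRL-Proximal for logistic regression. Treat it as an assumption about how such an update could look, not as the formula of this document. The class name, the auxiliary state `z`, and the regularization parameters `l1` and `l2` are not taken from the source.

```python
import math
from collections import defaultdict
from typing import Dict, List


class FtrlProximal:
    """Per-coordinate FTRL-Proximal sketch; each active feature has x = 1."""

    def __init__(self, alpha: float = 0.1, beta: float = 1.0,
                 l1: float = 0.0, l2: float = 0.0) -> None:
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z: Dict[str, float] = defaultdict(float)  # auxiliary state (assumption)
        self.n: Dict[str, float] = defaultdict(float)  # accumulated squared gradients

    def weight(self, f: str) -> float:
        z = self.z[f]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[f])) / self.alpha + self.l2)

    def predict(self, features: List[str]) -> float:
        s = sum(self.weight(f) for f in features)
        s = max(min(s, 35.0), -35.0)  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, features: List[str], y: int) -> float:
        p = self.predict(features)
        g = p - y  # gradient for every active dimension (x = 1)
        for f in features:
            w = self.weight(f)  # weight before this sample is absorbed
            n_new = self.n[f] + g * g  # n_t = n_{t-1} + g_t^2
            sigma = (math.sqrt(n_new) - math.sqrt(self.n[f])) / self.alpha
            self.z[f] += g - sigma * w
            self.n[f] = n_new
        return p


# Toy stream, invented for the example: ad B is always clicked, ad A never.
model = FtrlProximal()
for _ in range(50):
    model.update(["ad_content=ad B"], 1)
    model.update(["ad_content=ad A"], 0)
```

After the toy stream the estimate for advertisement B lies above 0.5 and the estimate for advertisement A below it, so the learned weights reverse a static ranking that favored advertisement A.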

In this embodiment, after the weight set is acquired, the method further includes: acquiring the accumulated weight sum of the dimensions according to the weight set, and obtaining the click rate value of each sample element through the logistic regression formula. In the specific application scenario of this embodiment, the weight values of the dimensions are updated in real time by mining and analyzing the log information of users clicking advertisements, and after the logistic regression online gradient algorithm was adopted, the estimation accuracy measured by AUC (a ranking evaluation metric) rose from the original 0.65 to 0.79. The estimation error is thereby reduced and the accuracy of advertisement pushing is improved.
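For readers unfamiliar with AUC, a minimal Python sketch of the metric follows. It computes the fraction of (clicked, not clicked) pairs in which the clicked sample receives the higher estimate, counting ties as one half. The function name `auc` and the toy labels and scores are invented for illustration and are unrelated to the 0.65 and 0.79 figures reported above.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Pairwise AUC: P(score of a clicked sample > score of a non-clicked sample)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one clicked and one non-clicked sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: one of the four (clicked, not clicked) pairs is ordered wrongly.
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

The pairwise form is quadratic in the number of samples, so it is suitable only for small checks; a rank-based computation would be preferable on real logs.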

The present embodiment also provides a data analysis apparatus, as shown in fig. 3, including:

the reading module is used for extracting log information of the push server and acquiring two mutually exclusive sample sets according to the log information, wherein sample elements of each sample set comprise information of at least two dimensions and mutually exclusive information, and the sample elements in the same sample set have mutually exclusive information with the same content;

the weight updating module is used for updating the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm;

and the pushing module is used for determining a pushing result according to the updated weight value of each dimension.

A sample set includes business information of at least two dimensions and user click information, where the types of the business information include at least: a user code, a commodity code, a user search term, and an advertisement auction word, and the user click information is used for indicating whether the user clicked the displayed advertisement.

The data analysis device provided by the embodiment of the invention can analyze the log information of data interaction between the user and the push server, update the weight value of each dimensionality of the pushed data in real time, and re-determine the push result according to the updated weight value of each dimensionality. Compared with the prior art, the method and the device can update the weighted value in real time, so that the estimated error is corrected, and the accuracy of the pushed data is improved.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method of data analysis, comprising:

determining a pushing result according to the updated weight value of each dimension;

wherein a push server acquires the query term of each user search together with the corresponding displayed advertisements and clicked advertisements; splits the search terms and displayed advertisements of one session so that each search term is paired with its corresponding advertisements; splices in the click information of the advertisements to form (shown, clicked) or (shown, not clicked) advertisement sets; and then, according to the interaction history between the push server and the users' terminal devices, selects the dimensions that influence (shown, clicked) advertisements to obtain the sample elements;

a sample set includes business information of at least two dimensions and user click information, and the types of the business information include at least: a user code, a commodity code, a user search term, and an advertisement auction word, wherein the user click information is used for indicating whether the user clicked the displayed advertisement;

the updating of the weight value of each dimension by using the two mutually exclusive sample sets through a logistic regression online learning algorithm comprises:

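acquiring, according to the two mutually exclusive sample sets, the click value of the first sample set and the click value of the second sample set;

obtaining a loss function $l_t(w_t) = y_t \log p_t + (1 - y_t) \log(1 - p_t)$, and obtaining a gradient function $grad = p_t - y_t$ according to the loss function, wherein $y_t$ represents the actual click value, $t$ represents the sample number, and $1 - y_t$ represents the actual non-click value;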

updating the weight value of each dimension according to the gradient function;

the updating the weight values of the dimensions according to the gradient function includes:

wherein $\omega$ represents the weight set composed of the weight values of all dimensions, $\alpha$ and $\beta$ respectively represent manually adjusted parameters, and $\sigma_t$ represents the updated learning rate;

further comprising:

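acquiring the accumulated weight sum of the dimensions according to the weight set, and obtaining the click rate value of each sample element through the logistic regression formula.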

2. A data analysis apparatus, comprising:

the pushing module is used for determining a pushing result according to the updated weight value of each dimension;

acquiring, according to the two mutually exclusive sample sets, the click value of the first sample set and the click value of the second sample set,

wherein the two mutually exclusive sample sets are represented as $(I_{click}, I_{noclick})$, $x$ denotes an identification value of one dimension, and $w$ denotes the influence coefficient of that dimension on the overall click;

updating the weight value of each dimension according to the gradient function;

wherein $\omega$ represents the weight set formed by the weight values of all dimensions, and $\alpha$ and $\beta$ respectively represent manually adjusted parameters;

further comprising:

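acquiring the accumulated weight sum of the dimensions according to the weight set, and obtaining the click rate value of each sample element through the logistic regression formula.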