CN112184279A - AUC index rapid calculation method and device and computer equipment - Google Patents

Publication number
CN112184279A
Authority
CN
China
Legal status
Pending
Application number
CN201910604730.5A
Other languages
Chinese (zh)
Inventor
邓勇
何其真
王瑜
黄昉
吴安新
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN201910604730.5A
Publication of CN112184279A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities

Landscapes

  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for quickly calculating AUC indexes, which comprises the following steps: acquiring sample data and a prediction probability corresponding to each sample data; counting the number X of positive sample data and the number Y of negative sample data; respectively establishing a plurality of data sub-buckets for positive sample data and negative sample data, and setting a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label; dividing the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; and finally, counting L1_i and L0_i and calculating the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule. The invention also provides a device for quickly calculating the AUC index, a computer device and a computer-readable storage medium. The invention can quickly calculate the AUC index with fewer hardware resources, improving calculation speed and efficiency.

Description

AUC index rapid calculation method and device and computer equipment
Technical Field
The invention relates to the technical field of machine learning model training, and in particular to a method and an apparatus for quickly calculating an AUC (Area Under Curve) index, a computer device, and a computer-readable storage medium.
Background
With the rapid development of computer technology, computer devices are widely used in daily life. A user typically watches videos, reads news, or plays games on a computer device, and the device displays the rich content provided by the operator on a display page. While displaying content the user prefers, the operator usually also makes selective data recommendations, such as advertisement recommendations. The recommended data should match the user's preferences; otherwise the user may find the recommendations irrelevant, and the user's engagement with the operator decreases. Therefore, when selecting recommendation data, the operator generally obtains the page content the user has viewed or clicked, predicts the user's preferred content with a machine learning model, selects the content the model predicts with the highest probability, and pushes the corresponding recommendation data to the user's display page.
However, a user's preferences typically change over time, so the machine learning model also needs to be adjusted over time in order to predict the user's preferred content more accurately. In the prior art, a preset machine learning model makes predictions on the page content information viewed or clicked by the user, an AUC (Area Under Curve) index of the machine learning model is calculated from those predictions, and the model is then adaptively adjusted according to the AUC. However, because the page content information viewed or clicked by users, i.e. the sample data, is extremely large, calculating the AUC index with existing methods consumes too many hardware resources and is inefficient.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, a computer device, and a computer-readable storage medium for fast calculation of an AUC index, which can, after obtaining sample data and the prediction probability corresponding to each sample data, count the number X of positive sample data and the number Y of negative sample data; respectively establish a plurality of data sub-buckets for positive sample data and negative sample data, and set a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label; then divide the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; and finally count L1_i and L0_i and calculate the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule. In this way, the AUC index can be rapidly calculated with fewer hardware resources, and the calculation speed and efficiency are improved.
First, in order to achieve the above object, the present invention provides a method for rapidly calculating an AUC indicator, including:
acquiring sample data and a prediction probability corresponding to each sample data, wherein the sample data comprises positive sample data and negative sample data, and the prediction probability is the similarity probability between the sample data and the corresponding target data identified by an identification model; counting the number X of positive sample data and the number Y of negative sample data; establishing a plurality of data sub-buckets for positive sample data and negative sample data respectively, and setting a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label, and the bucket label comprises bucket label 1 for storing positive sample data and bucket label 0 for storing negative sample data; dividing the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; counting L1_i and L0_i, wherein L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0; and calculating the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule.
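As a rough illustration of these steps (not part of the patent: this is a hedged Python sketch, the function name `bucketed_auc` is invented, and the final loop follows the rank-based reading of the calculation rule described in the claims):

```python
def bucketed_auc(probs, labels, n_buckets=1000):
    """Approximate AUC by bucketing predictions.

    probs: prediction probabilities in [0, 1]; labels: 1 = positive, 0 = negative.
    Bucket serial number i corresponds to the probability threshold i / n_buckets.
    """
    # L1[i] / L0[i]: number of positive / negative samples in bucket i.
    L1 = [0] * (n_buckets + 1)
    L0 = [0] * (n_buckets + 1)
    X = Y = 0  # total positives / negatives
    for p, y in zip(probs, labels):
        i = min(n_buckets, max(1, round(p * n_buckets)))  # nearest threshold
        if y == 1:
            L1[i] += 1
            X += 1
        else:
            L0[i] += 1
            Y += 1
    # A positive in bucket i outranks the negatives in lower buckets (M0)
    # and ties (counted as 1/2) with the negatives in its own bucket.
    auc, M0 = 0.0, 0
    for i in range(1, n_buckets + 1):
        auc += L1[i] * (M0 + L0[i] / 2)
        M0 += L0[i]
    return auc / (X * Y)
```

With well-separated scores this reproduces the exact pairwise AUC; samples landing in the same bucket are treated as ties, which is where the approximation, and the memory saving of O(n_buckets) instead of O(samples), comes from.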
Preferably, the method further comprises: and dividing the sample data into data sub-buckets through a mapreduce system, and counting the number of the sample data in each data sub-bucket.
Preferably, the dividing of the sample data into the data sub-buckets by the mapreduce system and the counting of the number of sample data in each data sub-bucket include: setting the bucket serial number and the bucket label as the key, and setting the number of sample data in the data sub-bucket as the value; and inputting the sample data into the mapreduce system and directly obtaining the output values <key, value> comprising the sub-bucket identifier of each data sub-bucket and the number of its sample data.
Preferably, the number of data sub-buckets storing sample data is adjustable, and the number of data sub-buckets is smaller than the number of data samples.
Preferably, the step of setting a bucket serial number i for each data sub-bucket includes: acquiring the number n of data sub-buckets for storing positive sample data or negative sample data; assigning serial numbers 1 to n to the data sub-buckets storing positive sample data and to those storing negative sample data respectively; and setting the ratio i/n of the bucket serial number i to the number n of data sub-buckets as the probability threshold of the data sub-bucket corresponding to bucket serial number i.
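A minimal sketch of this numbering scheme (illustrative Python; the name `bucket_for` is invented here, and rounding to the nearest threshold is one plausible reading of the claim):

```python
def bucket_for(p, n):
    """Return the bucket serial number i in 1..n whose probability
    threshold i / n lies closest to the prediction probability p."""
    return min(n, max(1, round(p * n)))
```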
Preferably, the step of dividing the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability includes: finding the sample data whose prediction probability falls within the fluctuation range of a data sub-bucket's probability threshold, wherein the fluctuation range is a preset up-down floating interval; and dividing that sample data into the data sub-bucket.
Preferably, the calculation rule comprises the following formulas:

$$\mathrm{AUC} = \frac{1}{X \cdot Y} \sum_{i=1}^{n} L1_i \left( M0_i + \frac{L0_i}{2} \right)$$

$$M0_i = \sum_{j=1}^{i-1} L0_j$$

wherein j is the bucket serial number, L0_j is the number of negative sample data in the data sub-bucket with bucket serial number j and bucket label 0, M0_i is the number of negative sample data included in the data sub-buckets with bucket label 0 and bucket serial number less than i, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
In addition, to achieve the above object, the present invention further provides an AUC indicator fast calculation apparatus, including:
an acquisition module, configured to acquire sample data and a prediction probability corresponding to each sample data, wherein the sample data comprises positive sample data and negative sample data, and the prediction probability is the similarity probability between the sample data and the corresponding target data identified by an identification model; a first statistical module, configured to count the number X of positive sample data and the number Y of negative sample data; an establishing module, configured to establish a plurality of data sub-buckets for positive sample data and negative sample data respectively and set a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label, and the bucket label comprises bucket label 1 for storing positive sample data and bucket label 0 for storing negative sample data; a dividing module, configured to divide the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; a second statistical module, configured to count L1_i and L0_i, wherein L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0; and a calculation module, configured to calculate the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule.
Further, the present invention also provides a computer device, which includes a memory and a processor, where the memory stores a computer program that can be executed on the processor, and when the computer program is executed by the processor, the computer program implements the steps of the AUC indicator fast calculation method as described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to execute the steps of the AUC indicator fast calculation method as described above.
Compared with the prior art, the AUC index rapid calculation method, apparatus, computer device and computer-readable storage medium provided by the invention can, after acquiring sample data and the prediction probability corresponding to each sample data, count the number X of positive sample data and the number Y of negative sample data; respectively establish a plurality of data sub-buckets for positive sample data and negative sample data, and set a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label; then divide the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; and finally count L1_i and L0_i and calculate the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule. In this way, the AUC index can be rapidly calculated with fewer hardware resources, and the calculation speed and efficiency are improved.
Drawings
FIG. 1 is a diagram of an alternative application environment in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of an alternative hardware architecture for the computer device of the present invention;
FIG. 3 is a block diagram of a program module of an embodiment of the apparatus for fast calculating AUC indicator according to the present invention;
FIG. 4 is a schematic flow chart of an embodiment of a method for fast calculating an AUC indicator according to the present invention;
FIG. 5 is a detailed flowchart of step S304 in FIG. 4;
FIG. 6 is a detailed flowchart of step S306 in FIG. 4;
FIG. 7 is a schematic flow chart of another embodiment of a method for fast calculating an AUC indicator according to the present invention;
fig. 8 is a detailed flowchart of step S606 in fig. 7.
Reference numerals:
Computer device 1
Memory 11
Processor 12
Network interface 13
AUC indicator fast calculation apparatus 200
Acquisition module 201
First statistical module 202
Establishing module 203
Dividing module 204
Second statistical module 205
Calculation module 206
The objects, features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Fig. 1 is a view showing an alternative application environment according to an embodiment of the present invention; fig. 2 is a schematic diagram of an alternative hardware architecture of the computer device 1 according to the present invention.
In this embodiment, the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to one another through a system bus. The computer device 1 is connected to a network (not shown in fig. 2) through the network interface 13, and is connected to a data warehouse through the network, wherein the data warehouse may be a stand-alone data server, a data storage unit in another computer device, or a data storage unit inside the computer device 1. Of course, in other embodiments, the computer device 1 may also be connected to other terminal devices, such as a mobile terminal, user equipment (UE), a mobile phone (handset), portable equipment, a PC terminal, and the like (not shown in fig. 1), and perform data interaction with the user through these terminal devices. The network may be a wireless or wired network such as an Intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.
It is noted that fig. 2 only shows the computer device 1 with components 11-13, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 11 may be an internal storage unit of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided in the computer device 1. Of course, the memory 11 may also comprise both an internal storage unit of the computer device 1 and an external storage device thereof. In this embodiment, the memory 11 is generally used for storing an operating system installed in the computer device 1 and various types of application software, such as program codes of the AUC indicator fast calculation apparatus 200. Furthermore, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally used for controlling the overall operation of the computer device 1, such as performing data interaction or communication related control and processing. In this embodiment, the processor 12 is configured to run the program code or the processing data stored in the memory 11, for example, run an application program corresponding to the recognition model, run the AUC indicator fast calculation apparatus 200, and the like. In this embodiment, the AUC indicator fast calculation means 200 is a functional unit independent from the identification model, and in other embodiments, the AUC indicator fast calculation means 200 may also be a sub-functional unit in the identification model, which is not limited herein.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing communication connection between the computer device 1 and other terminal devices such as mobile terminals, user equipment, mobile phones and portable devices, PC terminals, and data warehouses.
In this embodiment, when the AUC indicator fast calculation apparatus 200 is installed and runs in the computer device 1, it can, after obtaining sample data and the prediction probability corresponding to each sample data, count the number X of positive sample data and the number Y of negative sample data; respectively establish a plurality of data sub-buckets for positive sample data and negative sample data, and set a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label; then divide the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; and finally count L1_i and L0_i and calculate the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule. In this way, the AUC index can be rapidly calculated with fewer hardware resources, and the calculation speed and efficiency are improved.
The application environment and the hardware structure and function of the related devices of the various embodiments of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described application environment and related devices.
First, the present invention provides a device 200 for fast calculating AUC indexes.
Fig. 3 is a block diagram of an embodiment of the AUC indicator fast calculation apparatus 200 according to the present invention.
In this embodiment, the AUC indicator fast calculation apparatus 200 includes a series of computer program instructions stored on the memory 11, and when the computer program instructions are executed by the processor 12, the AUC indicator fast calculation function of the embodiments of the present invention can be implemented. In some embodiments, the AUC indicator fast calculation apparatus 200 may be divided into one or more modules based on the specific operations implemented by the portions of the computer program instructions. For example, in fig. 3, the AUC indicator fast calculation apparatus 200 may be divided into an acquisition module 201, a first statistics module 202, an establishment module 203, a division module 204, a second statistics module 205, and a calculation module 206. Wherein:
the obtaining module 201 is configured to obtain sample data and a prediction probability corresponding to each sample data. The sample data comprises positive sample data and negative sample data, and the prediction probability is the similarity probability of the sample data and the corresponding target data identified by the identification model.
Specifically, the computer device 1 is connected to a data warehouse, so that sample data and a prediction probability corresponding to each sample data can be acquired from the data warehouse. Of course, the sample data in the data warehouse needs to be first input into an identification model for identification, in this embodiment, the identification model runs on the computer device 1, and in other embodiments, the identification model may also run on other computer devices. After the identification model identifies the sample data, the similarity probability of the sample data and the corresponding target data is used as the prediction probability of the sample data to be returned to the data warehouse. For example, when the recognition model is a user behavior analysis model, the sample data may be a record of a user's click or view on an application page; the user behavior analysis model may identify a probability that the user belongs to a particular user type based on these clicks or views, or may give a probability that the user likes a particular content, i.e., a predicted probability. Therefore, the obtaining module 201 may obtain the sample data and the prediction probability corresponding to each sample data after the recognition model returns the prediction probability for the sample data.
The first statistical module 202 is configured to count the number X of positive sample data and the number Y of negative sample data.
Specifically, after the obtaining module 201 obtains the sample data, the first statistical module 202 further analyzes the types of all the sample data, and then counts the number X of positive sample data and the number Y of negative sample data. In this embodiment, after the identification model identifies sample data, it gives, for positive sample data, the similarity probability between the positive sample data and the target data, and, for negative sample data, the similarity probability between the negative sample data and the target data; these probabilities are associated one-to-one with the identified sample data, and both positive and negative sample data carry corresponding identifiers. Therefore, the first statistical module 202 can identify and count the number of positive sample data and the number of negative sample data according to the respective identifiers of the positive sample data and the negative sample data.
The establishing module 203 is configured to establish a plurality of data sub-buckets for positive sample data and negative sample data, and set a sub-bucket identifier for each data sub-bucket. The sub-bucket identification comprises a bucket serial number i and a bucket label, and the bucket label comprises a bucket label 1 for storing positive sample data and a bucket label 0 for storing negative sample data.
Specifically, the computer device 1 further provides an interactive interface for the user to set the data sub-buckets for storing sample data, and then receives the number of data sub-buckets for storing sample data input by the user in order to establish the data sub-buckets. In this embodiment, the number of data sub-buckets storing sample data is adjustable, and the number of data sub-buckets is smaller than the number of data samples. A data sub-bucket may be a partition-table structure describing the addresses of the stored sample data, or a specific storage unit. After receiving the number of data sub-buckets input by the user, the establishing module 203 establishes a plurality of data sub-buckets for positive sample data and negative sample data respectively, and sets a sub-bucket identifier for each data sub-bucket, where the sub-bucket identifier includes a bucket serial number i and a bucket label, and the bucket label includes bucket label 1 for storing positive sample data and bucket label 0 for storing negative sample data. In this embodiment, the step of the establishing module 203 setting the bucket serial number i for each data sub-bucket includes: first acquiring the number n of data sub-buckets for storing positive sample data or negative sample data; then assigning serial numbers 1 to n to the data sub-buckets storing positive sample data and to those storing negative sample data respectively. For example, if there are 100,000 positive sample data and 100,000 negative sample data, and the number of data sub-buckets input by the user is 1000, the establishing module 203 establishes 1000 data sub-buckets with bucket serial numbers 1 to 1000 for storing the positive sample data, and another 1000 data sub-buckets with bucket serial numbers 1 to 1000 for storing the negative sample data.
Of course, when establishing the data sub-buckets, the establishing module 203 also sets the ratio i/n of the bucket serial number i to the number n of data sub-buckets as the probability threshold of the data sub-bucket corresponding to bucket serial number i. For example, continuing the example above, for the data sub-bucket with bucket serial number 500, the corresponding probability threshold is 500/1000 = 0.50.
The dividing module 204 is configured to divide the sample data into data sub-buckets with corresponding bucket sequence numbers according to the prediction probability.
Specifically, the dividing module 204 first finds the sample data whose prediction probability falls within the fluctuation range of a data sub-bucket's probability threshold, where the fluctuation range is a preset up-down floating interval; it then divides that sample data into the data sub-bucket. In this embodiment, when the establishing module 203 sets the probability threshold for each data sub-bucket, it also sets a fluctuation range for dividing sample data whose prediction probability is not exactly equal to a probability threshold into the data sub-buckets. For example, when the fluctuation range is set to -0.05 to +0.05, sample data with a prediction probability between (0.50-0.05) and (0.50+0.05) is divided into the data sub-bucket corresponding to the probability threshold 0.50, i.e. the data sub-bucket with bucket serial number 500.
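The membership test described here could look as follows (hedged Python sketch; the name `in_fluctuation_range` and the ±0.05 default merely mirror the example in this paragraph):

```python
def in_fluctuation_range(p, i, n, delta=0.05):
    """True if the prediction probability p falls within the preset
    up-down floating interval (plus or minus delta) around bucket i's
    probability threshold i / n."""
    return i / n - delta <= p <= i / n + delta
```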
The second statistical module 205 is configured to count L1_i and L0_i, wherein L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.

Specifically, after the dividing module 204 divides all positive sample data into the data sub-buckets storing positive sample data and all negative sample data into the data sub-buckets storing negative sample data, the second statistical module 205 counts the number of sample data included in each data sub-bucket, giving L1_i and L0_i as defined above.
It should be noted that, in this embodiment, the computer device 1 divides the sample data into the corresponding data sub-buckets through the dividing module 204, and counts the number of sample data in each data sub-bucket through the second statistical module 205.
Of course, in another embodiment, the computer device 1 may also divide the sample data into data sub-buckets through a mapreduce system and count the number of sample data in each data sub-bucket. The mapreduce system is a programming model for mapping and reduction that can directly classify and count the data in a database. Specifically, the computer device 1 inputs all sample data into a preset map function by calling the mapreduce system, which then produces output values <key, value>. In this embodiment, the computer device 1 configures the map function of the mapreduce system in advance; specifically, the bucket serial number and the bucket label are set as the key, and the number of sample data in a data sub-bucket is set as the value. Then, the sample data is input into the mapreduce system, and the output values <key, value>, comprising the sub-bucket identifier of each data sub-bucket and the number of sample data it contains, are obtained directly. The techniques by which the mapreduce system reduces and counts data belong to the prior art and are not described here.
The calculation module 206 is configured to calculate the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule.
Specifically, the calculation rule includes the following formulas:

AUC = (Σ_i (L1_i × M0_i + 0.5 × L1_i × L0_i)) / (X × Y), (i > 0)

M0_i = Σ_{j<i} L0_j

wherein j is the bucket serial number, L0_j is the number of negative sample data in the data sub-bucket with bucket serial number j and bucket label 0, M0_i is the number of negative sample data contained in the data sub-buckets with bucket label 0 and bucket serial number less than i, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
In this embodiment, if the accuracy of the recognition model were perfect, the recognition model would give every positive sample datum a higher prediction probability than every negative sample datum, in which case the AUC is 1. A recognition model in practice is not that accurate, so there are three cases when the recognition model recognizes sample data: the prediction probability of the positive sample datum is higher than that of the negative sample datum, in which case the recognition model predicts correctly and gives beneficial information, which is the goal the model is meant to achieve; the prediction probability of the positive sample datum is equal to that of the negative sample datum, in which case the recognition model gives essentially no beneficial information, but no harmful information either; and the prediction probability of the positive sample datum is lower than that of the negative sample datum, in which case the recognition model misjudges, and negative sample data would be mistaken for positive sample data.
Therefore, for all the sample data recognized by the recognition model, the number of positive-negative sample pairs that can be constructed is X × Y. Among these many pairs, one can count: the number of times the recognition model predicts correctly and gives beneficial information, scored 1 each; the number of times the recognition model gives essentially no beneficial information, scored 0.5 each; and the number of times the recognition model misjudges, scored 0 each. Finally, the total score is divided by X × Y for normalization, so that the AUC lies between 0 and 1. In this manner, the calculation module 206 can then calculate the AUC index according to X, Y, L1_i, L0_i and the preset calculation rule.
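As an illustrative sketch (not part of the claimed embodiment; the function name and data are hypothetical), the pair-counting described above can be written directly as a brute-force loop over all X × Y positive-negative pairs — this is the slow baseline that the bucketed calculation rule accelerates:

```python
def auc_pairwise(pos_probs, neg_probs):
    """Brute-force AUC: over all (positive, negative) pairs, score 1 when
    the positive is ranked higher, 0.5 on a tie, 0 on a misjudgment,
    then normalize by X * Y so the result lies between 0 and 1."""
    score = 0.0
    for p in pos_probs:
        for n in neg_probs:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5
    return score / (len(pos_probs) * len(neg_probs))

# 3 positives, 2 negatives -> 6 pairs; 5 wins, 1 loss -> AUC = 5/6
print(auc_pairwise([0.9, 0.8, 0.4], [0.5, 0.3]))
```

With X and Y each on the order of hundreds of thousands, this double loop becomes the bottleneck that motivates the sub-bucket scheme.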
After the calculation module 206 calculates the AUC index of the recognition model, the computer device 1 adaptively adjusts the recognition model according to the AUC index. Adjusting a recognition model according to its AUC is commonly used in the prior art and is not described here.
As can be seen from the above, after the computer device 1 obtains the sample data and the prediction probability corresponding to each sample datum, it counts the number X of positive sample data and the number Y of negative sample data; establishes a plurality of data sub-buckets for positive sample data and negative sample data respectively, and sets a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label; then divides the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; and finally counts L1_i and L0_i and calculates the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule. In this way, the AUC index can be calculated rapidly with fewer hardware resources, improving calculation speed and efficiency.
In addition, the invention also provides a method for quickly calculating the AUC index, and the method is applied to computer equipment.
Referring to fig. 4, it is a schematic flow chart of an embodiment of the method for rapidly calculating an AUC indicator of the present invention, and the method includes steps S300 to S310:
step S300, sample data and the prediction probability corresponding to each sample data are obtained. The sample data comprises positive sample data and negative sample data, and the prediction probability is the similarity probability of the sample data and the corresponding target data identified by the identification model.
Specifically, the computer device is connected to a data warehouse, so that the sample data and the prediction probability corresponding to each sample datum can be acquired from the data warehouse. Of course, the sample data in the data warehouse first needs to be input into a recognition model for recognition; in this embodiment the recognition model runs on the computer device, while in other embodiments it may run on another computer device. After the recognition model recognizes the sample data, the similarity probability between each sample datum and the corresponding target data is returned to the data warehouse as that sample datum's prediction probability. For example, when the recognition model is a user behavior analysis model, the sample data may be records of a user's clicks or views on an application page; the user behavior analysis model may identify the probability that the user belongs to a particular user type based on these clicks or views, or may give the probability that the user likes particular content, i.e., the prediction probability. Therefore, the computer device can obtain the sample data and the prediction probability corresponding to each sample datum after the recognition model returns the prediction probabilities.
Step S302, counting the number X of positive sample data and the number Y of negative sample data.
Specifically, after the computer device obtains the sample data, it can further analyze the type of each sample datum, and then count the number X of positive sample data and the number Y of negative sample data. In this embodiment, after the recognition model recognizes the sample data, it gives, for each positive sample datum, the similarity probability between that positive sample datum and the target data, and, for each negative sample datum, the similarity probability between that negative sample datum and the target data; these probabilities are associated one-to-one with the recognized sample data, and both positive and negative sample data carry a corresponding identifier. Therefore, the computer device can distinguish and count the positive sample data and the negative sample data according to the identifiers they carry.
Step S304, a plurality of data sub-buckets are respectively established for the positive sample data and the negative sample data, and a sub-bucket identifier is set for each data sub-bucket. The sub-bucket identification comprises a bucket serial number i and a bucket label, and the bucket label comprises a bucket label 1 for storing positive sample data and a bucket label 0 for storing negative sample data.
Specifically, the computer device also provides an interactive interface for setting data sub-buckets for storing sample data for users, and then receives the number of the data sub-buckets for storing the sample data input by the users to establish the data sub-buckets. In this embodiment, the number of data sub-buckets storing sample data may be adjusted, and the number of data sub-buckets is smaller than the number of data samples. The data sub-bucket can be a partition table structure for describing the address of the stored sample data, and can also be a specific storage unit. After receiving the number of data sub-buckets input by a user, the computer equipment respectively establishes a plurality of data sub-buckets for positive sample data and negative sample data, and sets a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label, and the bucket label comprises a bucket label 1 for storing the positive sample data and a bucket label 0 for storing the negative sample data.
Referring to FIG. 5, in an exemplary embodiment, step S304 of FIG. 4 may include steps S400-S404:
Step S400, acquiring the number n of data sub-buckets for storing positive sample data or negative sample data.
Step S402, using 1 to n as the bucket serial numbers of the data sub-buckets for storing positive sample data and of the data sub-buckets for storing negative sample data respectively.
Step S404, setting the ratio i/n of the bucket serial number i to the number n of the data sub-buckets as the probability threshold value of the data sub-buckets corresponding to the bucket serial number i.
In this embodiment, the step in which the computer device sets a bucket serial number i for each data sub-bucket includes: first, acquiring the number n of data sub-buckets for storing positive sample data or negative sample data; and then using 1 to n as the bucket serial numbers of the data sub-buckets for storing positive sample data and of the data sub-buckets for storing negative sample data respectively. For example, if there are 100,000 positive sample data and 100,000 negative sample data, and the number of data sub-buckets input by the user is 1000, the computer device establishes 1000 data sub-buckets with bucket serial numbers 1 to 1000 for storing the positive sample data, and another 1000 data sub-buckets with bucket serial numbers 1 to 1000 for storing the negative sample data.
Of course, when the computer device establishes the data sub-buckets, it sets the ratio i/n of the bucket serial number i to the number n of data sub-buckets as the probability threshold of the data sub-bucket corresponding to bucket serial number i. For example, as described above, for the data sub-bucket with bucket serial number 500, the corresponding probability threshold is 500/1000 = 0.50.
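A minimal sketch of this bucket setup (variable names are illustrative, not from the embodiment):

```python
n = 1000  # number of data sub-buckets per bucket label, as input by the user

# Bucket serial numbers run from 1 to n; the probability threshold of the
# sub-bucket with serial number i is i/n, so bucket 500 maps to 0.50.
thresholds = {i: i / n for i in range(1, n + 1)}

print(thresholds[500], thresholds[1000])  # 0.5 1.0
```

The same table would be built twice in the embodiment: once for the buckets with bucket label 1 (positive samples) and once for those with bucket label 0 (negative samples).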
Step S306, dividing the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability.
Referring to fig. 6, in an exemplary embodiment, step S306 in fig. 4 may include steps S500 to S502:
step S500, sample data corresponding to the prediction probability within the fluctuation range of the probability threshold of the data sub-bucket is found out, wherein the fluctuation range is a preset up-down floating interval.
Step S502, dividing the sample data into the data sub-buckets.
Specifically, the computer device first finds the sample data whose prediction probability falls within the fluctuation range of the probability threshold of a data sub-bucket, where the fluctuation range is a preset up-down floating interval; then, it divides that sample data into the data sub-bucket. In this embodiment, when the computer device sets the probability threshold for each data sub-bucket, it also sets a fluctuation range so that sample data whose prediction probability is not exactly equal to a probability threshold can still be divided into a data sub-bucket. For example, when the fluctuation range is set to -0.05 to +0.05, sample data with a prediction probability between (0.50-0.05) and (0.50+0.05) is divided into the data sub-bucket corresponding to the probability threshold of 0.50, i.e., the data sub-bucket with bucket serial number 500.
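One way to read this division rule — assuming, as a sketch, that each prediction probability is routed to the sub-bucket whose threshold i/n is closest to it, i.e. the bucket whose fluctuation interval contains it — is:

```python
def assign_bucket(prob, n):
    """Map a prediction probability to the serial number (1..n) of the
    data sub-bucket whose probability threshold i/n is nearest to prob.
    A sketch of the 'fluctuation range' rule, assuming the intervals
    tile [0, 1] without overlap."""
    i = round(prob * n)
    return min(max(i, 1), n)  # clamp so every probability lands in a bucket

print(assign_bucket(0.50, 1000))   # bucket serial number 500
print(assign_bucket(0.503, 1000))  # 503
```

Note that this particular nearest-threshold mapping is an assumption; the embodiment only requires that each probability fall inside exactly one bucket's fluctuation range.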
Step S308, counting L1_i and L0_i, where L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
Specifically, after the computer device divides all positive sample data into the data sub-buckets storing positive sample data and all negative sample data into the data sub-buckets storing negative sample data, it further counts the number of sample data contained in each data sub-bucket. In this embodiment, L denotes the number of sample data in a data sub-bucket, so the computer device counts L1_i and L0_i, where L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
Step S310, calculating the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule.
Specifically, the calculation rule includes the following formulas:

AUC = (Σ_i (L1_i × M0_i + 0.5 × L1_i × L0_i)) / (X × Y), (i > 0)

M0_i = Σ_{j<i} L0_j

wherein j is the bucket serial number, L0_j is the number of negative sample data in the data sub-bucket with bucket serial number j and bucket label 0, M0_i is the number of negative sample data contained in the data sub-buckets with bucket label 0 and bucket serial number less than i, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
In this embodiment, if the accuracy of the recognition model were perfect, the recognition model would give every positive sample datum a higher prediction probability than every negative sample datum, in which case the AUC is 1. A recognition model in practice is not that accurate, so there are three cases when the recognition model recognizes sample data: the prediction probability of the positive sample datum is higher than that of the negative sample datum, in which case the recognition model predicts correctly and gives beneficial information, which is the goal the model is meant to achieve; the prediction probability of the positive sample datum is equal to that of the negative sample datum, in which case the recognition model gives essentially no beneficial information, but no harmful information either; and the prediction probability of the positive sample datum is lower than that of the negative sample datum, in which case the recognition model misjudges, and negative sample data would be mistaken for positive sample data.
Therefore, for all the sample data recognized by the recognition model, the number of positive-negative sample pairs that can be constructed is X × Y. Among these many pairs, one can count: the number of times the recognition model predicts correctly and gives beneficial information, scored 1 each; the number of times the recognition model gives essentially no beneficial information, scored 0.5 each; and the number of times the recognition model misjudges, scored 0 each. Finally, the total score is divided by X × Y for normalization, so that the AUC lies between 0 and 1. In this manner, the computer device can then calculate the AUC index according to X, Y, L1_i, L0_i and the preset calculation rule.
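Under the reconstructed calculation rule, the bucketed AUC becomes a single pass over the n sub-buckets instead of X × Y pair comparisons. A sketch (function and variable names are illustrative):

```python
def auc_bucketed(L1, L0, X, Y):
    """AUC = sum_i (L1_i * M0_i + 0.5 * L1_i * L0_i) / (X * Y), where
    M0_i is the number of negatives in sub-buckets with serial number
    less than i.  L1[i] / L0[i] are the per-bucket positive / negative
    counts for bucket serial numbers i = 1..n."""
    n = len(L1)
    total = 0.0
    M0 = 0  # running count of negative samples in lower buckets
    for i in range(1, n + 1):
        total += L1[i] * M0 + 0.5 * L1[i] * L0[i]
        M0 += L0[i]
    return total / (X * Y)

# 3 buckets: positives in buckets 2 and 3, negatives in buckets 1 and 2
print(auc_bucketed({1: 0, 2: 1, 3: 2}, {1: 1, 2: 1, 3: 0}, 3, 2))
```

For this toy input the brute-force pair count gives the same value (5.5 points over 6 pairs), which is the equivalence the embodiment relies on: samples in a higher bucket beat every negative in lower buckets, and ties are approximated within a bucket.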
After the computer device calculates the AUC index of the recognition model, it further adaptively adjusts the recognition model according to the AUC index. Adjusting a recognition model according to its AUC is commonly used in the prior art and is not described here.
Referring to fig. 7, a schematic flow chart of another embodiment of the method for rapidly calculating an AUC index of the present invention includes steps S300 to S304, S606 and S310. Steps S300 to S304 and step S310 are the same as those of the embodiment shown in fig. 4. Step S606 is as follows:
Step S606, dividing the sample data into data sub-buckets through a mapreduce system, and counting the number of sample data in each data sub-bucket, including L1_i and L0_i, where L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
Referring to FIG. 8, in an exemplary embodiment, step S606 of FIG. 7 includes steps S700-S702:
step S700, setting the bucket serial number and the bucket label as keys, and setting the number of sample data in the data sub-bucket as value.
Step S702, inputting the sample data into the mapreduce system, and directly obtaining the output values <key, value> comprising the sub-bucket identifier of each data sub-bucket and the number of sample data it contains.
In this embodiment, the computer device divides the sample data into data sub-buckets through a mapreduce system and counts the number of sample data in each data sub-bucket. The mapreduce system is a programming model for mapping and reduction that can directly classify and count the data in a database. Specifically, the computer device calls the mapreduce system, inputs all sample data into a preset map function, and then obtains output values <key, value>. The computer device configures the map function of the mapreduce system in advance; specifically, the bucket serial number and the bucket label are set as the key, and the number of sample data in a data sub-bucket is set as the value. Then, the sample data is input into the mapreduce system, and the output values <key, value>, comprising the sub-bucket identifier of each data sub-bucket and the number of sample data it contains, are obtained directly. The techniques by which the mapreduce system reduces and counts data belong to the prior art and are not described here.
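The mapreduce counting described above can be emulated in plain Python as a sketch (a real mapreduce system distributes this across workers; `map_phase`, `reduce_phase`, and the bucket-assignment rule here are illustrative assumptions, not the embodiment's actual functions):

```python
from collections import Counter

def map_phase(samples, n):
    """For each (prediction probability, bucket label) sample, emit
    <key, 1> with key = (bucket serial number, bucket label)."""
    for prob, label in samples:
        i = min(max(round(prob * n), 1), n)  # nearest-threshold bucket, clamped
        yield (i, label), 1

def reduce_phase(pairs):
    """Sum the emitted values per key, yielding <key, sample count>."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

samples = [(0.50, 1), (0.499, 1), (0.50, 0)]  # (probability, label) pairs
print(reduce_phase(map_phase(samples, 1000)))
```

The per-key counts produced by the reduce step are exactly the L1_i and L0_i required by the calculation rule.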
As can be seen from the above, the AUC index rapid calculation method provided in this embodiment obtains the sample data and the prediction probability corresponding to each sample datum, and then counts the number X of positive sample data and the number Y of negative sample data; establishes a plurality of data sub-buckets for positive sample data and negative sample data respectively, and sets a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label; then divides the sample data into the data sub-buckets with the corresponding bucket serial numbers according to the prediction probability; and finally counts L1_i and L0_i and calculates the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule. In this way, the AUC index can be calculated rapidly with fewer hardware resources, improving calculation speed and efficiency.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for rapidly calculating AUC indexes is characterized by comprising the following steps:
acquiring sample data and a prediction probability corresponding to each sample datum, wherein the sample data comprises positive sample data and negative sample data, and the prediction probability is the similarity probability between the sample data and the corresponding target data as identified by an identification model;
counting the number X of positive sample data and the number Y of negative sample data;
establishing a plurality of data sub-buckets for positive sample data and negative sample data respectively, and setting a sub-bucket identifier for each data sub-bucket, wherein the sub-bucket identifier comprises a bucket serial number i and a bucket label, and the bucket label comprises a bucket label 1 for storing the positive sample data and a bucket label 0 for storing the negative sample data;
dividing the sample data into data sub-buckets with corresponding bucket serial numbers according to the prediction probability;
counting L1_i and L0_i, wherein L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0;
calculating the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule.
2. The method for rapid calculation of an AUC indicator according to claim 1, further comprising:
and dividing the sample data into data sub-buckets through a mapreduce system, and counting the number of the sample data in each data sub-bucket.
3. The method for fast calculating AUC indicator according to claim 2, wherein said dividing the sample data into data sub-buckets by mapreduce system, and counting the number of sample data in each data sub-bucket comprises the steps of:
setting the bucket serial number and the bucket label as keys, and setting the number of sample data in the data sub-buckets as values;
and inputting the sample data into a mapreduce system, and directly obtaining output values <key, value> comprising the sub-bucket identifier of each data sub-bucket and the number of sample data.
4. The method for rapid calculation of AUC indicator of claim 1, wherein the number of data sub-buckets for storing sample data can be adjusted and set, and the number of data sub-buckets is smaller than the number of sample data.
5. The method for rapidly calculating an AUC indicator according to claim 1, wherein the step of setting a bucket number i for each data sub-bucket includes:
acquiring the number n of data sub-buckets for storing positive sample data or negative sample data;
using 1 to n as the bucket serial numbers of the data sub-buckets for storing positive sample data and of the data sub-buckets for storing negative sample data respectively;
and setting the ratio i/n of the bucket serial number i to the number n of the data sub-buckets as the probability threshold value of the data sub-buckets corresponding to the bucket serial number i.
6. The method for fast calculating AUC indicator according to claim 5, wherein said step of dividing said sample data into data sub-buckets with corresponding bucket numbers according to said prediction probability comprises:
searching sample data corresponding to the prediction probability within the fluctuation range of the probability threshold of the data sub-bucket, wherein the fluctuation range is a preset up-down floating interval;
dividing the sample data into the data sub-buckets.
7. The method for rapid calculation of an AUC indicator according to claim 1, wherein said calculation rule includes the following formulas:

AUC = (Σ_i (L1_i × M0_i + 0.5 × L1_i × L0_i)) / (X × Y), (i > 0)

M0_i = Σ_{j<i} L0_j

wherein j is the bucket serial number, L0_j is the number of negative sample data in the data sub-bucket with bucket serial number j and bucket label 0, M0_i is the number of negative sample data contained in the data sub-buckets with bucket label 0 and bucket serial number less than i, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0.
8. An AUC indicator fast calculation apparatus, comprising:
an acquisition module, configured to acquire sample data and a prediction probability corresponding to each sample datum, wherein the sample data comprises positive sample data and negative sample data, and the prediction probability is the similarity probability between the sample data and the corresponding target data as identified by an identification model;
the first statistical module is used for counting the number X of the positive sample data and the number Y of the negative sample data;
the device comprises an establishing module, a storage module and a processing module, wherein the establishing module is used for establishing a plurality of data sub-buckets for positive sample data and negative sample data respectively and setting sub-bucket identifiers for each data sub-bucket, the sub-bucket identifiers comprise bucket serial numbers i and bucket labels, and the bucket labels comprise a bucket label 1 for storing the positive sample data and a bucket label 0 for storing the negative sample data;
the dividing module is used for dividing the sample data into data sub-buckets with corresponding bucket serial numbers according to the prediction probability;
a second statistical module, configured to count L1_i and L0_i, wherein L1_i is the number of positive sample data in the data sub-bucket with bucket serial number i and bucket label 1, and L0_i is the number of negative sample data in the data sub-bucket with bucket serial number i and bucket label 0;
a calculation module, configured to calculate the AUC index according to X, Y, L1_i, L0_i and a preset calculation rule.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the AUC indicator fast calculation method according to any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program executable by at least one processor to cause the at least one processor to perform the steps of the AUC indicator fast calculation method according to any one of claims 1-7.
CN201910604730.5A 2019-07-05 2019-07-05 AUC index rapid calculation method and device and computer equipment Pending CN112184279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910604730.5A CN112184279A (en) 2019-07-05 2019-07-05 AUC index rapid calculation method and device and computer equipment


Publications (1)

Publication Number Publication Date
CN112184279A true CN112184279A (en) 2021-01-05

Family

ID=73915980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910604730.5A Pending CN112184279A (en) 2019-07-05 2019-07-05 AUC index rapid calculation method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN112184279A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106648891A (en) * 2016-12-09 2017-05-10 中国联合网络通信集团有限公司 MapReduce model-based task execution method and apparatus
CN107045506A (en) * 2016-02-05 2017-08-15 阿里巴巴集团控股有限公司 Evaluation index acquisition methods and device
CN108090516A (en) * 2017-12-27 2018-05-29 第四范式(北京)技术有限公司 Automatically generate the method and system of the feature of machine learning sample
CN108460049A (en) * 2017-02-21 2018-08-28 阿里巴巴集团控股有限公司 A kind of method and system of determining information category



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105

RJ01 Rejection of invention patent application after publication