WO2019196534A1 - Human-machine recognition method and apparatus for verification codes - Google Patents

Human-machine recognition method and apparatus for verification codes

Info

Publication number
WO2019196534A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
data
machine learning
verification code
learning model
Prior art date
Application number
PCT/CN2019/072354
Other languages
English (en)
French (fr)
Inventor
梅鵾
卢肖
王明博
谭炎
Original Assignee
众安信息技术服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 众安信息技术服务有限公司 filed Critical 众安信息技术服务有限公司
Priority to US16/392,311 priority Critical patent/US20190311114A1/en
Publication of WO2019196534A1 publication Critical patent/WO2019196534A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/316User authentication by observing the pattern of computer usage, e.g. typical user behaviour
    • G06F21/36User authentication by graphic or iconic representation

Definitions

  • the present disclosure mainly relates to the technical field of machine learning, and more particularly to a human-computer recognition method and apparatus for a verification code.
  • Human-machine recognition is a public, fully automated Turing test used to determine whether a registrant is a normal user or an abnormal user, that is, to tell humans and computers apart.
  • An abnormal user, that is, a computer or a machine, can request logins by continuously accessing a website and simulating a normal user's input of a verification code in order to attack the website's service. For a large website defending against such attacks, it is therefore crucial to identify whether a login request is initiated by a normal user or an abnormal user.
  • CAPTCHA is an abbreviation for "Completely Automated Public Turing test to tell Computers and Humans Apart", a public, fully automatic program that distinguishes whether a user is a computer or a human. It can automatically prevent malicious users from using a specific program to make continuous login attempts to a website.
  • one method for identifying a registrant as a normal user or an abnormal user is to establish a user browsing behavior model, such as a Hidden Semi-Markov model (HsMM), that uses data obtained from server logs to monitor whether user access is normal.
  • This model is usually a statistical model with lower accuracy and slower recognition.
  • the present invention proposes a method for performing human-computer recognition using a machine learning model.
  • Machine learning is a kind of artificial intelligence. Its main purpose is to use past experience or data to derive rules from a large amount of data through an algorithm that allows the computer to "learn" automatically, so as to predict or reason about future data.
  • in a first aspect, an embodiment of the present invention provides a human-machine recognition method for a verification code, including: collecting real-time user data when a first user inputs a verification code; and predicting the real-time user data according to a machine learning model to determine the attribute of the first user, where the machine learning model is obtained by training on a sample data set comprising one or more sets of training sample data and a label set for each set of training sample data, the label indicating the attribute of a second user.
  • the training sample data includes at least one of: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data includes at least one of: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
  • the verification code is a slider verification code
  • the behavior data of the second user includes mouse movement trajectory data before and after the second user drags the slider verification code
  • the risk data of the second user includes the identity data and/or the credit data of the second user
  • the terminal information data of the second user includes at least one of user agent data, device fingerprint and IP address
  • the behavior data of the first user includes mouse movement trajectory data before and after the first user drags the slider verification code
  • the risk data of the first user includes identity data and/or credit data of the first user
  • the terminal information data of the first user includes at least one of user agent data, device fingerprint and IP address.
  • the attribute of the first user represents whether the first user is a normal user or an abnormal user.
  • the method of the first aspect further comprises: collecting a sample data set; and using the sample data set to train the machine learning model.
  • the method of the first aspect further comprises: adjusting the machine learning model with the real-time user data as new training sample data.
  • using a sample data set to train a machine learning model includes: characterizing each set of training sample data in one or more sets of training sample data to obtain one or more sets of samples Features; and determining parameters of the machine learning model by one or more sets of sample features and tags corresponding to each set of training sample data, respectively.
  • real-time user data is predicted according to a machine learning model, including: performing feature engineering on real-time user data to obtain real-time user features, and predicting real-time user features using a machine learning model.
  • the machine learning model is an XGBoost model.
  • in a second aspect, an embodiment of the present invention provides a human-machine recognition apparatus for a verification code, including: an acquisition module, configured to collect real-time user data when a first user inputs a verification code; and a prediction module, configured to predict the real-time user data according to a machine learning model to determine the attribute of the first user.
  • the machine learning model is obtained by training on the sample data set.
  • the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, the label indicating the attribute of a second user.
  • the training sample data includes at least one of: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data includes at least one of: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
  • the verification code is a slider verification code
  • the behavior data of the second user includes mouse movement trajectory data before and after the second user drags the slider verification code
  • the risk data of the second user includes the identity data and/or the credit data of the second user
  • the terminal information data of the second user includes at least one of user agent data, device fingerprint and IP address
  • the behavior data of the first user includes mouse movement trajectory data before and after the first user drags the slider verification code
  • the risk data of the first user includes identity data and/or credit data of the first user
  • the terminal information data of the first user includes at least one of user agent data, device fingerprint and IP address.
  • the attribute of the first user represents whether the first user is a normal user or an abnormal user.
  • the apparatus of the second aspect further comprises: a collection module for collecting a sample data set; and a training module for training the machine learning model using the sample data set.
  • the apparatus of the second aspect further comprises: an adjustment module for adjusting the machine learning model with the real-time user data as new training sample data.
  • the training module is configured to perform feature engineering on each set of training sample data in the one or more sets of training sample data to obtain one or more sets of sample features, and to determine the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
  • the prediction module is configured to perform feature engineering on the real-time user data to obtain real-time user features, and to predict on the real-time user features using the machine learning model.
  • the machine learning model is an XGBoost model.
  • in a third aspect, an embodiment of the present invention provides a computer device, including: a processor; and a storage device on which computer instructions are stored, where the computer instructions, when executed by the processor, cause the processor to perform the human-machine recognition method for a verification code described in the first aspect.
  • in a fourth aspect, an embodiment of the present invention provides a computer readable storage medium comprising computer instructions stored thereon, where the computer instructions, when executed by a processor, cause the processor to perform the human-machine recognition method for a verification code described in the first aspect.
  • the embodiment of the invention provides a human-machine recognition method and apparatus for a verification code.
  • by using the machine learning model obtained by training to predict the real-time user data in the verification code verification stage, it is possible to accurately identify whether the user is a normal user, and thus to intercept abnormal users.
  • conventional statistical models can handle only a small amount of data with a narrow range of data attributes, whereas in the embodiment of the present invention a larger amount of sample data can be processed when training the machine learning model, which increases the reliability and accuracy of predictions compared with the conventional method.
  • FIG. 1 is a schematic flowchart of a human-machine recognition method for a verification code according to an embodiment of the invention.
  • FIG. 2 is a schematic flowchart of a human-machine recognition method for a verification code according to another embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a method for training a machine learning model according to an embodiment of the invention.
  • FIG. 4 is a schematic flowchart of a method for predicting real-time user data according to an embodiment of the invention.
  • FIG. 5 is a schematic structural diagram of a human-machine recognition apparatus for a verification code according to an embodiment of the present invention.
  • FIG. 6 is a block diagram of a computer device for human-computer recognition of a verification code, according to an exemplary embodiment of the present invention.
  • the slider verification code is a kind of verification code, which is a type of verification code that requires the user to drag the slider to a certain position in the verification code verification stage to achieve the verification effect.
  • the verification code is the slider verification code
  • the invention provides a human-computer recognition method for verification code, and can then establish an accurate and robust user identification model in the verification code verification phase.
  • FIG. 1 is a schematic flowchart of a human-machine recognition method for a verification code according to an embodiment of the invention. As shown in FIG. 1, the method includes the following.
  • 120: predicting the real-time user data according to a machine learning model to determine an attribute of the first user, where the machine learning model is obtained by training on a sample data set, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, and the label indicates the attribute of the second user.
  • the first user may be a user whose verification code input is actually evaluated by the machine learning model
  • the second user may be a user corresponding to the sample data set.
  • a tag corresponding to each set of training sample data may be used to represent attributes of a second user that generated the set of training sample data.
  • the collected one or more sets of training sample data and the labels respectively corresponding to each set of training sample data are collectively referred to as a sample data set.
  • the embodiment of the invention provides a human-computer recognition method for verification code.
  • by using the machine learning model obtained by training to predict real-time user data in the verification code verification stage, it is possible to accurately identify whether the user is a normal user, and thus to intercept abnormal users.
  • conventional statistical models can handle only a small amount of data with a narrow range of data attributes, whereas in the embodiment of the present invention a larger amount of sample data can be processed when training the machine learning model, which increases the reliability and accuracy of predictions compared with the conventional method.
  • the machine learning model used in the embodiment of the present invention can run in parallel across multiple CPU threads, so that the speed of prediction can also be improved.
  • the attribute of the second user represents whether the second user is a normal user or an abnormal user.
  • a normal user means that the verification code is entered by a person, and an abnormal user means that the verification code is entered by a computer or the like.
  • the training sample data of the normal user can be used as a negative sample, and the label is set to 0; at the same time, the sample data of the abnormal user can be used as a positive sample, and the label is set to 1.
  • the attribute of the first user may also represent whether the first user is a normal user or an abnormal user.
  • the attribute of the first user can be determined, that is, whether the first user is a normal user or an abnormal user.
  • the attribute of the first user/the attribute of the second user may represent other meanings set according to the predicted goal.
  • the real-time user data comprises at least one of the following: behavior data of the first user, risk data of the first user, terminal information data of the first user.
  • the training sample data includes at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user.
  • the behavior data of the first user may include a motion track and/or a click behavior of the first user operating the mouse, and the like;
  • the risk data of the first user may include identity information and/or credit data of the first user, etc.;
  • the user's terminal information data may include at least one of User-agent data, device fingerprint, and client IP address.
  • the behavior data of the second user, the risk data of the second user, and the terminal information data of the second user are similar to those of the first user. To avoid repetition, details are not described herein again.
  • the risk data and the terminal information data of the potential abnormal user can be obtained through the data provider or some shared information systems.
  • the verification code is a slider verification code
  • the behavior data of the first user includes mouse movement trajectory data before and after the first user drags the slider verification code
  • the behavior data of the second user includes mouse movement trajectory data before and after the second user drags the slider verification code
  • the mouse movement trajectory data includes: an abscissa, an ordinate, a time stamp, and a number of retries for each movement of the mouse.
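  • as a minimal sketch, one trajectory record with the fields listed above could be represented as follows. The class names `MousePoint` and `TrajectorySample`, the field names, the units, and the choice to store the retry count once per sample are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MousePoint:
    """One recorded mouse movement (hypothetical structure)."""
    x: float          # abscissa, assumed to be in pixels
    y: float          # ordinate, assumed to be in pixels
    timestamp: float  # time of the movement, assumed to be in seconds


@dataclass
class TrajectorySample:
    """Mouse trajectory recorded around one slider drag."""
    points: List[MousePoint] = field(default_factory=list)
    retries: int = 0  # number of retries, stored per sample as an assumption


sample = TrajectorySample(
    points=[MousePoint(0, 0, 0.0), MousePoint(30, 2, 0.1), MousePoint(80, 1, 0.3)],
    retries=1,
)
```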
  • the verification code may also be other forms of verification code, such as a text or picture verification code
  • the training sample data may also be other data, such as the second user's identity information, credit data, and the like.
  • the method further comprises: collecting a sample data set; and using the sample data set to train the machine learning model.
  • each set of training sample data refers to all relevant data obtained by the computer when logging in for each second user.
  • when the model is constructed, the mouse movement trajectory data of one or more sets of normal users and/or abnormal users before and after dragging the slider verification code, and the terminal information data of the second user, may be collected by the log server.
  • the normal user and/or the abnormal user can be simulated to log in the website and drag the slider verification code, so that the computer obtains the mouse movement track data.
  • using a sample data set to train a machine learning model includes: performing feature engineering on each set of training sample data of one or more sets of training sample data to obtain one or more sets of sample features; And determining parameters of the machine learning model by one or more sets of sample features and tags corresponding to each set of training sample data, respectively.
  • feature engineering here refers to extracting as many features as possible from the collected raw data, to obtain a more comprehensive, fuller and multi-faceted representation of the original data for use by the model.
  • the feature engineering may include selecting features that correlate strongly with the target, reducing or increasing the dimensionality of the data, and converting the original data into numerical form.
  • the steps of feature engineering may also be omitted.
  • mouse movement trajectory data of one or more sets of normal users and/or abnormal users before and after dragging the slider verification code and terminal information data of the second user are collected by the log server.
  • from the collected mouse movement trajectory data (the abscissa, ordinate, time stamp and number of retries of each mouse movement), the following features are extracted: the time elapsed by the mouse movement; for lateral movement, the distance moved, maximum distance, average speed, maximum speed and speed variance; for longitudinal movement, the distance moved, maximum distance, average speed, maximum speed and speed variance; the number of sliding attempts; and the time interval before sliding starts.
  • from the collected terminal information data, the following features are extracted: user agent data, device fingerprint data, and IP address.
  • the user agent data may include: browser-related attributes such as an operating system and version, a CPU type, a browser and a version, a browser language, and a browser plug-in.
  • the device fingerprint data may include feature information that identifies the device, such as a hardware ID of the device, an IMEI of the mobile phone, a MAC address of the network card, and font settings.
  • the terminal information data is collected in addition to the behavior data of the second user, which improves the prediction accuracy of the machine learning model for the risk terminal.
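  • the trajectory features listed above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the function names, the `(x, y, t)` tuple layout, and the reading of "maximum distance" as the largest displacement from the starting point are all hypothetical.

```python
from statistics import mean, pvariance


def axis_features(coords, times):
    """Distance and speed statistics along one axis (lateral or longitudinal)."""
    steps = [abs(b - a) for a, b in zip(coords, coords[1:])]
    dts = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    speeds = [s / dt for s, dt in zip(steps, dts) if dt > 0]
    return {
        "distance": sum(steps),  # total distance moved along the axis
        # Assumption: "maximum distance" = farthest displacement from the start.
        "max_distance": max(abs(c - coords[0]) for c in coords),
        "avg_speed": mean(speeds) if speeds else 0.0,
        "max_speed": max(speeds) if speeds else 0.0,
        "speed_var": pvariance(speeds) if speeds else 0.0,
    }


def trajectory_features(points, attempts, drag_start_time):
    """points: list of (x, y, t) tuples ordered by time t."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    ts = [p[2] for p in points]
    feats = {
        "elapsed": ts[-1] - ts[0],                      # time elapsed by the movement
        "attempts": attempts,                           # number of sliding attempts
        "pre_slide_interval": drag_start_time - ts[0],  # wait before sliding starts
    }
    for axis, coords in (("x", xs), ("y", ys)):
        for name, value in axis_features(coords, ts).items():
            feats[f"{axis}_{name}"] = value
    return feats


# Toy trajectory, invented for illustration only.
points = [(0, 0, 0.0), (10, 1, 0.5), (40, 0, 1.0), (100, 0, 2.0)]
features = trajectory_features(points, attempts=1, drag_start_time=0.5)
```

  • the resulting dictionary has one numeric entry per feature, so it can be flattened into the n-dimensional input vector that the model consumes.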
  • the feature-engineered sample data, i.e., the one or more sets of sample features, is used together with the label corresponding to each set of training sample data (in one embodiment, the label is "0" or "1") to determine the parameters of the machine learning model.
  • the machine learning model used is XGBoost (eXtreme Gradient Boosting), a tree-based ensemble learning model. In the standard XGBoost formulation, the model can be written as:

    $$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

  • $K$ represents the number of trees to learn, $x_i$ is the input, $\hat{y}_i$ is the model's prediction for $x_i$, $F$ is the hypothesis space, and $f(x)$ is a Classification and Regression Tree (CART):

    $$f(x) = w_{q(x)}$$

  • $q(x)$ denotes the leaf node to which the sample $x$ is assigned, $w$ is the score of the leaf node, and $w_{q(x)}$ represents the predicted value of the regression tree for the sample.
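  • the symbols defined above (K trees, a leaf assignment q(x) and leaf scores w) can be illustrated with a small sketch. The two hand-made trees, their split points and the logistic link in `predict_proba` are assumptions chosen for illustration; the disclosure does not specify them.

```python
import math


# Each tree is a pair (q, w): q maps a feature vector to a leaf index,
# and w holds the score of every leaf. Both trees are hand-made toys.
def q1(x):
    return 0 if x[0] < 0.5 else 1  # split on feature 0


def q2(x):
    return 0 if x[1] < 2.0 else 1  # split on feature 1


trees = [(q1, [-1.0, 0.8]), (q2, [-0.4, 1.2])]


def predict_raw(x, trees):
    """Sum of the leaf scores w[q(x)] over the K trees."""
    return sum(w[q(x)] for q, w in trees)


def predict_proba(x, trees):
    """Assumed logistic link: turns the raw score into a probability of label 1."""
    return 1.0 / (1.0 + math.exp(-predict_raw(x, trees)))


score = predict_raw([0.9, 3.0], trees)  # leaf scores 0.8 and 1.2 are summed
```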
  • the model combines the prediction results of each of the K regression trees iteratively to obtain the final prediction result. In addition, the input samples for each regression tree depend on the training and prediction of the previous regression tree.
  • one or more sets of training sample data are separately feature engineered as described above to obtain one or more sets of sample features.
  • the one or more sets of sample features are taken as $x_i$ in the data set D, and the label corresponding to each set of training sample data is taken as $y_i$ in D, to learn the parameters of the K regression trees in the XGBoost model, that is, to determine the mapping between the input $x_i$ of each regression tree and its output $\hat{y}_i$, where $x_i$ can be an n-dimensional vector or array.
  • the prediction $\hat{y}_i$ that the model produces for known training sample data $x_i$ is compared with the actual label $y_i$ of that training sample data, and the model parameters are adjusted continuously until the expected accuracy is reached; the model parameters are then fixed, which establishes the prediction model.
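  • the idea of fitting each new tree to what the previous trees still get wrong can be sketched with a deliberately simplified stand-in: plain squared-error gradient boosting over one-split trees (stumps). This is not XGBoost's regularized objective or its split-finding algorithm, and all function names, the learning rate and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(xs[0])):
        for threshold in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] < threshold]
            right = [r for x, r in zip(xs, residuals) if x[j] >= threshold]
            if not left or not right:
                continue
            wl, wr = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, wl, wr)
    return best[1:]  # (feature index, threshold, left score, right score)


def stump_predict(stump, x):
    j, threshold, wl, wr = stump
    return wl if x[j] < threshold else wr


def fit_boosted(xs, ys, n_trees=5, learning_rate=0.5):
    """Each new stump is fitted to the residual left by the previous ones."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return stumps


def predict(stumps, x, learning_rate=0.5):
    return sum(learning_rate * stump_predict(s, x) for s in stumps)


# Toy features [elapsed seconds, speed variance]; label 1 = abnormal user.
X = [[0.2, 0.0], [0.3, 0.1], [1.8, 5.0], [2.5, 7.5]]
y = [1, 1, 0, 0]
model = fit_boosted(X, y)
labels = [1 if predict(model, x) >= 0.5 else 0 for x in X]
```

  • in practice a library implementation of XGBoost would replace this loop; the sketch only shows why the prediction is compared against the label at each round.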
  • tree-based boost models other than the XGboost model may be used, or other types of machine learning models, such as random forest models, may be used.
  • after the model has been built from the training sample data and its corresponding labels, the generated model is saved.
  • the model can be used to predict real-time users, ie, 110 and 120 can be executed.
  • the data of the first user is captured by data collection code deployed on the login page of the website as tracking instrumentation.
  • the verification code is a slider verification code
  • the mouse movement trajectory data of the drag slider verification code and the terminal information data of the user are collected for each user who is performing the login operation. The types of these data are the same as the training sample data described above, and therefore will not be described again here.
  • the collected real-time user data is predicted using the trained machine learning model to determine the attribute of the first user.
  • 120 may include performing feature engineering on real-time user data; using a previously trained machine learning model to predict the first user to determine attributes of the first user.
  • the method of feature engineering design and the obtained feature type are similar to the method and type of feature engineering design of the training sample data described above, and thus will not be described herein.
  • the following model function, the sum of the outputs of the K regression trees $f_k$, is used to determine the attribute of the first user:

    $$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

  • the parameters of the model function have been determined in the above steps, so taking the feature-engineered real-time user data as the input $x_i$ yields the prediction result for that input.
  • the input x i can be an n-dimensional vector or an array.
  • if the prediction result is "1", the current login operation comes from an abnormal user, that is, a machine or computer program is logging in, and the login is blocked; if the prediction result is "0", the current login operation comes from a normal user, and the login is allowed. Specifically, the prediction result may be fed back to the web front-end server, which then intercepts the abnormal user.
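  • the mapping from prediction result to login decision can be sketched as follows. The function names and the 0.5 threshold used to turn a probability into a label are assumptions; the text only states what the results "1" and "0" mean.

```python
def decide(prediction: int) -> str:
    """Map the model's predicted label to a login decision."""
    if prediction == 1:  # abnormal user: a machine or computer program
        return "block"
    if prediction == 0:  # normal user: a person
        return "allow"
    raise ValueError(f"unexpected prediction: {prediction!r}")


def label_from_probability(p: float, threshold: float = 0.5) -> int:
    """Hypothetical helper: threshold a probability into a 0/1 label."""
    return 1 if p >= threshold else 0


decision = decide(label_from_probability(0.93))
```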
  • the method further comprises: adjusting the machine learning model by using real-time user data as new training sample data.
  • the real-time user data is fed back to the machine learning model as new training sample data, and the model is trained to update the model parameters, thereby improving the prediction accuracy of the model.
  • the model is retrained and updated with a period of T+1, where T represents a natural day; that is, the relevant data of all users who log in on each natural day (T) is used as new training sample data on the following natural day (T+1) to retrain the model and adjust the model parameters.
  • the updated model can also be trained at any time interval, for example, the update can be trained in real time, the update can be trained hourly, and the like.
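  • a T+1 update schedule can be sketched by grouping logged records by day and, on each run day, selecting the previous day's records as the new training batch. The record layout `(day, features, label)`, the function names and the sample dates are illustrative assumptions, and the retraining call itself is left out.

```python
from collections import defaultdict
from datetime import date, timedelta


def batches_by_day(records):
    """Group (day, features, label) records by the natural day they were logged."""
    grouped = defaultdict(list)
    for day, features, label in records:
        grouped[day].append((features, label))
    return grouped


def training_batch_for(run_day, grouped):
    """On day T+1, return the data collected on day T (empty list if none)."""
    return grouped.get(run_day - timedelta(days=1), [])


# Toy records, invented for illustration only.
records = [
    (date(2019, 1, 1), [0.2, 0.0], 1),
    (date(2019, 1, 1), [2.1, 6.3], 0),
    (date(2019, 1, 2), [1.7, 4.9], 0),
]
grouped = batches_by_day(records)
batch = training_batch_for(date(2019, 1, 2), grouped)
```

  • an hourly or real-time schedule would only change the grouping key and the offset used in `training_batch_for`.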
  • the human-computer recognition method of the verification code provided by the embodiment of the present invention can establish an accurate and robust user identification model in the verification code verification stage, thereby quickly and accurately identifying the user type.
  • a prediction accuracy of 95% can be achieved.
  • FIG. 2 is a schematic flowchart of a human-machine recognition method for a verification code according to another embodiment of the present invention. As shown in FIG. 2, the method includes the following.
  • the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, the label indicating an attribute of the second user corresponding to the sample data set, i.e., whether the second user is a normal user or an abnormal user.
  • 240: predicting the real-time user data according to the machine learning model to determine the attribute of the first user.
  • 280 may be performed after 240, or may be performed after 260 and 270, which is not limited by the present invention.
  • 220 may further include the following.
  • the label setting process can be referred to the description in FIG. 1 , and details are not described herein again.
  • 221 may also be performed prior to 220.
  • 222 may be performed before 221 or after 221. After the machine learning model is established, the machine learning model is saved, and 230 and the steps after 230 can be performed.
  • 240 may further include the following.
  • the method of the feature engineering and the type of the obtained feature, and the process of determining the attribute of the first user may be referred to the description in FIG. 1 and will not be further described herein.
  • FIG. 5 is a schematic structural diagram of a human-machine recognition apparatus 500 for a verification code according to an embodiment of the present invention.
  • the apparatus 500 includes: an acquisition module 510, configured to collect real-time user data when a first user inputs a verification code; and a prediction module 520, configured to predict the real-time user data according to a machine learning model to determine the attribute of the first user, where the machine learning model is obtained by training on a sample data set comprising one or more sets of training sample data and a label set for each set of training sample data, the label indicating the attribute of the second user.
  • the embodiment of the invention provides a human-machine recognition apparatus for a verification code.
  • by using the machine learning model obtained by training to predict the real-time user data in the verification code verification stage, it is possible to accurately identify whether the user is a normal user, and thus to intercept abnormal users.
  • conventional statistical models can handle only a small amount of data with a narrow range of data attributes, whereas in the embodiment of the present invention a larger amount of sample data can be processed when training the machine learning model, which increases the reliability and accuracy of predictions compared with the conventional method.
  • the training sample data includes at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data includes at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
  • the verification code is a slider verification code
  • the behavior data of the second user includes mouse movement trajectory data before and after the second user drags the slider verification code
  • the risk data of the second user includes the identity data and/or the credit data of the second user
  • the terminal information data of the second user includes at least one of user agent data, device fingerprint and IP address
  • the behavior data of the first user includes mouse movement trajectory data before and after the first user drags the slider verification code
  • the risk data of the first user includes identity data and/or credit data of the first user
  • the terminal information data of the first user includes at least one of user agent data, device fingerprint and IP address.
  • the attribute of the first user represents whether the first user is a normal user or an abnormal user.
  • the apparatus 500 further includes: a collection module 530 for collecting a sample data set; and a training module 540 for training the machine learning model using the sample data set.
  • the apparatus 500 further includes an adjustment module 550 for adjusting the machine learning model by using real-time user data as new training sample data.
  • the training module 540 is configured to perform feature engineering on each set of training sample data of the one or more sets of training sample data to obtain one or more sets of sample features, and to determine the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
  • the prediction module 520 is configured to perform feature engineering on real-time user data to obtain real-time user features, and to predict real-time user features using a machine learning model.
  • the machine learning model is an XGBoost model.
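  • the division into an acquisition module and a prediction module can be sketched as two small classes. The class and method names, the dictionary-based data layout, and the stand-in feature function and model are all hypothetical; a trained model would be plugged in where `toy_model` appears.

```python
from typing import Callable, Dict, List


class AcquisitionModule:
    """Collects real-time user data when a verification code is entered."""

    def collect(self, raw_event: Dict) -> Dict:
        return dict(raw_event)  # placeholder: copy the raw event unchanged


class PredictionModule:
    """Feature-engineers the data and applies a previously trained model."""

    def __init__(self, featurize: Callable[[Dict], List[float]],
                 model: Callable[[List[float]], int]):
        self.featurize = featurize
        self.model = model

    def predict(self, user_data: Dict) -> int:
        return self.model(self.featurize(user_data))


# Stand-ins for feature engineering and a trained model (both hypothetical).
def featurize(data: Dict) -> List[float]:
    return [data["elapsed"], data["speed_var"]]


def toy_model(x: List[float]) -> int:
    return 1 if x[1] == 0.0 else 0  # toy rule: zero speed variance looks scripted


acquisition = AcquisitionModule()
prediction = PredictionModule(featurize, toy_model)
attribute = prediction.predict(acquisition.collect({"elapsed": 0.2, "speed_var": 0.0}))
```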
  • FIG. 6 is a block diagram of a computer device 600 for human-computer identification of a verification code, in accordance with an exemplary embodiment of the present invention.
  • apparatus 600 includes a processing component 610 that further includes one or more processors, and memory resources represented by memory 620 for storing instructions executable by processing component 610, such as an application.
  • An application stored in memory 620 can include one or more modules each corresponding to a set of instructions.
  • the processing component 610 is configured to execute instructions to perform the human-machine recognition method for a verification code described above.
  • Apparatus 600 can also include a power supply component configured to perform power management of apparatus 600, a wired or wireless network interface configured to connect apparatus 600 to the network, and an input/output (I/O) interface.
  • Apparatus 600 can operate based on an operating system stored in memory 620, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
  • a non-transitory computer readable storage medium is also provided; when the instructions in the storage medium are executed by the processor of the apparatus 600, the apparatus 600 is enabled to perform a human-machine recognition method for a verification code, comprising: collecting real-time user data when the first user inputs a verification code; and predicting the real-time user data according to a machine learning model to determine the attribute of the first user, where the machine learning model is obtained by training on a sample data set comprising one or more sets of training sample data and a label set for each set of training sample data, the label indicating the attribute of the second user.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be another division manner for example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • the technical solution of the present invention which is essential or contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium, including
  • the instructions are used to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like, and can store a program check code. Medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • User Interface Of Digital Computer (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

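The present invention discloses a human-machine identification method and apparatus for a verification code. The method includes: collecting real-time user data while a first user inputs a verification code; and making a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, where the machine learning model is obtained by training on a sample data set, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.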

Description

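Human-machine identification method and apparatus for a verification code
This application claims priority to Chinese Application No. CN201810309762.8, filed on April 9, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure mainly relates to the technical field of machine learning, and more particularly to a human-machine identification method and apparatus for a verification code.
Background of the Invention
Human-machine identification is used to identify whether the party logging in is a normal user or an abnormal user; it is a secure, automated public Turing test that tells computers and humans apart. An abnormal user, that is, a computer or machine, can attack a website's services by repeatedly accessing the website to request login while imitating a normal user's input of the verification code. It has therefore become critical for large websites to defend against such attacks by identifying whether a login request is initiated by a normal user or an abnormal user.
A verification code (CAPTCHA) is short for "Completely Automated Public Turing test to tell Computers and Humans Apart". It is a public, fully automated program that distinguishes whether a user is a computer or a normal user, and can thereby automatically prevent malicious users from using dedicated programs to make repeated login attempts on a website.
One current method of identifying whether the party logging in is a normal user or an abnormal user builds a user browsing behavior model, such as a Hidden semi-Markov Model (HsMM), from data obtained from server logs, in order to monitor whether user access is normal. Such models are usually statistical models, with relatively low accuracy and slow identification speed.
Therefore, a technical problem that urgently needs to be solved by those skilled in the art is how to build an accurate and robust user identification model that accurately and quickly identifies whether a user undergoing login verification is a normal user or an abnormal user.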
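Summary of the Invention
In view of the technical problem mentioned above, namely that the prior art lacks an accurate and robust model for identifying whether a user is a normal user or an abnormal user, the present invention proposes a method of human-machine identification using a machine learning model. Machine learning is a branch of artificial intelligence. Its main purpose is to use past experience or data, through algorithms that let a computer "learn" automatically, to derive regularities from large amounts of data and thereby make predictions or inferences about future data.
In a first aspect, an embodiment of the present invention provides a human-machine identification method for a verification code, including: collecting real-time user data while a first user inputs a verification code; and making a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, where the machine learning model is obtained by training on a sample data set, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.
In some embodiments of the present invention, the training sample data includes at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data includes at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
In some embodiments of the present invention, the verification code is a slider verification code; the behavior data of the second user includes mouse movement trajectory data of the second user before and after dragging the slider verification code; the risk data of the second user includes identity data and/or credit data of the second user; the terminal information data of the second user includes at least one of user agent data, a device fingerprint, and an IP address; the behavior data of the first user includes mouse movement trajectory data of the first user before and after dragging the slider verification code; the risk data of the first user includes identity data and/or credit data of the first user; and the terminal information data of the first user includes at least one of user agent data, a device fingerprint, and an IP address.
In some embodiments of the present invention, the attribute of the first user represents whether the first user is a normal user or an abnormal user.
In some embodiments of the present invention, the method of the first aspect further includes: collecting the sample data set; and training the machine learning model using the sample data set.
In some embodiments of the present invention, the method of the first aspect further includes: using the real-time user data as new training sample data to adjust the machine learning model.
In some embodiments of the present invention, training the machine learning model using the sample data set includes: performing feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features; and determining the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
In some embodiments of the present invention, making a prediction on the real-time user data according to the machine learning model includes: performing feature engineering on the real-time user data to obtain real-time user features, and making a prediction on the real-time user features using the machine learning model.
In some embodiments of the present invention, the machine learning model is an XGBoost model.
In a second aspect, an embodiment of the present invention provides a human-machine identification apparatus for a verification code, including: a collection module configured to collect real-time user data while a first user inputs a verification code; and a prediction module configured to make a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, where the machine learning model is obtained by training on a sample data set, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.
In some embodiments of the present invention, the training sample data includes at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data includes at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
In some embodiments of the present invention, the verification code is a slider verification code; the behavior data of the second user includes mouse movement trajectory data of the second user before and after dragging the slider verification code; the risk data of the second user includes identity data and/or credit data of the second user; the terminal information data of the second user includes at least one of user agent data, a device fingerprint, and an IP address; the behavior data of the first user includes mouse movement trajectory data of the first user before and after dragging the slider verification code; the risk data of the first user includes identity data and/or credit data of the first user; and the terminal information data of the first user includes at least one of user agent data, a device fingerprint, and an IP address.
In some embodiments of the present invention, the attribute of the first user represents whether the first user is a normal user or an abnormal user.
In some embodiments of the present invention, the apparatus of the second aspect further includes: a gathering module configured to collect the sample data set; and a training module configured to train the machine learning model using the sample data set.
In some embodiments of the present invention, the apparatus of the second aspect further includes: an adjustment module configured to use the real-time user data as new training sample data to adjust the machine learning model.
In some embodiments of the present invention, the training module is configured to perform feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features, and to determine the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
In some embodiments of the present invention, the prediction module is configured to perform feature engineering on the real-time user data to obtain real-time user features, and to make a prediction on the real-time user features using the machine learning model.
In some embodiments of the present invention, the machine learning model is an XGBoost model.
In a third aspect, an embodiment of the present invention provides a computer device, including: a processor; and a storage device storing computer instructions that, when executed by the processor, cause the processor to perform the human-machine identification method for a verification code according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform the human-machine identification method for a verification code according to the first aspect.
Embodiments of the present invention provide a human-machine identification method and apparatus for a verification code. By using a trained machine learning model to make predictions on real-time user data from the verification stage, it is possible to accurately identify whether a user is a normal user and thereby intercept abnormal users. Moreover, conventionally used statistical models can only handle smaller data volumes and a narrower range of data attributes, whereas in the embodiments of the present invention a larger amount of sample data can be processed when training the machine learning model, which increases the reliability and accuracy of prediction compared with conventional methods.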
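Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic flowchart of a human-machine identification method for a verification code according to an embodiment of the present invention.
FIG. 2 is a schematic flowchart of a human-machine identification method for a verification code according to another embodiment of the present invention.
FIG. 3 is a schematic flowchart of a method of training a machine learning model according to an embodiment of the present invention.
FIG. 4 is a schematic flowchart of a method of making a prediction on real-time user data according to an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of a human-machine identification apparatus for a verification code according to an embodiment of the present invention.
FIG. 6 is a block diagram of a computer device for human-machine identification of a verification code according to an exemplary embodiment of the present invention.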
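Modes for Carrying Out the Invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
A slider verification code is a type of verification code that, at the verification stage, requires the user to drag a slider to a certain position in order to pass verification. When the verification code is a slider verification code, there is still no good solution for effectively building an accurate and robust model that distinguishes normal users from abnormal users while the user drags the slider.
The present invention proposes a human-machine identification method for a verification code that makes it possible to build an accurate and robust user identification model at the verification stage.
FIG. 1 is a schematic flowchart of a human-machine identification method for a verification code according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following.
110: Collect real-time user data while a first user inputs a verification code.
120: Make a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user. The machine learning model is obtained by training on a sample data set; the sample data set includes one or more sets of training sample data and a label set for each set of training sample data; and the label indicates an attribute of a second user.
Specifically, the first user may be the user whose verification code input is actually being identified with the machine learning model, and the second user may be a user corresponding to the sample data set.
The label corresponding to each set of training sample data may be used to indicate the attribute of the second user who generated that set of training sample data. Here, the collected one or more sets of training sample data, together with the labels corresponding to each set, are referred to collectively as the sample data set.
This embodiment of the present invention provides a human-machine identification method for a verification code. By using a trained machine learning model to make predictions on real-time user data from the verification stage, it is possible to accurately identify whether a user is a normal user and thereby intercept abnormal users. Moreover, conventionally used statistical models can only handle smaller data volumes and a narrower range of data attributes, whereas in the embodiments of the present invention a larger amount of sample data can be processed when training the machine learning model, which increases the reliability and accuracy of prediction compared with conventional methods.
Further, the machine learning model used in the embodiments of the present invention can run in parallel on multiple CPU threads, and can therefore also increase prediction speed.
According to an embodiment of the present invention, the attribute of the second user represents whether the second user is a normal user or an abnormal user.
Specifically, a normal user may mean that the operator inputting the verification code is a human, and an abnormal user may mean that the operator is a machine such as a computer. In addition, the training sample data of normal users may be treated as negative samples with the label set to 0, and the sample data of abnormal users may be treated as positive samples with the label set to 1.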
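The labeling convention above can be sketched as follows. This is a minimal illustration only: the `LabeledSample` record and the `build_sample_set` helper are hypothetical names that do not appear in the original text, and only the 0/1 convention itself is taken from it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

NORMAL_USER = 0    # negative sample: the operator is a human
ABNORMAL_USER = 1  # positive sample: the operator is a machine


@dataclass
class LabeledSample:
    """One set of training sample data plus its label (hypothetical layout)."""
    data: Dict[str, Any]
    label: int


def build_sample_set(normal_groups: List[Dict[str, Any]],
                     abnormal_groups: List[Dict[str, Any]]) -> List[LabeledSample]:
    """Attach label 0 to normal-user data and label 1 to abnormal-user data."""
    samples = [LabeledSample(group, NORMAL_USER) for group in normal_groups]
    samples += [LabeledSample(group, ABNORMAL_USER) for group in abnormal_groups]
    return samples
```

Keeping the label next to the raw data makes it straightforward to pass both to a later feature engineering and training step.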
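Corresponding to the attribute of the second user, the attribute of the first user may likewise represent whether the first user is a normal user or an abnormal user. In this way, when the machine learning model obtained by training on the sample data set is used to identify the verification code input by the first user, the attribute of the first user can be determined, that is, whether the first user is a normal user or an abnormal user.
Of course, in other embodiments, the attribute of the first user or of the second user may carry other meanings set according to the prediction target.
According to an embodiment of the present invention, the real-time user data includes at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user. The training sample data includes at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user.
Specifically, the behavior data of the first user may include the movement trajectory and/or click behavior of the first user's mouse; the risk data of the first user may include the first user's identity information and/or credit data; and the terminal information data of the first user may include at least one of user agent (User-agent) data, a device fingerprint, and a client IP address. The behavior data, risk data, and terminal information data of the second user are similar to those of the first user and, to avoid repetition, are not described again here.
In this embodiment, the risk data and terminal information data of potentially abnormal users can be obtained from data providers or from shared information systems.
According to an embodiment of the present invention, the verification code is a slider verification code, the behavior data of the first user includes mouse movement trajectory data of the first user before and after dragging the slider verification code, and the behavior data of the second user includes mouse movement trajectory data of the second user before and after dragging the slider verification code.
Specifically, the mouse movement trajectory data includes, for each mouse movement, the horizontal coordinate, the vertical coordinate, and the timestamp, as well as the number of retries.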
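Of course, in other embodiments, the verification code may take other forms, such as a text or image verification code, and the training sample data may be other data, such as risk data including the second user's identity information and credit data.
According to an embodiment of the present invention, the method further includes: collecting the sample data set; and training the machine learning model using the sample data set.
Specifically, each set of training sample data refers to all the relevant data the computer obtains when a given second user logs in. When building the machine learning model, a log server can be used to collect one or more sets of mouse movement trajectory data of normal and/or abnormal users before and after dragging the slider verification code, together with the terminal information data of the second users. The model builder can simulate normal and/or abnormal users logging in to the website and dragging the slider verification code, so that the computer obtains mouse movement trajectory data.
According to an embodiment of the present invention, training the machine learning model using the sample data set includes: performing feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features; and determining the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
Specifically, data is the most important basis of machine learning. Feature engineering means extracting features from the collected raw data to the greatest extent possible, obtaining a more comprehensive, fuller, and multi-faceted representation of the raw data for the model to use. Feature engineering may include data processing such as selecting features that are highly correlated with the target, reducing or increasing the dimensionality of the data, and performing numerical computation on the raw data. Of course, in other embodiments the feature engineering step may be omitted.
In one embodiment, as described above, a log server collects one or more sets of mouse movement trajectory data of normal and/or abnormal users before and after dragging the slider verification code, together with the terminal information data of the second users. From the collected mouse movement trajectory data, such as the horizontal coordinate, vertical coordinate, and timestamp of each mouse movement and the number of retries, the following features are computed and extracted: the time elapsed during mouse movement; the distance, maximum distance, mean speed, maximum speed, and speed variance of horizontal movement; the distance, maximum distance, mean speed, maximum speed, and speed variance of vertical movement; the number of slide attempts; and the time interval before sliding starts. From the collected terminal information data, the following features are computed and extracted: user agent data, device fingerprint data, and IP address. Here, the user agent data may include browser-related attributes such as the operating system and version, CPU type, browser and version, browser language, and browser plug-ins. The device fingerprint data may include characteristic information identifying the device, such as the device's hardware ID, a mobile phone's IMEI, the network card's MAC address, and font settings. In this embodiment, terminal information data is collected in addition to the behavior data of the second user, which improves the machine learning model's prediction accuracy for risky terminals.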
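The trajectory features listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the patent's implementation: the `MouseSample` record, the function names, and the feature names are hypothetical, timestamps are assumed to be in seconds, and "maximum distance" is interpreted here as the farthest displacement from the starting point, which the original text does not define.

```python
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import Dict, List


@dataclass(frozen=True)
class MouseSample:
    """One mouse-move event (hypothetical record layout)."""
    x: float  # horizontal coordinate, in pixels
    y: float  # vertical coordinate, in pixels
    t: float  # timestamp, in seconds


def axis_features(coords: List[float], times: List[float], prefix: str) -> Dict[str, float]:
    """Distance and speed statistics along a single axis."""
    steps = [abs(b - a) for a, b in zip(coords, coords[1:])]
    intervals = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    # Ignore zero or negative intervals so that every speed is well defined.
    speeds = [s / dt for s, dt in zip(steps, intervals) if dt > 0]
    return {
        f"{prefix}_distance": sum(steps),
        # Assumption: "maximum distance" means the farthest point from the start.
        f"{prefix}_max_distance": max(abs(c - coords[0]) for c in coords),
        f"{prefix}_mean_speed": mean(speeds) if speeds else 0.0,
        f"{prefix}_max_speed": max(speeds, default=0.0),
        f"{prefix}_speed_variance": pvariance(speeds) if speeds else 0.0,
    }


def extract_features(samples: List[MouseSample], retry_count: int,
                     delay_before_drag: float) -> Dict[str, float]:
    """Turn one slider attempt (at least one sample) into a flat feature dict."""
    times = [s.t for s in samples]
    features = {
        "duration": times[-1] - times[0],
        "retry_count": float(retry_count),
        "delay_before_drag": delay_before_drag,
    }
    features.update(axis_features([s.x for s in samples], times, "x"))
    features.update(axis_features([s.y for s in samples], times, "y"))
    return features
```

Terminal information features such as user agent fields or a device fingerprint would be encoded separately, since they are categorical rather than numeric.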
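In this embodiment, the featurized sample data, that is, the one or more sets of sample features, together with the labels corresponding to each set of training sample data (in one embodiment, the label is "0" or "1"), are used to determine the parameters of the machine learning model.
According to an embodiment of the present invention, the machine learning model used is the tree-based ensemble learning model XGBoost (eXtreme Gradient Boosting). In this embodiment, for a given data set $D = \{(x_i, y_i)\}$, the XGBoost model function has the following form:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
In the above formula, $K$ denotes the number of trees to be learned, $x_i$ is the input, and $\hat{y}_i$ denotes the prediction result. $F$ is the hypothesis space, and $f(x)$ is a classification and regression tree (CART):
$$F = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T)$$
Here, $q(x)$ indicates the leaf node to which sample $x$ is assigned, and $w$ is the score of the leaf node, so $w_{q(x)}$ is the regression tree's predicted value for the sample. As the XGBoost model function above shows, the model iterates over the prediction of each of the $K$ regression trees to obtain the final prediction $\hat{y}_i$, and the input samples of each regression tree are related to the training and prediction of the preceding regression trees.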
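The additive form of the model function, in which each tree maps a sample to a leaf via q(x) and contributes that leaf's score w, can be illustrated in plain Python. This is a sketch only and does not use the XGBoost library. The `RegressionTree` class, the two hand-written stumps, and their leaf scores are made up for illustration, and the sigmoid-plus-threshold step in `to_label` is an assumption about how a raw score becomes a 0/1 result; the original text only states that the result is presented as "0" or "1".

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RegressionTree:
    """A tree f(x) = w[q(x)]: q picks a leaf, w holds the leaf scores."""
    q: Callable[[Sequence[float]], int]
    w: List[float]

    def predict(self, x: Sequence[float]) -> float:
        return self.w[self.q(x)]


def ensemble_score(trees: List[RegressionTree], x: Sequence[float]) -> float:
    """Sum the K per-tree predictions: y_hat = sum_k f_k(x)."""
    return sum(tree.predict(x) for tree in trees)


def to_label(score: float, threshold: float = 0.5) -> int:
    """Assumption: squash the raw score with a sigmoid, then threshold it."""
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= threshold else 0


# Two hand-written stumps over x = [duration_seconds, x_speed_variance].
# The split points and leaf scores are illustrative, not learned values.
trees = [
    RegressionTree(q=lambda x: 0 if x[0] < 0.3 else 1, w=[1.2, -0.8]),
    RegressionTree(q=lambda x: 0 if x[1] < 1.0 else 1, w=[0.9, -0.5]),
]
```

With these stumps, a very fast and very uniform drag such as `[0.1, 0.2]` lands in the positive-scoring leaves of both trees, while a slower, more irregular drag such as `[2.0, 50.0]` lands in the negative-scoring ones.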
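In one embodiment, as described above, feature engineering is performed on each of the one or more sets of training sample data to obtain one or more sets of sample features. Next, the one or more sets of sample features are taken as $x_i$ in the data set $D$, and the labels corresponding to each set of training sample data are taken as $y_i$, in order to learn the parameters of the $K$ regression trees in the XGBoost model; that is, to determine the mapping between each regression tree's input $x_i$ and its output $\hat{y}_i$, where $x_i$ may be an n-dimensional vector or array. In other words, the known training sample data $x_i$ is input, the model's prediction $\hat{y}_i$ is compared with the label $y_i$ actually mapped to the training sample data, and the model parameters are adjusted repeatedly until the expected accuracy is reached. The model parameters are then fixed, thereby establishing the prediction model.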
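The compare-and-adjust loop described above can be illustrated with a deliberately simplified boosting sketch. It is not XGBoost itself: it uses depth-1 trees (stumps), a plain squared-error criterion, and no regularization or second-order gradient information. All function names, the learning rate, and the 0.5 decision threshold are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# (feature index, threshold, value if x <= threshold, value otherwise)
Stump = Tuple[int, float, float, float]


def stump_predict(stump: Stump, x: Sequence[float]) -> float:
    feature, threshold, left_value, right_value = stump
    return left_value if x[feature] <= threshold else right_value


def fit_stump(X: List[Sequence[float]], residuals: List[float]) -> Stump:
    """Choose the single split that minimises squared error on the residuals."""
    best, best_error = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            left_value = sum(left) / len(left)  # never empty: threshold is an observed value
            right_value = sum(right) / len(right) if right else left_value
            error = (sum((r - left_value) ** 2 for r in left)
                     + sum((r - right_value) ** 2 for r in right))
            if error < best_error:
                best, best_error = (feature, threshold, left_value, right_value), error
    return best


def fit_boosted(X, y, n_trees: int = 10, learning_rate: float = 0.5) -> List[Stump]:
    """Each new stump is fitted to what the previous stumps still get wrong."""
    trees: List[Stump] = []
    predictions = [0.0] * len(X)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        feature, threshold, lv, rv = fit_stump(X, residuals)
        shrunk = (feature, threshold, learning_rate * lv, learning_rate * rv)
        trees.append(shrunk)
        predictions = [p + stump_predict(shrunk, x) for p, x in zip(predictions, X)]
    return trees


def predict_label(trees: List[Stump], x: Sequence[float]) -> int:
    """Sum the stump outputs and threshold the score at 0.5."""
    score = sum(stump_predict(tree, x) for tree in trees)
    return 1 if score >= 0.5 else 0
```

A toy call might look like `fit_boosted([[0.2, 900.0], [0.3, 850.0], [1.8, 40.0], [2.5, 25.0]], [1, 1, 0, 0])`, where each row is a hypothetical `[duration, mean speed]` pair. A production system would more likely rely on an existing gradient boosting library than on hand-written code like this.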
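In other embodiments, tree-based boosting models other than XGBoost may be used, or other types of machine learning models may be used, such as a random forest model.
Once the model has been built from the training sample data and its corresponding labels, the generated model is saved.
After the machine learning model has been trained, it can be used to make predictions for real-time users, that is, 110 and 120 can be performed. In 110, data collection code deployed on the website's login page instruments the page (event tracking) to capture the behavior data of the first user. In one embodiment, the verification code is a slider verification code, and for each user performing a login operation, the mouse movement trajectory data from dragging the slider verification code and the user's terminal information data are collected. The types of this data are the same as those of the training sample data described above and are therefore not repeated here. Next, in 120, the trained machine learning model is used to make a prediction on the collected real-time user data to determine the attribute of the first user.
In one embodiment, 120 may include: performing feature engineering on the real-time user data; and using the previously trained machine learning model to make a prediction for the first user and determine the attribute of the first user.
Specifically, the feature engineering method and the feature types obtained are similar to those described above for the training sample data and are therefore not repeated here. In an embodiment in which the machine learning model is an XGBoost model, the attribute of the first user is determined using the following model function:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
The parameters of this model function have already been determined in the steps above. Therefore, taking the featurized real-time user data as the input $x_i$ yields the prediction $\hat{y}_i$ for that input, where the input $x_i$ may be an n-dimensional vector or array. In one embodiment, the prediction $\hat{y}_i$ is presented as "0" or "1". This is because the labels used when learning the model parameters were defined this way: "0" denotes a normal user and "1" denotes an abnormal user. Of course, other definitions of the result or label may be used, as long as normal and abnormal users can be distinguished, or results or labels representing other user attributes may be defined. After the attribute of the first user is determined, the prediction result can be output.
If the prediction result is "1", the party currently logging in is an abnormal user, that is, a machine or computer program is logging in, and the user is blocked from logging in. If the prediction result is "0", the party currently logging in is a normal user, and the user is allowed to log in. Specifically, the prediction result can be fed back to the web front-end server, thereby intercepting abnormal users.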
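The mapping from prediction result to login decision can be written down directly. The sketch below assumes the 0/1 convention stated above; the `LoginDecision` enum, the function names, and the shape of the feedback payload are hypothetical, since the original text does not describe the interface to the front-end server.

```python
from enum import Enum
from typing import Dict

NORMAL_USER = 0
ABNORMAL_USER = 1


class LoginDecision(Enum):
    ALLOW = "allow"
    BLOCK = "block"


def decide_login(prediction: int) -> LoginDecision:
    """Block when the model predicts 1 (abnormal), allow when it predicts 0."""
    if prediction == ABNORMAL_USER:
        return LoginDecision.BLOCK
    if prediction == NORMAL_USER:
        return LoginDecision.ALLOW
    raise ValueError(f"unexpected prediction: {prediction!r}")


def feedback_for_frontend(session_id: str, prediction: int) -> Dict[str, str]:
    """Hypothetical payload returned to the web front-end server."""
    return {"session_id": session_id, "decision": decide_login(prediction).value}
```

Rejecting any value other than 0 or 1 keeps an alternative label definition from being silently treated as "allow".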
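According to an embodiment of the present invention, the method further includes: using the real-time user data as new training sample data to adjust the machine learning model.
Specifically, the real-time user data is fed back to the machine learning model as new training sample data, and the model is retrained and updated to further adjust its parameters and thereby improve its prediction accuracy. In one embodiment, the model is retrained on a T+1 cycle, where T denotes a calendar day: all the login-related data of all users on a given calendar day (T) is used on the next calendar day (T+1) as new training sample data for update training, to adjust the model parameters. In other embodiments, the model may be retrained at any time interval; for example, it may be updated in real time, every hour, and so on.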
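The T+1 selection rule can be sketched with the standard `datetime` module. The record layout (a dict with a `collected_at` field holding a naive `datetime`) and both function names are assumptions for illustration; the original text does not describe how login data is stored.

```python
from datetime import date, datetime, time, timedelta
from typing import Any, Dict, List, Tuple


def previous_day_window(run_day: date) -> Tuple[datetime, datetime]:
    """Half-open interval [start, end) covering day T when run on day T+1."""
    start = datetime.combine(run_day - timedelta(days=1), time.min)
    return start, start + timedelta(days=1)


def select_new_samples(records: List[Dict[str, Any]], run_day: date) -> List[Dict[str, Any]]:
    """Keep the login records collected on the day before run_day."""
    start, end = previous_day_window(run_day)
    return [r for r in records if start <= r["collected_at"] < end]
```

Using a half-open interval avoids counting a record stamped exactly at midnight in two consecutive daily batches. An hourly schedule would follow the same pattern with a one-hour window.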
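The human-machine identification method for a verification code provided by the embodiments of the present invention makes it possible to build an accurate and robust user identification model at the verification stage, and thus to identify the user type quickly and accurately. In embodiments that use the XGBoost machine learning model, a prediction accuracy of 95% can be achieved.
FIG. 2 is a schematic flowchart of a human-machine identification method for a verification code according to another embodiment of the present invention. As shown in FIG. 2, the method includes the following.
210: Collect a sample data set.
Specifically, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data. The label indicates the attribute of the second user corresponding to the sample data set, that is, whether the second user is a normal user or an abnormal user.
220: Train the machine learning model using the sample data set.
230: Collect real-time user data while a first user inputs a verification code.
Specifically, for details of the real-time user data and the training sample data, see the description of FIG. 1 above; they are not repeated here.
240: Make a prediction on the real-time user data according to the machine learning model to determine the attribute of the first user.
250: Determine whether the attribute of the first user is that of a normal user. If it is, perform 260; if it is not, that is, the first user is an abnormal user, perform 270.
260: Allow the first user to log in.
270: Block the first user from logging in.
280: Use the real-time user data as new training sample data to adjust the machine learning model.
Specifically, 280 may be performed after 240, or after 260 and 270; the present invention does not limit this.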
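The ordering of steps 230 to 280 for a single login attempt can be sketched as one function with the model and the feature engineering step passed in as callables. Every name below is hypothetical. The sketch places 280 after the decision, which is one of the orderings the text allows, and it only buffers the raw data for a later update because the original text does not say how a label is assigned to new samples.

```python
from typing import Any, Callable, Dict, List


def handle_login_attempt(
    raw_data: Dict[str, Any],
    featurize: Callable[[Dict[str, Any]], List[float]],
    predict: Callable[[List[float]], int],
    retraining_buffer: List[Dict[str, Any]],
) -> bool:
    """Run steps 240-280 on data collected in 230; return True if login is allowed."""
    features = featurize(raw_data)       # 240: feature engineering
    prediction = predict(features)       # 240: model prediction (0 or 1)
    is_normal = prediction == 0          # 250: normal user?
    retraining_buffer.append(raw_data)   # 280: keep the data for a later model update
    return is_normal                     # 260 if True, 270 if False
```

Passing `featurize` and `predict` as parameters keeps the flow independent of any particular model, so the same function works whether the model is XGBoost or one of the alternatives mentioned in the description.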
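According to an embodiment of the present invention, as shown in FIG. 3, 220 may further include the following.
221: Set a corresponding label for each of the one or more sets of training sample data.
Specifically, for the process of setting labels, see the description of FIG. 1; it is not repeated here.
In one embodiment, 221 may also be performed before 220.
222: Perform feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features.
Specifically, for the process of obtaining sample features, see the description of FIG. 1; it is not repeated here.
223: Determine the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
Specifically, for the process of determining the model parameters, see the description of FIG. 1; it is not repeated here.
In this embodiment, 222 may be performed before or after 221. Once the machine learning model has been built, it is saved, and 230 and the subsequent steps can then be performed.
According to an embodiment of the present invention, as shown in FIG. 4, 240 may further include the following.
241: Perform feature engineering on the real-time user data.
242: Use the previously trained machine learning model to make a prediction for the first user and determine the attribute of the first user.
Specifically, for the feature engineering method, the feature types obtained, and the process of determining the attribute of the first user, see the description of FIG. 1; they are not repeated here.
FIG. 5 is a schematic structural diagram of a human-machine identification apparatus 500 for a verification code according to an embodiment of the present invention. As shown in FIG. 5, the apparatus 500 includes: a collection module 510 configured to collect real-time user data while a first user inputs a verification code; and a prediction module 520 configured to make a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, where the machine learning model is obtained by training on a sample data set, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.
This embodiment of the present invention provides a human-machine identification apparatus for a verification code. By using a trained machine learning model to make predictions on real-time user data from the verification stage, it is possible to accurately identify whether a user is a normal user and thereby intercept abnormal users. Moreover, conventionally used statistical models can only handle smaller data volumes and a narrower range of data attributes, whereas in the embodiments of the present invention a larger amount of sample data can be processed when training the machine learning model, which increases the reliability and accuracy of prediction compared with conventional methods.
According to an embodiment of the present invention, the training sample data includes at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data includes at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
According to an embodiment of the present invention, the verification code is a slider verification code; the behavior data of the second user includes mouse movement trajectory data of the second user before and after dragging the slider verification code; the risk data of the second user includes identity data and/or credit data of the second user; the terminal information data of the second user includes at least one of user agent data, a device fingerprint, and an IP address; the behavior data of the first user includes mouse movement trajectory data of the first user before and after dragging the slider verification code; the risk data of the first user includes identity data and/or credit data of the first user; and the terminal information data of the first user includes at least one of user agent data, a device fingerprint, and an IP address.
According to an embodiment of the present invention, the attribute of the first user represents whether the first user is a normal user or an abnormal user.
According to an embodiment of the present invention, the apparatus 500 further includes: a gathering module 530 configured to collect the sample data set; and a training module 540 configured to train the machine learning model using the sample data set.
According to an embodiment of the present invention, the apparatus 500 further includes: an adjustment module 550 configured to use the real-time user data as new training sample data to adjust the machine learning model.
According to an embodiment of the present invention, the training module 540 is configured to perform feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features, and to determine the parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
According to an embodiment of the present invention, the prediction module 520 is configured to perform feature engineering on the real-time user data to obtain real-time user features, and to make a prediction on the real-time user features using the machine learning model.
According to an embodiment of the present invention, the machine learning model is an XGBoost model.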
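FIG. 6 is a block diagram of a computer device 600 for human-machine identification of a verification code according to an exemplary embodiment of the present invention.
Referring to FIG. 6, the apparatus 600 includes a processing component 610, which further includes one or more processors, and memory resources, represented by a memory 620, for storing instructions executable by the processing component 610, such as an application. The application stored in the memory 620 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 610 is configured to execute the instructions to perform the human-machine identification method for a verification code described above.
The apparatus 600 may also include a power supply component configured to perform power management of the apparatus 600, a wired or wireless network interface configured to connect the apparatus 600 to a network, and an input/output (I/O) interface. The apparatus 600 can operate based on an operating system stored in the memory 620, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
A non-transitory computer-readable storage medium: when the instructions in the storage medium are executed by the processor of the apparatus 600, the apparatus 600 is enabled to perform a human-machine identification method for a verification code, including: collecting real-time user data while a first user inputs a verification code; and making a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, where the machine learning model is obtained by training on a sample data set, the sample data set includes one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and in an actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, apparatus, or unit, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program check code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.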

Claims (20)

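  1. A human-machine identification method for a verification code, comprising:
    collecting real-time user data while a first user inputs a verification code; and
    making a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, wherein the machine learning model is obtained by training on a sample data set, the sample data set comprises one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.
  2. The method according to claim 1, wherein the training sample data comprises at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data comprises at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
  3. The method according to claim 2, wherein the verification code is a slider verification code; the behavior data of the second user comprises mouse movement trajectory data of the second user before and after dragging the slider verification code; the risk data of the second user comprises identity data and/or credit data of the second user; the terminal information data of the second user comprises at least one of user agent data, a device fingerprint, and an IP address; the behavior data of the first user comprises mouse movement trajectory data of the first user before and after dragging the slider verification code; the risk data of the first user comprises identity data and/or credit data of the first user; and the terminal information data of the first user comprises at least one of user agent data, a device fingerprint, and an IP address.
  4. The method according to any one of claims 1 to 3, wherein the attribute of the first user represents whether the first user is a normal user or an abnormal user.
  5. The method according to any one of claims 1 to 4, further comprising:
    collecting the sample data set; and
    training the machine learning model using the sample data set.
  6. The method according to claim 5, further comprising:
    using the real-time user data as new training sample data to adjust the machine learning model.
  7. The method according to claim 5, wherein training the machine learning model using the sample data set comprises:
    performing feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features; and
    determining parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
  8. The method according to any one of claims 1 to 7, wherein making a prediction on the real-time user data according to the machine learning model comprises:
    performing feature engineering on the real-time user data to obtain real-time user features, and making a prediction on the real-time user features using the machine learning model.
  9. The method according to any one of claims 1 to 8, wherein the machine learning model is an XGBoost model.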
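  10. A human-machine identification apparatus for a verification code, comprising:
    a collection module configured to collect real-time user data while a first user inputs a verification code; and
    a prediction module configured to make a prediction on the real-time user data according to a machine learning model to determine an attribute of the first user, wherein the machine learning model is obtained by training on a sample data set, the sample data set comprises one or more sets of training sample data and a label set for each set of training sample data, and the label indicates an attribute of a second user.
  11. The apparatus according to claim 10, wherein the training sample data comprises at least one of the following: behavior data of the second user, risk data of the second user, and terminal information data of the second user; and the real-time user data comprises at least one of the following: behavior data of the first user, risk data of the first user, and terminal information data of the first user.
  12. The apparatus according to claim 11, wherein the verification code is a slider verification code; the behavior data of the second user comprises mouse movement trajectory data of the second user before and after dragging the slider verification code; the risk data of the second user comprises identity data and/or credit data of the second user; the terminal information data of the second user comprises at least one of user agent data, a device fingerprint, and an IP address; the behavior data of the first user comprises mouse movement trajectory data of the first user before and after dragging the slider verification code; the risk data of the first user comprises identity data and/or credit data of the first user; and the terminal information data of the first user comprises at least one of user agent data, a device fingerprint, and an IP address.
  13. The apparatus according to any one of claims 10 to 12, wherein the attribute of the first user represents whether the first user is a normal user or an abnormal user.
  14. The apparatus according to any one of claims 10 to 13, further comprising:
    a gathering module configured to collect the sample data set; and
    a training module configured to train the machine learning model using the sample data set.
  15. The apparatus according to claim 14, further comprising:
    an adjustment module configured to use the real-time user data as new training sample data to adjust the machine learning model.
  16. The apparatus according to claim 14, wherein the training module is configured to perform feature engineering on each of the one or more sets of training sample data to obtain one or more sets of sample features, and to determine parameters of the machine learning model from the one or more sets of sample features and the labels corresponding to each set of training sample data.
  17. The apparatus according to any one of claims 10 to 16, wherein the prediction module is configured to perform feature engineering on the real-time user data to obtain real-time user features, and to make a prediction on the real-time user features using the machine learning model.
  18. The apparatus according to any one of claims 10 to 17, wherein the machine learning model is an XGBoost model.
  19. A computer device, comprising:
    a processor; and
    a storage device storing computer instructions that, when executed by the processor, cause the processor to perform the human-machine identification method for a verification code according to any one of claims 1 to 9.
  20. A computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform the human-machine identification method for a verification code according to any one of claims 1 to 9.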
PCT/CN2019/072354 2018-04-09 2019-01-18 Human-machine identification method and apparatus for a verification code WO2019196534A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/392,311 US20190311114A1 (en) 2018-04-09 2019-04-23 Man-machine identification method and device for captcha

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810309762.8A CN108491714A (zh) 2018-04-09 2018-04-09 Human-machine identification method for a verification code
CN201810309762.8 2018-04-09

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/392,311 Continuation US20190311114A1 (en) 2018-04-09 2019-04-23 Man-machine identification method and device for captcha

Publications (1)

Publication Number Publication Date
WO2019196534A1 true WO2019196534A1 (zh) 2019-10-17

Family

ID=63315257

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/072354 WO2019196534A1 (zh) 2018-04-09 2019-01-18 Human-machine identification method and apparatus for a verification code

Country Status (2)

Country Link
CN (1) CN108491714A (zh)
WO (1) WO2019196534A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420276A (zh) * 2021-08-20 2021-09-21 北京顶象技术有限公司 Verification-code-based risk determination method, apparatus, electronic device, and storage medium

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
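CN108491714A (zh) * 2018-04-09 2018-09-04 众安信息技术服务有限公司 Human-machine identification method for a verification code
CN109255230A (zh) * 2018-09-29 2019-01-22 武汉极意网络科技有限公司 Method, system, user equipment, and storage medium for identifying abnormal verification behavior
CN109361660B (zh) * 2018-09-29 2021-09-03 武汉极意网络科技有限公司 Abnormal behavior analysis method, system, server, and storage medium
CN109409049A (zh) * 2018-10-10 2019-03-01 北京京东金融科技控股有限公司 Method and apparatus for identifying interactive operations
CN110059457B (zh) * 2018-11-05 2020-06-30 阿里巴巴集团控股有限公司 Identity verification method and apparatus
CN109902474B (zh) * 2019-03-01 2020-11-03 北京奇艺世纪科技有限公司 Method and apparatus for determining the movement trajectory of a moving object in a slider verification code
CN110046647A (zh) * 2019-03-08 2019-07-23 同盾控股有限公司 Method and apparatus for identifying machine behavior in verification codes
CN110490632A (zh) * 2019-07-01 2019-11-22 广州阿凡提电子科技有限公司 Potential customer identification method, electronic device, and storage medium
CN111062019A (zh) * 2019-12-13 2020-04-24 支付宝(杭州)信息技术有限公司 User attack detection method, apparatus, and electronic device
CN111783063A (zh) * 2020-06-12 2020-10-16 完美世界(北京)软件科技发展有限公司 Operation verification method and apparatus
CN111897435B (zh) * 2020-08-06 2022-08-02 陈涛 Human-machine identification method, identification system, MR smart glasses, and application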

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
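WO2015049055A1 (de) * 2013-10-04 2015-04-09 Giesecke & Devrient Gmbh Method for displaying an item of information
CN106155298A (zh) * 2015-04-21 2016-11-23 阿里巴巴集团控股有限公司 Human-machine identification method and apparatus, and method and apparatus for collecting behavioral feature data
CN107846412A (zh) * 2017-11-28 2018-03-27 五八有限公司 Verification code request processing method and apparatus, and verification code processing system
CN108491714A (zh) * 2018-04-09 2018-09-04 众安信息技术服务有限公司 Human-machine identification method for a verification code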

Also Published As

Publication number Publication date
CN108491714A (zh) 2018-09-04

Similar Documents

Publication Publication Date Title
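WO2019196534A1 (zh) Human-machine identification method and apparatus for a verification code
US20190311114A1 (en) Man-machine identification method and device for captcha
CN108009521B (zh) Face image matching method, apparatus, terminal, and storage medium
WO2019153604A1 (zh) Apparatus and method for establishing a human-machine identification model, and computer-readable storage medium
WO2019233421A1 (zh) Image processing method and apparatus, electronic device, and storage medium
US10938927B2 (en) Machine learning techniques for processing tag-based representations of sequential interaction events
JP7414901B2 (ja) Training method and apparatus for a liveness detection model, liveness detection method and apparatus, electronic device, storage medium, and computer program
CN107193962B (zh) Intelligent image-matching method and apparatus for Internet promotion information
WO2022105118A1 (zh) Image-based health status identification method, apparatus, device, and storage medium
CN109034069B (zh) Method and apparatus for generating information
EP3286679A1 (en) Method and system for identifying a human or machine
WO2020238353A1 (zh) Data processing method and apparatus, storage medium, and electronic apparatus
US10511681B2 (en) Establishing and utilizing behavioral data thresholds for deep learning and other models to identify users across digital space
WO2022148038A1 (zh) Information recommendation method and apparatus
CN110544109A (zh) User profile generation method, apparatus, computer device, and storage medium
CN110855648B (zh) Early-warning control method and apparatus for network attacks
CN107944032B (zh) Method and apparatus for generating information
US20190347472A1 (en) Method and system for image identification
WO2019061664A1 (zh) Electronic device, product recommendation method based on user Internet access data, and storage medium
CN111898561B (zh) Face authentication method, apparatus, device, and medium
CN115941322B (zh) Artificial-intelligence-based attack detection method, apparatus, device, and storage medium
CN109377347A (zh) Feature-selection-based network credit early-warning method, system, and electronic device
WO2021175010A1 (zh) User gender identification method, apparatus, electronic device, and storage medium
CN110414562A (zh) X-ray film classification method, apparatus, terminal, and storage medium
CN114238764A (zh) Recurrent-neural-network-based course recommendation method, apparatus, and device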

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019532747

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19785382

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/02/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19785382

Country of ref document: EP

Kind code of ref document: A1