WO2019019630A1 - Anti-fraud identification method, storage medium, server carrying Ping An Brain, and device

Publication number
WO2019019630A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
model
feature
fraud
training
Application number
PCT/CN2018/077230
Other languages
French (fr)
Chinese (zh)
Inventor
肖京
王健宗
王建明
徐亮
汪伟
周宝
李想
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates

Abstract

Disclosed in the present application is an anti-fraud identification method, used for solving the problem of insufficient anti-fraud capabilities in the medical field. The method provided in the present application comprises: determining a target event; extracting target data related to the target event; and using at least two methods from among a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior to process the target data. Further provided in the present application are a storage medium and a server carrying a Ping An Brain.

Description

Anti-fraud identification method, storage medium, and server and device carrying Ping An Brain
This application claims priority to Chinese Patent Application No. CN201710605531.7, filed with the Chinese Patent Office on July 24, 2017 and entitled "Anti-fraud identification method, storage medium, and server carrying Ping An Brain", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the medical field, and in particular to an anti-fraud identification method, a storage medium, and a server and device carrying Ping An Brain.
Background
Many kinds of fraud currently occur in the medical field, such as "drug rat" behavior, claims fraud, and illegal card-swiping for reimbursement. Such fraud wastes medical resources and intensifies social conflict.
However, there is currently no complete method for identifying such fraud, so anti-fraud capability in the medical field is insufficient and fraud is difficult to control. Finding an anti-fraud method that further improves anti-fraud capability in the medical field has therefore become an urgent problem for those skilled in the art.
Summary of the Invention
Technical Problem
Embodiments of the present application provide an anti-fraud identification method, a storage medium, and a server and device carrying Ping An Brain, to solve the problem that no complete method currently exists for identifying fraud, which leaves anti-fraud capability in the medical field insufficient and fraud difficult to control.
Solution to the Problem
Technical Solution
In a first aspect, an anti-fraud identification method is provided, including:
determining a target event;
extracting target data related to the target event;
processing the target data using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;
the method for constructing a decision model includes:
obtaining rule template data, and extracting each variable object and each template sample from the rule template data;
performing cluster analysis on the variable objects to obtain a clustering result;
matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as a first feature;
calculating the black sample probability of each variable object, and using the black sample probabilities of the variable objects as a second feature;
constructing a decision model from the first feature and the second feature;
the method for identifying fraud data includes:
training on a preset training data set using a preset continuous model training method to establish a continuous anti-fraud model;
training on the data to be tested based on the continuous anti-fraud model, and identifying fraud data in the data to be tested;
the method for identifying fraudulent behavior includes:
establishing a relationship network of doctor-patient and drug-diagnosis relationships based on social security medical visit data, wherein the relationship network includes a plurality of nodes, and the nodes are linked by different relationships;
analyzing the group medical treatment behavior of each node in the relationship network to extract the multi-dimensional group medical treatment features corresponding to each node;
inputting the extracted multi-dimensional group medical treatment features into a preset classification model to identify the fraud rate of each node according to the classification model;
wherein the target data includes at least two of the data to be tested, the rule template data, and the social security medical visit data.
In a second aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the anti-fraud identification method described above.
In a third aspect, a server carrying the Ping An Brain big data platform is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the steps of the anti-fraud identification method described above when executing the computer-readable instructions.
In a fourth aspect, an apparatus for identifying fraud data is provided, including:
a modeling module, configured to train on a preset training data set using a preset continuous model training method to establish a continuous anti-fraud model;
an identification module, configured to train on the data to be tested based on the continuous anti-fraud model and identify fraud data in the data to be tested.
Advantageous Effects of the Invention
Advantageous Effects
In the embodiments of the present application, the target data of a target event is processed using at least two of the method for constructing a decision model, the method for identifying fraud data, and the method for identifying fraudulent behavior, so that anti-fraud decisions and identification for events in the medical field can be made more comprehensively and completely.
Brief Description of the Drawings
Description of the Drawings
FIG. 1 is a flowchart of an anti-fraud identification method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for constructing a decision model in one embodiment;
FIG. 3 is a schematic flowchart of a method for constructing a decision model in another embodiment;
FIG. 4 is a schematic flowchart of how a decision model is constructed in one embodiment;
FIG. 5 is a schematic flowchart of cluster analysis of variable objects in one embodiment;
FIG. 6 is a schematic structural diagram of an apparatus for constructing a decision model in one embodiment;
FIG. 7 is a schematic flowchart of a first embodiment of the method for identifying fraud data of the present application;
FIG. 8 is a schematic flowchart of a second embodiment of the method for identifying fraud data of the present application;
FIG. 9 is a schematic diagram of the functional modules of a first embodiment of the apparatus for identifying fraud data of the present application;
FIG. 10 is a schematic flowchart of a first embodiment of the method for identifying social security fraud of the present application;
FIG. 11 is a detailed flowchart of step S10 in FIG. 10;
FIG. 12 is a detailed flowchart of step S30 in FIG. 10;
FIG. 13 is a schematic flowchart of a second embodiment of the method for identifying social security fraud of the present application;
FIG. 14 is a schematic diagram of the functional modules of a first embodiment of the apparatus for identifying social security fraud of the present application;
FIG. 15 is a preferred schematic diagram of the relationship network of the present application;
FIG. 16 is a schematic diagram of a server carrying the Ping An Brain big data platform according to an embodiment of the present application.
Embodiments of the Invention
Modes for Carrying Out the Invention
Embodiments of the present application provide an anti-fraud identification method, a storage medium, and a server and device carrying Ping An Brain, for making anti-fraud decisions and identification for events in the medical field more comprehensively and completely.
The Ping An Brain big data platform provided by the present invention draws on the group's big data resources in financial and non-financial fields and, combining its own technical and platform strengths, applies leading big data technologies such as data mining, machine learning, and deep learning to classify and manage structured and unstructured data in fine detail and to mine the value of the data.
As shown in FIG. 1, the anti-fraud identification method provided by the present application includes: 101, determining a target event; 102, extracting target data related to the target event; and 103, processing the target data using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior. The target data includes at least two of data to be tested, rule template data, and social security medical visit data.
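Step 103 can be pictured as a small dispatcher. The following is only an illustrative sketch: the three functions are hypothetical placeholders for the three methods, and the dictionary keys and the function name `process_target_data` are names assumed here rather than taken from the application. It shows nothing more than the "at least two methods" constraint and the pairing of each method with the kind of target data it consumes.

```python
# Hypothetical placeholders for the three methods; each just returns a label.
def build_decision_model(rule_template_data):
    return "decision model"

def identify_fraud_data(data_to_be_tested):
    return "fraud data"

def identify_fraud_behavior(social_security_visit_data):
    return "fraud rates"

# Assumed pairing of each method with the target data it works on.
METHODS = {
    "decision_model": ("rule_template_data", build_decision_model),
    "fraud_data": ("data_to_be_tested", identify_fraud_data),
    "fraud_behavior": ("social_security_visit_data", identify_fraud_behavior),
}

def process_target_data(target_data, chosen_methods):
    """Run at least two of the three methods on the matching target data."""
    if len(set(chosen_methods)) < 2:
        raise ValueError("at least two methods must be chosen")
    results = {}
    for name in chosen_methods:
        data_key, method = METHODS[name]
        if data_key not in target_data:
            raise KeyError(f"target data is missing {data_key!r}")
        results[name] = method(target_data[data_key])
    return results

target_data = {"rule_template_data": [], "data_to_be_tested": []}
results = process_target_data(target_data, ["decision_model", "fraud_data"])
```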
Embodiment 1:
Referring to FIG. 2, one embodiment of a method for constructing a decision model in the embodiments of the present application includes:
Step S110: obtain rule template data, and extract each variable object and each template sample from the rule template data.
Specifically, a rule template is a set of criteria used to help determine an audit result. The audit of a document or project may correspond to one or more rule templates. For example, reviewing a borrower's creditworthiness may involve rule templates such as "at which branches has the borrower taken out loans" and "at which institutions has the borrower had a bad record". Each rule template has its own rule template data, which may include the variable objects, the template samples, and the matching relationship between variable objects and template samples. A variable object is a qualitative variable, and each variable object corresponds to a distinct category in the rule template. For example, for the rule template "at which branches has the borrower taken out loans", the rule template data may include: user 1 took out a loan at branch A, user 2 took out a loan at branch B, user 3 took out a loan at branch C, and so on. Here the branches (branch A, branch B, branch C, and so on) are the variable objects, and the users (user 1, user 2, user 3, and so on) are the template samples.
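As a minimal sketch of step S110, the rule template data can be thought of as a list of (template sample, variable object) pairs. The representation and the function name `extract_objects_and_samples` are assumptions made for illustration; the application does not prescribe a data format.

```python
# Rule template data for "at which branches has the borrower taken out loans",
# represented (as an assumption) as (template sample, variable object) pairs.
rule_template_data = [
    ("User 1", "Branch A"),
    ("User 2", "Branch B"),
    ("User 3", "Branch C"),
]

def extract_objects_and_samples(records):
    """Return the variable objects, the template samples, and their matching relationship."""
    variable_objects = sorted({obj for _, obj in records})
    template_samples = sorted({sample for sample, _ in records})
    matches = {}
    for sample, obj in records:
        matches.setdefault(sample, set()).add(obj)
    return variable_objects, template_samples, matches

variable_objects, template_samples, matches = extract_objects_and_samples(rule_template_data)
```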
Step S120: perform cluster analysis on the variable objects to obtain a clustering result.
Specifically, multi-dimensional data may be extracted for each variable object, and the variable objects may be clustered according to that data. Multi-dimensional data refers to data describing the variable object along each dimension. For example, if the variable objects are branches, the multi-dimensional data may include each branch's total number of borrowers, total loan amount, average loan period, branch size, geographic location, and so on. Cluster analysis is the process of grouping a set of physical or abstract objects into classes of similar objects. Clustering the variable objects groups similar or closely related variable objects together and reduces the number of levels of the variable objects. For example, suppose the variable objects include branch A, branch B, branch C, branch D, and so on. After cluster analysis, branch A and branch B are found to be similar and are assigned to group A, and branch C and branch D are found to be similar and are assigned to group B, and so on. The variable objects are thereby reduced from the level of individual branches to the level of groups. After the cluster analysis, a clustering result consisting of the individual clusters is obtained.
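The application does not name a clustering algorithm. The sketch below uses k-means purely as one possible choice, with made-up, pre-scaled two-dimensional data for four branches; the function name `kmeans`, the feature values, and the deterministic "first k objects" initialization are all assumptions for illustration.

```python
import math

# Hypothetical multi-dimensional data per variable object, already scaled to [0, 1]
# (for example: total number of borrowers, total loan amount).
branches = {
    "Branch A": (0.9, 0.8),
    "Branch B": (1.0, 0.9),
    "Branch C": (0.1, 0.2),
    "Branch D": (0.2, 0.1),
}

def kmeans(points, k, iterations=20):
    """Plain k-means; returns a mapping from object name to cluster index."""
    names = sorted(points)
    # Deterministic initialization for the example: the first k objects are the centroids.
    centroids = [points[name] for name in names[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign every object to its nearest centroid.
        assignment = {
            name: min(range(k), key=lambda c: math.dist(points[name], centroids[c]))
            for name in names
        }
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [points[name] for name in names if assignment[name] == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment

assignment = kmeans(branches, k=2)
```

With this data, branches A and B end up in one cluster and branches C and D in the other, which mirrors the grouping described in the text. A real deployment would normally use a better initialization and more features.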
Step S130: match the clustering result with each template sample according to the rule template data, and use the matched clustering result as the first feature.
Specifically, after the variable objects are clustered and the clustering result is obtained, the clustering result can be matched with each template sample according to the matching relationship between variable objects and template samples in the rule template data. For example, for the rule template "at which institutions has the borrower had a bad record", the rule template data includes: user 1 had a bad record at institution FK, user 2 had a bad record at institution CE, user 3 had a bad record at institution KD, and so on. Cluster analysis is performed on the variable objects (institution FK, institution CE, institution KD, and so on) to obtain clusters named group A, group B, group C, and so on, and the clustering result is matched with the template samples user 1, user 2, user 3, and so on. As shown in the tables below, Table 1 gives the matching relationship between variable objects and template samples in the rule template data, and Table 2 gives the matching relationship between the clustering result and each template sample. A "1" may be used to indicate that a template sample matches a variable object or cluster, but the representation is not limited to this.
Table 1
          FK    CE    KD    ...
User 1    1     0     0     ...
User 2    0     1     0     ...
User 3    0     1     1     ...
...       ...   ...   ...   ...
Table 2
          Group A   Group B   Group C   ...
User 1    1         0         0         ...
User 2    1         0         0         ...
User 3    0         1         0         ...
...       ...       ...       ...       ...
Clustering the variable objects clearly reduces the number of levels of the variable objects, which facilitates modeling.
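The matching in step S130 can be sketched as replacing each variable object by the cluster it belongs to and then building one 0/1 indicator per cluster. The records below follow the example in the text (user 1 at FK, user 2 at CE, user 3 at KD); the cluster assignment `cluster_of` and the function name `first_feature` are assumed for illustration.

```python
# Matching relationship between template samples and variable objects (from the text).
records = {"User 1": {"FK"}, "User 2": {"CE"}, "User 3": {"KD"}}

# Assumed clustering result: FK and CE fall into group A, KD into group B.
cluster_of = {"FK": "Group A", "CE": "Group A", "KD": "Group B"}
groups = sorted(set(cluster_of.values()))

def first_feature(records, cluster_of, groups):
    """Return, per template sample, a 0/1 indicator for each cluster."""
    feature = {}
    for sample, objects in records.items():
        matched_groups = {cluster_of[obj] for obj in objects}
        feature[sample] = [1 if group in matched_groups else 0 for group in groups]
    return feature

feature_1 = first_feature(records, cluster_of, groups)
```

Three institution columns collapse into two group columns here; with hundreds of institutions the reduction is what makes the feature usable for modeling.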
Step S140: calculate the black sample probability of each variable object, and use the black sample probabilities of the variable objects as the second feature.
Specifically, the output of a decision model is usually a black sample or a white sample. A black sample is a sample that fails the audit, and a white sample is a sample that passes it. For example, if the decision model is used for bank loan qualification review, black samples are users who fail the review and white samples are users who pass it. Calculating the black sample probability of each variable object means determining, for each variable object in the rule template data, the proportion of its template samples whose result type is black. For example, for the rule template "at which institutions has the borrower had a bad record", one can calculate "the probability that a user with a bad record at institution KD ends up as a black sample". The black sample probability of a variable object may be calculated as: black sample probability = number of black samples of the variable object / total number of template samples of the variable object. The calculated black sample probabilities may be used as the second feature in the form of a continuous variable.
In other embodiments, the WOE (weight of evidence) value of each variable object may be calculated instead, using the formula WOE = ln(proportion of the variable object's black samples among all black samples / proportion of the variable object's white samples among all white samples). With the formula as written, a higher WOE value means that black samples are over-represented for that variable object, that is, its template samples are more likely to be black samples.
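Both quantities follow directly from counts. The sketch below applies the two formulas from the text to made-up data; the sample list, the function name `black_probability_and_woe`, and the choice to skip WOE when a count is zero (where the logarithm is undefined) are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical (variable object, result type) pairs, one per template sample.
samples = (
    [("FK", "black")] + [("FK", "white")] * 4
    + [("KD", "black")] * 3 + [("KD", "white")]
)

def black_probability_and_woe(samples):
    black = Counter(obj for obj, result in samples if result == "black")
    white = Counter(obj for obj, result in samples if result == "white")
    total_black, total_white = sum(black.values()), sum(white.values())
    black_prob, woe = {}, {}
    for obj in sorted(set(black) | set(white)):
        n_black, n_white = black[obj], white[obj]
        # Black sample probability = black samples of the object / all samples of the object.
        black_prob[obj] = n_black / (n_black + n_white)
        # WOE = ln(share of all black samples / share of all white samples);
        # skipped here when either count is zero.
        if n_black and n_white:
            woe[obj] = math.log((n_black / total_black) / (n_white / total_white))
    return black_prob, woe

black_prob, woe = black_probability_and_woe(samples)
```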
Step S150: construct a decision model from the first feature and the second feature.
Specifically, the current approach to constructing a decision model is to feed all of the rule template data into modeling. The rule template data is voluminous and has complex levels, which hinders modeling and affects model performance. Using the matched clustering result as the first feature and the black sample probability of each variable object as the second feature, in place of the original rule template data, not only reduces the number of levels in the data but also preserves the influence of each variable object on the decision result, making the decision result more accurate. The decision model may be a machine learning model such as a decision tree, a GBDT (Gradient Boosting Decision Tree) model, or an LDA (Linear Discriminant Analysis) model. An audit decision model for a document or project may correspond to one or more rule templates. In that case, the first feature and the second feature are obtained for each rule template and used in place of the original rule template data to construct the decision model. When a rule template has only a few variable objects, its rule template data may be input directly to construct the model.
In the above method for constructing a decision model, the variable objects and template samples are extracted from the rule template data; the variable objects are clustered to obtain a clustering result; the clustering result is matched with each template sample according to the rule template data, and the matched clustering result is used as the first feature; the black sample probability of each variable object is calculated and used as the second feature; and the decision model is constructed from the first feature and the second feature. Clustering the variable objects reduces the dimensions and levels of the data, which facilitates constructing the decision model and reduces the adverse effect on model performance. In addition, a decision model constructed from the first feature and the second feature performs more accurately, helps process business that requires complex rule review quickly, and improves decision efficiency.
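One way to picture the model input is a row per template sample that concatenates the cluster indicators (first feature) with a black sample probability (second feature). The application does not say how a per-object probability is attached to a sample; taking the maximum over the sample's matched variable objects is an assumption made here, as are the data values and the function name `build_rows`.

```python
# Assumed inputs: matching relationship, clustering result, and black sample probabilities.
records = {"User 1": {"FK"}, "User 2": {"CE"}, "User 3": {"KD"}}
cluster_of = {"FK": "Group A", "CE": "Group A", "KD": "Group B"}
groups = ["Group A", "Group B"]
black_prob = {"FK": 0.2, "CE": 0.15, "KD": 0.75}

def build_rows(records, cluster_of, groups, black_prob):
    """Return one feature row per template sample: cluster indicators plus a probability."""
    rows = {}
    for sample, objects in records.items():
        matched_groups = {cluster_of[obj] for obj in objects}
        indicators = [1 if group in matched_groups else 0 for group in groups]
        # Assumption: use the highest black sample probability among matched objects.
        probability = max(black_prob[obj] for obj in objects)
        rows[sample] = indicators + [probability]
    return rows

rows = build_rows(records, cluster_of, groups, black_prob)
```

Rows of this shape could then be passed, together with each sample's result type, to any of the model families mentioned above.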
As shown in FIG. 3, in one embodiment, the above method for constructing a decision model further includes:
Step S210: map each variable object to a predefined label according to a preset algorithm.
Specifically, the labels may be defined in advance and the variable objects mapped to them. The preset algorithm may include a hash function such as MD5 (Message-Digest Algorithm 5) or SHA (Secure Hash Algorithm), but is not limited to these. For example, if the variable objects are branch A, branch B, branch C, and so on, the SHA algorithm may map branch A and branch C to label A and branch B to label K. The number of labels can be set according to the actual situation so that no single label contains too many variable objects. This reduces the dimensions and levels of the data while retaining part of the original information.
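A minimal sketch of the hash mapping, assuming SHA-256 from Python's `hashlib` and a modulo over the number of predefined labels. The label names and the function name `map_to_label` are hypothetical; the application only says that a hash function such as MD5 or SHA may be used.

```python
import hashlib

# Predefined labels; their number is chosen according to the actual situation.
labels = ["Label A", "Label B", "Label C", "Label K"]

def map_to_label(variable_object, labels):
    """Map a variable object to one of the predefined labels via a hash of its name."""
    digest = hashlib.sha256(variable_object.encode("utf-8")).digest()
    index = int.from_bytes(digest, "big") % len(labels)
    return labels[index]

label_of = {name: map_to_label(name, labels) for name in ["Branch A", "Branch B", "Branch C"]}
```

The mapping is deterministic, so the same variable object always lands in the same label, and unrelated objects may share a label, which is how the number of levels is reduced.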
Step S220: match the labels with each template sample according to the rule template data, and use the matched labels as the third feature.
Specifically, the labels can be matched with each template sample according to the matching relationship between variable objects and template samples in the rule template data, and the matched labels are used as the third feature for modeling.
Step S230: construct a decision model from the first feature, the second feature, and the third feature.
Specifically, the matched clustering result is used as the first feature, the black sample probability of each variable object as the second feature, and the matched labels as the third feature. These three features replace the full rule template data as the input for constructing the decision model, which not only reduces the number of levels in the data but also preserves the influence of each variable object on the decision result, making the decision result more accurate.
In the above method, the decision model is constructed from the first feature, the second feature, and the third feature. Clustering the variable objects and mapping them to predefined labels reduces the dimensions and levels of the data, which facilitates constructing the decision model, reduces the adverse effect on model performance, makes the model more accurate, helps process business that requires complex rule review quickly, and improves decision efficiency.
As shown in FIG. 4, in one embodiment, step S230 of constructing a decision model from the first feature, the second feature, and the third feature includes the following steps:
Step S302: establish an original node.
Specifically, in this embodiment the decision model may be a decision tree model, and the original node of the decision tree is established first.
Step S304: obtain the result type of each template sample according to the rule template data.
Specifically, the result type of a template sample is its final result, such as black sample or white sample. The result type of each template sample can be obtained from the rule template data.
Step S306: traverse the first feature, the second feature, and the third feature in turn to generate read records.
Specifically, traversing the first feature, the second feature, and the third feature to generate read records amounts to traversing every possible branch of the decision tree. For example, traversing the first feature generates read records such as "user 1 has a non-performing loan record in group A" and "user 2 has a non-performing loan record in group A", and traversing the second feature generates read records such as "the black sample probability of institution FK is 20%" and "the black sample probability of institution CE is 15%". Each read record may become a branch of the decision tree.
Step S308: calculate the split purity of each read record according to the result types of the template samples, and determine the split point according to the split purity.
Specifically, the split purity of each read record can be determined by calculating Gini impurity, entropy, information gain, or the like. Gini impurity is the expected error rate when a result drawn at random from a set is applied to a data item in that set; entropy measures the degree of disorder of a system; and information gain measures how well a read record distinguishes the template samples. The split purity of a read record can be understood as follows: if the template samples are divided according to that read record, how much do the predicted result types differ from the true result types? The smaller the difference, the greater the split purity and the purer the read record. For example, Gini impurity may be calculated as:
$$\mathrm{Gini} = 1 - \left[ P(1)^2 + P(2)^2 + \cdots + P(i)^2 + \cdots + P(m)^2 \right]$$
and split purity = 1 - Gini impurity, where i ∈ {1, 2, ..., m} indexes the m possible final results of the decision model, and P(i) is the proportion of template samples whose result type is the i-th final result when the read record is used as the judgment condition.
The best split point can be determined from the split purity of the read records: the read record with the greater split purity is used first as a branch to split the original node.
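The Gini formula and the derived split purity translate directly into code. In this sketch, `result_types` is assumed to be the list of result types of the template samples selected by one read record; the function names are chosen here for illustration.

```python
from collections import Counter

def gini_impurity(result_types):
    """Gini = 1 - sum over i of P(i)^2, where P(i) is the share of result type i."""
    total = len(result_types)
    counts = Counter(result_types)
    return 1 - sum((count / total) ** 2 for count in counts.values())

def split_purity(result_types):
    """Split purity = 1 - Gini impurity."""
    return 1 - gini_impurity(result_types)

mixed = ["black", "black", "white", "white"]   # evenly mixed
pure = ["black", "black", "black"]             # a single result type

mixed_purity = split_purity(mixed)
pure_purity = split_purity(pure)
```

An evenly mixed two-class set has Gini impurity 0.5 and split purity 0.5, while a set with a single result type has Gini impurity 0 and split purity 1, so the second read record would be preferred as a split.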
Step S310: acquire the feature corresponding to the split point and create new nodes.
Specifically, the feature corresponding to the split point may be acquired and new nodes created. For example, after the segmentation purity is calculated for each read record, the read record with the greatest segmentation purity may be "user 1 has a bad loan record in group A". The original node is then split into two branches, one for having a bad loan record in group A and the other for having no bad loan record in group A, and the corresponding nodes are generated. The next split point is then sought for each new node, and the node is split in turn, until all read records have been added to the decision tree. After the decision tree model is constructed, the tree may be pruned by removing the nodes corresponding to read records whose segmentation purity is less than a preset purity value, so that every branch of the decision tree has a relatively high segmentation purity. In other embodiments, the number of nodes of the decision tree may be set in advance, and construction stops once the decision tree reaches that number of nodes.
The above method of constructing a decision model traverses the first feature, the second feature and the third feature to generate read records, calculates the segmentation purity of each read record according to the result types of the template samples, determines the split points according to the segmentation purity, and constructs the decision model. This makes the model more accurate, helps to quickly handle business that requires complex rule review, and improves decision-making efficiency.
As shown in FIG. 5, in one embodiment, step S120 of performing cluster analysis on the variable objects to obtain a clustering result includes:
Step S402: randomly select a plurality of variable objects from the variable objects, each serving as the first cluster center of a cluster.
Specifically, a plurality of variable objects may be randomly selected from all the variable objects, each selected variable object is used as the first cluster center of one cluster, and each cluster is named. Each first cluster center corresponds to one cluster, so the number of clusters equals the number of selected variable objects.
Step S404: calculate the distance from each variable object to each first cluster center.
In one embodiment, step S404 of calculating the distance from each variable object to each first cluster center includes:
(a) Obtain the multidimensional data of each variable object from the rule template data.
Specifically, the multidimensional data of each variable object may be obtained from the rule template data. Multidimensional data refers to the data describing a variable object along each of its dimensions. For example, if the variable objects are bank branches, the multidimensional data may include each branch's total number of borrowers, total loan amount, average loan period, branch size, geographic location, and so on.
(b) Calculate the distance from each variable object to each first cluster center according to the multidimensional data of the variable objects.
Specifically, based on the multidimensional data obtained for each variable object, the distance between two variable objects may be calculated using a measure such as Euclidean distance or cosine similarity, and the distance from each variable object to each first cluster center is calculated in turn. For example, if there are four clusters with four corresponding first cluster centers, the distance from each variable object to the first of these cluster centers, to the second, and so on, is calculated.
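The two measures named above can be sketched as follows. This is an illustrative Python example only: the function names are hypothetical, the vectors are assumed to be numeric and already scaled to comparable ranges, and turning cosine similarity into a distance as 1 minus the similarity is one common convention rather than something the text prescribes.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical two-dimensional data for a branch and a cluster center
branch = [3.0, 4.0]
center = [0.0, 0.0]
print(euclidean_distance(branch, center))
```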
Step S406: divide the variable objects according to the calculation results, assigning each variable object to the cluster corresponding to the nearest first cluster center.
Specifically, after the distance from each variable object to each first cluster center has been calculated, each variable object may be assigned to the cluster whose first cluster center is nearest. In other embodiments, the calculated distance may instead be compared with a preset distance threshold, and when the distance between a variable object and a first cluster center is less than the distance threshold, the variable object is assigned to the cluster corresponding to that first cluster center.
Step S408: calculate the second cluster center of each cluster after the division.
Specifically, after the division is completed, each cluster may include one or more variable objects. The second cluster center of each cluster may be recalculated using the mean formula, thereby reselecting the center of each cluster.
Step S410: determine whether the distance between the first cluster center and the second cluster center of each cluster is less than a preset threshold; if so, perform step S414; if not, perform step S412.
Specifically, the distance between the first cluster center and the second cluster center of each cluster is calculated and compared with the preset threshold. If the distance is less than the preset threshold for all clusters, every cluster has stabilized and no longer changes, and the clusters may be output as the clustering result. If the distance between the first cluster center and the second cluster center of a cluster is not less than the preset threshold, the variable objects need to be divided among the clusters again.
Step S412: replace the first cluster center of each corresponding cluster with its second cluster center, and return to step S404.
Specifically, if the distance between the first cluster center and the second cluster center of a cluster is not less than the preset threshold, the second cluster center of that cluster replaces its first cluster center, and the step of calculating the distance from each variable object to each first cluster center is performed again. Steps S404 to S412 are repeated until every cluster has stabilized and no longer changes.
Step S414: output the clusters as the clustering result.
The above method of constructing a decision model performs cluster analysis on the variable objects and merges similar variable objects into one cluster, which reduces the number of levels in the data and facilitates construction of the decision model.
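Steps S402 to S414 follow the familiar k-means pattern, and a minimal Python sketch is given below. The function name `cluster_variable_objects`, the use of Euclidean distance, the `max_rounds` safety limit and the fixed random seed are assumptions added for illustration; the text only requires random initial centers, nearest-center assignment, mean-based recomputation and a threshold test.

```python
import math
import random


def cluster_variable_objects(points, k, threshold=1e-6, max_rounds=100, seed=0):
    """Cluster numeric tuples; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # S402: random first centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_rounds):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # S404 + S406: nearest center
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for i, members in enumerate(clusters):           # S408: mean of each cluster
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[i])           # keep center of empty cluster
        shift = max(math.dist(old, new)                  # S410: compare old and new
                    for old, new in zip(centers, new_centers))
        centers = new_centers                            # S412: replace first centers
        if shift < threshold:
            break
    return centers, clusters                             # S414: output clusters


# Hypothetical two-dimensional data for four branches
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = cluster_variable_objects(points, k=2)
print(sorted(centers))
```

For this toy input the two nearby pairs end up in separate clusters regardless of which two points are drawn as initial centers.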
Embodiment 2:
As shown in FIG. 6, an apparatus for constructing a decision model includes an extraction module 510, a clustering module 520, a first feature module 530, a second feature module 540 and a construction module 550.
The extraction module 510 is configured to acquire rule template data and extract the variable objects and the template samples from the rule template data.
The clustering module 520 is configured to perform cluster analysis on the variable objects to obtain a clustering result.
The first feature module 530 is configured to match the clustering result with the template samples according to the rule template data, and to use the matched clustering result as the first feature.
The second feature module 540 is configured to calculate the black sample probability of each variable object, and to use the black sample probability of each variable object as the second feature.
The construction module 550 is configured to construct the decision model from the first feature and the second feature.
Embodiment 3:
Referring to FIG. 7, one embodiment of a method for identifying fraud data in the embodiments of the present application includes:
Step K10: train on a preset training data set using a preset continuous model training mode to establish a continuous anti-fraud model;
In this embodiment, a preset continuous model training mode is first used, in combination with data analysis theory such as decision trees and random forests and data analysis tools such as R and SAS, to train on the preset training data set and establish the continuous anti-fraud model. For example, the preset training data set may be divided into multiple groups, with training and intermediate testing performed on each, to establish the continuous anti-fraud model. When training with the preset continuous model training mode, in one implementation the preset training data set may be divided into multiple groups, and model training and testing are performed within each group. The training results of the groups are relatively independent and do not affect one another. The models obtained from training and testing each group are then integrated to obtain the final continuous anti-fraud model.
In another implementation, the preset training data set may be divided into multiple groups, and model training and testing are performed on the groups one after another. The result of training and testing on one group serves as the basis for training and testing on the next group, so the training results of adjacent groups are linked. The model is thus continuously optimized and improved throughout training, yielding the final continuous anti-fraud model.
Of course, other model training modes may also be used to train on the preset training data set to establish the continuous anti-fraud model; no limitation is imposed here.
Step K20: perform training on the data to be tested based on the continuous anti-fraud model, and identify the fraud data in the data to be tested.
After the continuous anti-fraud model is established, it can be used to train on the data to be tested so as to analyze and identify the fraud data therein. For example, the data to be tested may be trained and tested with the established continuous anti-fraud model in a manner identical or similar to the way the preset training data set was tested when the model was established, and the fraud data in the data to be tested is identified from the results of that training and testing.
In some scenarios prone to fraud, such as malicious social security reimbursement, fraud data accounts for an extremely small proportion of the overall social security big data; that is, the fraud data is highly imbalanced. If an ordinary single model were used to identify the fraud data, this imbalance would lead to low precision and recall. Therefore, in this embodiment, a continuous anti-fraud model is established to address the imbalance of fraud data when identifying the data to be tested; for example, multiple models may vote jointly to identify fraud data. This effectively improves the precision and recall of fraud data identification, allows fraud cases to be determined more accurately, and thereby narrows the scope and reduces the cost of manual review.
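One simple reading of "multiple models voting jointly" is majority voting, sketched below in Python. This is an assumption for illustration: the text does not say how votes are combined, and the three rule-like models and the record fields (`claims_per_month`, `amount`, `distinct_hospitals`) are hypothetical.

```python
def majority_vote(models, record):
    """Flag `record` as fraud when more than half of the models say so."""
    votes = sum(1 for model in models if model(record))
    return 2 * votes > len(models)


# Hypothetical models, each a callable returning True for suspected fraud
models = [
    lambda r: r["claims_per_month"] > 10,
    lambda r: r["amount"] > 5000,
    lambda r: r["distinct_hospitals"] > 3,
]

suspicious = {"claims_per_month": 14, "amount": 800, "distinct_hospitals": 6}
ordinary = {"claims_per_month": 1, "amount": 9000, "distinct_hospitals": 1}

print(majority_vote(models, suspicious), majority_vote(models, ordinary))
```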
This embodiment establishes a continuous anti-fraud model using a preset continuous model training mode, uses the established continuous anti-fraud model to train on the data to be tested, and identifies the fraud data in the data to be tested. Because the fraud data in the data to be tested is imbalanced, using a continuous anti-fraud model to analyze and identify it improves the precision and recall of fraud data identification compared with an ordinary single model, allows fraud cases to be determined more accurately, and thereby narrows the scope and reduces the cost of manual review.
Further, in other embodiments, the continuous anti-fraud model used to analyze and identify the fraud data in the data to be tested is a direct continuous model, and step K10 above may be replaced with:
dividing the preset training data set into a training set and a test set according to a preset ratio;
retaining the test set, and further dividing the training set into two sub-training sets according to a preset ratio, the two sub-training sets serving respectively as the training set and the test set of the next-layer model;
repeating the division of the training set in turn until a preset number of divisions is reached;
training models with preset classical models on each of the resulting layers of training sets, and testing on the retained layers of test sets, to establish the direct continuous model.
In this embodiment, N-level continuous model training may be performed to establish the direct continuous model, where N is a positive integer greater than or equal to 2. For example, the direct continuous model may be trained as follows:
Step 1: divide the preset training data set into a training set Train_set and a test set Test_set according to a preset ratio, and retain the test set Test_set.
Step 2: further divide the training set Train_set into two sub-training sets Train_setl1 and Train_setl2 according to a preset ratio, and use Train_setl1 and Train_setl2 respectively as the training set and the test set of the next-layer model.
Repeat Step 2 to divide the training set a preset number of times.
Step 3: using each of the N layers of training sets, train a model with a preset, commonly used classical model and tune its parameters; test on the N layers of test sets, tune the parameters, and retain the models. The classical models include, but are not limited to, decision tree models and random forest models.
Step 4: collect, organize and tune the retained models to obtain the direct continuous model.
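The layered division in Steps 1 to 4 can be sketched in Python as follows. The function names, the 0.8 ratio, the fixed random seed and the `fit(train, test)` callback are hypothetical. The sketch also assumes that it is the training part of each layer that is divided again, which is one reading of "repeat Step 2"; the model-integration and tuning of Step 4 are not shown.

```python
import random


def split(records, ratio, rng):
    """Shuffle a copy of `records` and cut it into two parts at `ratio`."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def build_layers(dataset, ratio=0.8, n_layers=3, seed=0):
    """Return [(train, test), ...], one pair per layer, top layer first."""
    rng = random.Random(seed)
    layers = []
    current = list(dataset)
    for _ in range(n_layers):
        train_part, test_part = split(current, ratio, rng)
        layers.append((train_part, test_part))
        current = train_part  # assumption: the training part is divided again
    return layers


def train_direct_continuous(dataset, fit, ratio=0.8, n_layers=3, seed=0):
    """Train and retain one model per layer using a caller-supplied `fit`."""
    return [fit(train_part, test_part)
            for train_part, test_part in build_layers(dataset, ratio, n_layers, seed)]


layers = build_layers(list(range(100)), ratio=0.8, n_layers=3)
print([(len(train), len(test)) for train, test in layers])
```

In practice `fit` would wrap a decision tree or random forest trainer; here it is left abstract so that the sketch stays independent of any particular library.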
Further, step K20 above may be replaced with:
performing on the data to be tested a multi-layer division with the same ratios as those used for the training sets of the training data set, training on each layer of the divided data to be tested with the direct continuous model, and identifying the fraud data in the data to be tested.
After the direct continuous model is established, it can be used to train on the data to be tested so as to analyze and identify the fraud data therein. Specifically, the data to be tested for fraud may be randomly split in the same ratios as the repeated divisions of the training set made when the model was established. The established direct continuous model is then used to perform the corresponding model training on each layer of the data to be tested so divided, and the training results for all layers are collected. From these results, the fraud data identified in each layer can be obtained, and combining the fraud data identified in every layer yields the final fraud data in the data to be tested.
Further, in other embodiments, the continuous anti-fraud model used to analyze and identify the fraud data in the data to be tested is an optimized continuous model, and step K10 above may be replaced with:
dividing the preset training data set into a training set and a test set according to a preset ratio;
retaining the test set, and further dividing the training set into two sub-training sets according to a preset ratio, the two sub-training sets serving respectively as the lower-layer training set and the lower-layer test set of the next-layer model;
training a model with the lower-layer training set and testing it on the lower-layer test set, obtaining positive samples according to the test result and retaining the trained model, and using the obtained positive samples as a new training set;
repeating the steps of dividing the training set and testing in turn until the number of positive samples obtained is zero or the multi-level training model has been fully established;
collecting and organizing the established multi-level training model to obtain the optimized continuous model.
In this embodiment, N-level continuous model training may be performed to establish the optimized continuous model. For example, the optimized continuous model may be trained as follows:
Step 1: divide the preset training data set into a training set Train_set and a test set Test_set according to a preset ratio, and retain the test set Test_set.
Step 2: further divide the training set Train_set into two sub-training sets Train_setl1 and Train_setl2 according to a preset ratio, and use Train_setl1 and Train_setl2 respectively as the training set and the test set of the next-layer model.
Step 3: train and tune a model using the lower-layer training set Train_setl1 as the training set, test it on the lower-layer test set Train_setl2, obtain positive samples according to the test result, and retain the model.
Step 4: take the positive samples obtained in Step 3 to form a training set.
Step 5: repeat Steps 2 to 4 until the N-th level model has been constructed or the number of positive samples is zero, where N is a positive integer greater than or equal to 2.
Step 6: collect, organize and tune the constructed N-level model, that is, the multi-level training model, to obtain the optimized continuous model.
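The loop in Steps 2 to 5 can be sketched as a cascade in Python. The function name `train_optimized_continuous`, the 0.5 ratio, the fixed seed and the `fit` callback (which takes a training list and returns a predicate that is true for positive samples) are hypothetical, and the tuning and integration of Step 6 are omitted.

```python
import random


def train_optimized_continuous(train_set, fit, max_levels, ratio=0.5, seed=0):
    """Train up to `max_levels` models, each on the previous level's positives."""
    rng = random.Random(seed)
    levels = []
    current = list(train_set)
    while len(levels) < max_levels and current:
        rng.shuffle(current)
        cut = int(len(current) * ratio)
        lower_train, lower_test = current[:cut], current[cut:]   # Step 2
        if not lower_train:
            break                                  # too few samples to divide again
        model = fit(lower_train)                   # Step 3: train and retain the model
        levels.append(model)
        current = [r for r in lower_test if model(r)]  # Step 4: positives become the new set
    return levels                                  # Step 5 ends at N levels or no positives


# Two degenerate `fit` callbacks show the two stopping conditions
always_positive = lambda train: (lambda record: True)
never_positive = lambda train: (lambda record: False)

print(len(train_optimized_continuous(list(range(10)), always_positive, max_levels=3)))
print(len(train_optimized_continuous(list(range(10)), never_positive, max_levels=3)))
```

With a model that flags everything, training stops when the N-th level is built; with a model that flags nothing, it stops after the first level because no positive samples remain.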
Further, step K20 above may be replaced with:
performing top-down testing on the data to be tested with the optimized continuous model, and obtaining and retaining positive samples according to the test results, so as to identify the fraud data in the data to be tested according to the positive samples.
After the optimized continuous model is established, it can be used to train on the data to be tested so as to analyze and identify the fraud data therein. Specifically, the established optimized continuous model may be applied directly to the data to be tested for top-down prediction, retaining the positive samples produced as the optimized continuous model predicts on the data to be tested, and continuing level by level until the N-th level model of the optimized continuous model. Combining the positive samples predicted by each level for the data to be tested yields the final fraud data in the data to be tested.
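A minimal Python sketch of this top-down prediction follows. It assumes that each level only sees the positive samples kept by the level above it, which mirrors how the levels were trained but is only one reading of the text; the function name, the two level models and the record fields are hypothetical.

```python
def cascade_predict(records, levels):
    """Pass records through the level models in order, keeping positives.

    Returns (final_positives, positives_per_level).
    """
    current = list(records)
    per_level = []
    for model in levels:
        current = [r for r in current if model(r)]
        per_level.append(current)
        if not current:
            break  # nothing left to pass to lower levels
    return current, per_level


# Hypothetical two-level model: each level is a predicate on a record
levels = [
    lambda r: r["amount"] > 100,
    lambda r: r["visits"] > 3,
]
records = [
    {"id": "a", "amount": 50, "visits": 9},
    {"id": "b", "amount": 500, "visits": 1},
    {"id": "c", "amount": 900, "visits": 7},
]
fraud, per_level = cascade_predict(records, levels)
print([r["id"] for r in fraud])
```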
As shown in FIG. 8, on the basis of Embodiment 3 above, the method further includes, after step K20:
Step K30: mark the type and/or source of the fraud data.
In this embodiment, after the fraud data in the data to be tested is identified using the established continuous anti-fraud model, the type and/or source of the identified fraud data is further marked to indicate its feature type and/or source. This allows the relevant review department or staff to focus on other data whose type or source is the same as or similar to that of the marked fraud data, narrowing the scope of manual review. For example, the social security medical reimbursement system sees some malicious or illegal card-swiping and reimbursement behavior. After the fraud data in the social security medical reimbursement data to be tested is identified using the established continuous anti-fraud model, the type and/or source of the identified fraud data may be marked, for example as traditional Chinese medicine, Western medicine, or diagnosis and treatment. The social security department can then strictly control traditional Chinese medicine, Western medicine, and diagnosis and treatment as high-risk areas where false reimbursement may occur, thereby reducing the scope of review and improving the accuracy and efficiency of fraud data identification.
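The marking step can be illustrated with the short Python sketch below. The record fields `type` and `source`, the `mark` format and both function names are hypothetical; the text only says that type and/or source are marked so that reviewers can focus on similar data.

```python
from collections import Counter


def mark_fraud(records):
    """Return copies of the identified fraud records with a 'type/source' mark."""
    return [dict(r, mark=f"{r['type']}/{r['source']}") for r in records]


def high_risk_types(marked, top=2):
    """List the most frequent fraud types as candidate high-risk areas."""
    counts = Counter(r["type"] for r in marked)
    return [t for t, _ in counts.most_common(top)]


# Hypothetical fraud records already identified by the model
identified = [
    {"type": "western medicine", "source": "hospital A"},
    {"type": "western medicine", "source": "hospital B"},
    {"type": "western medicine", "source": "hospital A"},
    {"type": "traditional Chinese medicine", "source": "hospital A"},
    {"type": "traditional Chinese medicine", "source": "hospital C"},
    {"type": "diagnosis and treatment", "source": "hospital B"},
]
marked = mark_fraud(identified)
print(high_risk_types(marked))
```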
Embodiment 4:
Referring to FIG. 9, one embodiment of an apparatus for identifying fraud data in the embodiments of the present application includes:
a modeling module 01, configured to train on a preset training data set using a preset continuous model training mode to establish a continuous anti-fraud model;
an identification module 02, configured to perform training on the data to be tested based on the continuous anti-fraud model, and to identify the fraud data in the data to be tested.
This embodiment establishes a continuous anti-fraud model using a preset continuous model training mode, uses the established continuous anti-fraud model to train on the data to be tested, and identifies the fraud data in the data to be tested. Because the fraud data in the data to be tested is imbalanced, using a continuous anti-fraud model to analyze and identify it improves the precision and recall of fraud data identification compared with an ordinary single model, allows fraud cases to be determined more accurately, and thereby narrows the scope and reduces the cost of manual review.
Embodiment 5:
Referring to FIG. 10, one embodiment of a method for identifying fraudulent behavior in the embodiments of the present application includes:
establishing a relationship network of doctor-patient and medication-diagnosis relations based on social security medical visit data, wherein the relationship network includes a number of nodes and the nodes are connected by different relations; analyzing the group medical-seeking behavior of each node in the relationship network to extract the multi-dimensional group medical-seeking features of each node; and inputting the extracted multi-dimensional group medical-seeking features into a preset classification model to identify the fraud rate of each node according to the classification model.
The following are the specific steps by which this embodiment identifies social security fraud:
Step Y10: establish a relationship network of doctor-patient and medication-diagnosis relations based on social security medical visit data, wherein the relationship network includes a number of nodes and the nodes are connected by different relations;
In this embodiment, the social security medical visit data is first obtained from a database, after which the relationship network of doctor-patient and medication-diagnosis relations may be established directly from it. The nodes of the relationship network include, but are not limited to, hospitals, doctors, patients, regions, diseases and drug items.
Further, after the social security medical visit data is obtained, sensitive-information processing may be applied to it. Sensitive-information processing means transforming the sensitive information in the data according to sensitive-information processing rules so as to protect sensitive private data. The relationship network of doctor-patient and medication-diagnosis relations can then be established based on the processed data. Preferably, all social security medical visit data mentioned below is data that has undergone sensitive-information processing; this is not repeated below.
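The text does not specify the sensitive-information processing rules. As one hypothetical example of such a rule, the Python sketch below replaces identifying fields with a salted hash, so that records belonging to the same patient can still be linked without exposing the original identifier. The field names, the salt and the 16-character truncation are all assumptions.

```python
import hashlib

# Hypothetical salt; a real deployment would store and manage this as a secret
SALT = b"example-salt"


def pseudonymize(value, salt=SALT):
    """Replace a string with a short, repeatable salted SHA-256 digest."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()[:16]


def desensitize(record, sensitive_fields=("patient_id", "patient_name")):
    """Return a copy of `record` with the sensitive string fields pseudonymized."""
    return {key: pseudonymize(value) if key in sensitive_fields else value
            for key, value in record.items()}


visit = {"patient_id": "P001", "patient_name": "Zhang San", "drug": "aspirin"}
safe_visit = desensitize(visit)
print(safe_visit["drug"], len(safe_visit["patient_id"]))
```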
具体地,参照图11,所述步骤Y10包括:Specifically, referring to FIG. 11, the step Y10 includes:
步骤Y11,对社保就诊数据进行数据处理;Step Y11, performing data processing on the social security treatment data;
步骤Y12,根据数据处理后的社保就诊数据建立医患、药诊的关系网络。In step Y12, a network of doctor-patient and drug diagnosis relationships is established according to the social security treatment data after the data processing.
在本实施例中,获取到社保就诊数据之后,先对社保就诊数据进行数据处理,该处理数据可以包括对数据进行去噪去干扰处理,以便于后续建立的关系网络更准确,对社保就诊数据进行数据处理之后,根据数据处理后的社保就诊数据建立医患、药诊的关系网络。In this embodiment, after obtaining the social security medical treatment data, data processing is performed on the social security medical treatment data, and the processed data may include denoising and interference processing on the data, so as to facilitate the subsequent establishment of the relationship network, and the social security medical treatment data. After the data processing, a network of doctor-patient and drug diagnosis relationships is established based on the social security treatment data after the data processing.
In this embodiment, the relationship network established from the social security medical visit data is illustrated in FIG. 15. As shown in FIG. 15, the relationship network includes a plurality of nodes, such as hospital, doctor, patient, region, disease, and drug item. As can be seen from FIG. 15, the nodes are connected by different relationships: for example, a doctor belongs to (BELONG) a hospital; a doctor diagnoses (DIAGNOSE) a disease; a patient buys (BUY) a drug item; and a patient has (HAS) a disease. Through the relationship network, a patient's medical-seeking behavior can be monitored from all angles. It should be understood that the relationship network shown in FIG. 15 is only a preferred schematic example and represents only a small part of the relationship network of this embodiment. In FIG. 15 every node is of a different type, so every node has a different attribute. In practice, however, the relationship network may include multiple nodes of the same attribute, such as multiple doctor nodes or multiple patient nodes, and nodes of the same attribute may also be connected to one another by different relationships. The nodes in this embodiment are therefore not limited to the examples above; when the social security medical visit data changes, different relationship networks and nodes are obtained, which are not enumerated exhaustively here.
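The relationship network described above can be pictured as a graph of typed nodes joined by labelled edges. The following is a minimal sketch under stated assumptions: the class name `RelationNetwork`, its methods, and the node identifiers are hypothetical and do not come from the application; only the relation labels BELONG, DIAGNOSE, BUY, and HAS are taken from FIG. 15.

```python
from collections import defaultdict


class RelationNetwork:
    """Hypothetical container: typed nodes joined by labelled, directed edges."""

    def __init__(self):
        self.node_type = {}            # node id -> attribute (node type)
        self.edges = defaultdict(set)  # source id -> {(relation, target id)}

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, source, relation, target):
        self.edges[source].add((relation, target))

    def neighbors(self, source, relation):
        """Targets reached from `source` through edges labelled `relation`."""
        return sorted(t for r, t in self.edges[source] if r == relation)

    def nodes_of_type(self, node_type):
        """All nodes sharing one attribute, e.g. every doctor node."""
        return sorted(n for n, t in self.node_type.items() if t == node_type)


net = RelationNetwork()
for node_id, node_type in [
    ("hospital_1", "hospital"), ("doctor_1", "doctor"), ("doctor_2", "doctor"),
    ("patient_1", "patient"), ("flu", "disease"), ("drug_a", "drug_item"),
]:
    net.add_node(node_id, node_type)

# Relation labels follow FIG. 15; the node ids are made up for illustration.
net.add_edge("doctor_1", "BELONG", "hospital_1")
net.add_edge("doctor_2", "BELONG", "hospital_1")
net.add_edge("doctor_1", "DIAGNOSE", "flu")
net.add_edge("patient_1", "HAS", "flu")
net.add_edge("patient_1", "BUY", "drug_a")
```

Keeping the node type separate from the edges makes it straightforward to collect all nodes of the same attribute, which is the grouping the later similarity step relies on.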
Step Y20: analyzing the group medical-seeking behavior of each node in the relationship network to extract the multi-dimensional group medical features corresponding to each node;
In this embodiment, after the doctor-patient and drug-diagnosis relationship network has been established from the social security medical visit data, the group medical-seeking behavior of each node in the relationship network is analyzed. Taking FIG. 15 again as an example, analyzing the group medical-seeking behavior of each node means analyzing the medical-seeking behavior exhibited in the relationship network, which amounts to analyzing patients' medical-seeking behavior, doctors' treatment behavior, the treatment methods used for a disease, and so on. Because the nodes in the relationship network are connected by different relationships, each node is no longer influenced along a single dimension but by the combined influence of the other nodes in the network. Analyzing the group medical-seeking behavior of each node therefore yields the multi-dimensional group medical features of that node, where a medical feature is a feature extracted from medical-seeking behavior. Taking the patient node in FIG. 15 as an example, the group medical-seeking behavior of the patient node includes the region where the patient is located, the hospitals the patient visits, the quantity and time of the drug items the patient purchases, the diseases the patient has, and the doctors the patient sees. Analyzing the patient's group medical-seeking behavior is thus equivalent to a combined analysis of the patient's region, the quantity and time of drug purchases, the patient's diseases, and so on. If the patient is found to have repeatedly purchased large quantities of drugs of many different kinds at different hospitals, the group medical features can be determined as: a large drug purchase volume, many drug types, and so on.
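As an illustration of how a patient node's visit records might be aggregated into group medical features, here is a minimal sketch. The record fields (`hospital`, `drug`, `quantity`), the feature names, and the function `patient_features` are assumptions made for this example; the application does not prescribe a record layout or a specific feature set.

```python
def patient_features(records):
    """Aggregate one patient's visit records into simple group features.

    Each record is assumed to be a dict with the hypothetical keys
    'hospital', 'drug' and 'quantity'.
    """
    hospitals = {r["hospital"] for r in records}
    drugs = {r["drug"] for r in records}
    total_quantity = sum(r["quantity"] for r in records)
    return {
        "hospital_count": len(hospitals),   # how many different hospitals
        "drug_type_count": len(drugs),      # how many different drug types
        "total_quantity": total_quantity,   # overall purchase volume
    }


# Made-up records: purchases at three hospitals, three different drugs.
records = [
    {"hospital": "H1", "drug": "drug_a", "quantity": 10},
    {"hospital": "H2", "drug": "drug_b", "quantity": 20},
    {"hospital": "H3", "drug": "drug_c", "quantity": 15},
]
features = patient_features(records)
```

A patient with high values in all three counts matches the pattern described above: large purchase volume, many drug types, many hospitals.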
Step Y30: inputting the extracted multi-dimensional group medical features into a preset classification model, so as to identify the fraud rate of each node according to the classification model.
After the multi-dimensional group medical features corresponding to each node have been extracted, they are input into a preset classification model, which identifies the fraud rate of each node. Specifically, referring to FIG. 12, step Y30 includes:
Step Y31: calculating, from the multi-dimensional group medical features corresponding to each node, the similarity between the multi-dimensional group medical features of nodes of the same attribute;
Step Y32: inputting the calculated similarities of the nodes into the preset classification model, so as to calculate the fraud rate of each node according to a fraud detection formula preset in the classification model.
That is, after the multi-dimensional group medical features corresponding to each node have been extracted, the similarity between the multi-dimensional group medical features of nodes of the same attribute is calculated. Nodes of the same attribute are, for example, two doctor nodes or two patient nodes.
In this embodiment, the similarity between the multi-dimensional group medical features of nodes of the same attribute is preferably calculated with one of the following algorithms:
1) Jaccard similarity (a generalized similarity measure):
Jaccard(A, B) = |A intersect B| / |A union B|
where intersect denotes the intersection, union denotes the union, and A and B denote nodes of the same attribute; for example, A and B are both doctor nodes in FIG. 15, or both patient nodes.
2) Euclidean similarity (similarity based on Euclidean distance):
Euclidean(A, B) = 1 - euclidean_distance(A, B)
where A and B denote nodes of the same attribute.
The two algorithms listed above for calculating the similarity between the multi-dimensional group medical features of nodes of the same attribute are merely exemplary. Other algorithms that those skilled in the art may propose according to their specific needs, using the technical idea of the present application, fall within the scope of protection of the present application and are not enumerated exhaustively here.
With the above similarity formulas, the similarity between the multi-dimensional group medical features of any two nodes of the same attribute can be determined.
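The two similarity formulas can be sketched as follows. This is an illustrative reading, not the application's implementation: it assumes that Jaccard similarity is applied to set-valued features (for example, the set of drug items linked to a node) and that Euclidean similarity is applied to numeric feature vectors that have been scaled so that their distance does not exceed 1; without such scaling, `1 - distance` can become negative. The function names are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard(A, B) = |A intersect B| / |A union B| for two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty feature sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def euclidean_similarity(a, b):
    """Euclidean(A, B) = 1 - euclidean_distance(A, B).

    Assumes both vectors have the same length and are scaled so that the
    distance lies in [0, 1].
    """
    return 1 - math.dist(a, b)


# Two patient nodes described by the drug items they bought (made-up data).
drug_overlap = jaccard({"drug_a", "drug_b", "drug_c"}, {"drug_b", "drug_c", "drug_d"})

# Two patient nodes described by scaled numeric features (made-up data).
numeric_similarity = euclidean_similarity([0.0, 0.0], [0.3, 0.4])
```

In the first example two of the four distinct drug items are shared, giving a Jaccard similarity of 0.5; in the second the distance is 0.5, giving a Euclidean similarity of about 0.5.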
After the similarity between the multi-dimensional group medical features of nodes of the same attribute has been determined, the calculated similarities are input into the preset classification model, and the fraud rate of each node is calculated according to the fraud detection formula preset in the classification model. The fraud detection formula preferably includes: the formula of the KNN (k-Nearest Neighbor) algorithm, with K set to 5; the formula of the bisecting K-means algorithm; the formula of the Shewhart methods; and so on. Because the formulas of these algorithms are all existing formulas, the calculation process is not described in detail here.
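The application does not spell out how the KNN formula turns similarities into a fraud rate, so the following is only one plausible reading, offered as a sketch: the fraud rate of a node is taken to be the share of its K = 5 most similar same-attribute nodes that already carry a fraud label. The function `knn_fraud_rate`, the label encoding (1 for fraud, 0 otherwise), and the sample data are all assumptions.

```python
def knn_fraud_rate(similarities, labels, k=5):
    """Share of the k most similar labelled nodes that are marked as fraud.

    similarities: node id -> similarity to the node being scored
    labels:       node id -> 1 if the node is known fraud, else 0
    """
    # Highest similarity first; keep the k nearest neighbours.
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    if not nearest:
        return 0.0  # assumption: no neighbours means no evidence of fraud
    return sum(labels[n] for n in nearest) / len(nearest)


# Made-up similarities between one patient node and six labelled patient nodes.
similarities = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.1}
labels = {"a": 1, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}
rate = knn_fraud_rate(similarities, labels)
```

With K = 5 the least similar node `f` is ignored, and three of the five remaining neighbours are labelled as fraud, so the sketch yields a fraud rate of 0.6.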
Further, in order to improve the accuracy with which the classification model calculates the fraud rate of a node, in this embodiment, after step Y32, the method for identifying social security fraud further includes:
Step A: verifying the fraud rate of each node, so as to add the verification conclusion to the fraud rate of each node;
Step B: feeding the fraud rates with the added verification conclusions back into the classification model, so as to train the classification model.
That is, after the fraud rate of each node has been calculated according to the fraud detection formula preset in the classification model, the fraud rate of each node may also be verified. In this embodiment, the verification is preferably an offline review. After the fraud rate of each node has been verified, the verification conclusion is added to that fraud rate, and the fraud rates with the added verification conclusions are fed back into the classification model to train it, so that the classification model subsequently identifies node fraud rates more accurately.
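Steps A and B can be pictured as attaching the offline review outcome to each scored node and collecting the result as labelled samples for the next round of training. The sketch below is an assumption about the data flow only; the function `attach_verification` and the field names are hypothetical, and the retraining itself is not shown because the application does not describe it.

```python
def attach_verification(fraud_rates, verdicts):
    """Pair each node's fraud rate with its offline verification conclusion.

    fraud_rates: node id -> fraud rate produced by the classification model
    verdicts:    node id -> True if the offline review confirmed fraud
    Nodes that have not been reviewed yet are skipped.
    """
    return [
        {"node": node, "fraud_rate": rate, "verified_fraud": verdicts[node]}
        for node, rate in sorted(fraud_rates.items())
        if node in verdicts
    ]


# Made-up scores and review outcomes.
fraud_rates = {"patient_1": 0.8, "patient_2": 0.2, "patient_3": 0.6}
verdicts = {"patient_1": True, "patient_2": False}
training_samples = attach_verification(fraud_rates, verdicts)
```

The resulting list holds only reviewed nodes, each carrying both the model's score and the reviewer's conclusion, which is the kind of labelled sample a later training pass would consume.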
The relationship-network-based identification of social security fraud in this embodiment works at the group level: a medical-visit relationship network is built from the visit behavior of the group, and an algorithmic model is designed to identify fraud from the group dimension and obtain the fraud rate of each node, thereby identifying social security behavior at the group level. It can be understood that, when a user's social security medical visit data is analyzed, if the fraud rates of many nodes are found to be high and only a few nodes have low fraud rates, the user can be considered to have committed social security fraud. Compared with a mechanism triggered by a single rule, determining whether a user has committed social security fraud from group visit behavior identifies such fraud more accurately.
In the method for identifying social security fraud proposed in this embodiment, a doctor-patient and drug-diagnosis relationship network is first established from the social security medical visit data; the group medical-seeking behavior of each node in the relationship network is then analyzed to extract multi-dimensional group medical features; and finally the extracted multi-dimensional group medical features are input into a preset classification model, which identifies the fraud rate of each node. This solution identifies social security fraud from multiple dimensions and angles and, compared with traditional single-rule identification, does so more accurately.
Further, in order to improve the accuracy of identifying social security fraud, another embodiment of the method for identifying fraudulent behavior of the present application is proposed on the basis of the previous embodiment.
In this embodiment, referring to FIG. 13, before step Y20, the method for identifying social security fraud further includes:
Step Y40: determining, in the relationship network, the external factor features to be supplemented, and acquiring the external factor features from the Internet;
Step Y50: generating new nodes based on the acquired external factor features;
Step Y60: adding the new nodes to the relationship network to update the relationship network.
In this embodiment, the external factor features to be supplemented are first determined in the relationship network and acquired from the Internet. An external factor feature is external information associated with a node; for example, if the node is a hospital, the external factor features are hospital-related information such as the hospital's address. After the external factor features have been acquired, new nodes are generated from them and added to the relationship network to update it. The nodes of the updated relationship network are thus more detailed, and the subsequent identification of each node's fraud rate is more accurate.
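Steps Y40 to Y60 can be sketched as turning one piece of external information into a new node and linking it to the node it describes. The function `add_external_factor`, the plain dict and list used to hold the network, the `HAS_<FACTOR>` relation naming, and the sample address are all hypothetical; fetching the information from the Internet is outside the scope of this sketch.

```python
def add_external_factor(nodes, edges, node_id, factor_name, factor_value):
    """Create a new node from an external factor and link it to `node_id`.

    nodes: dict mapping node id -> {'type': ..., 'value': ...}
    edges: list of (source id, relation, target id) tuples
    Returns the id of the newly created node.
    """
    new_id = f"{node_id}:{factor_name}"
    nodes[new_id] = {"type": factor_name, "value": factor_value}
    # The relation label is an illustrative convention, not taken from FIG. 15.
    edges.append((node_id, "HAS_" + factor_name.upper(), new_id))
    return new_id


# A one-node network; the address value is a made-up placeholder.
nodes = {"hospital_1": {"type": "hospital", "value": None}}
edges = []
new_id = add_external_factor(nodes, edges, "hospital_1", "address", "1 Example Road")
```

After the call the network contains a second node holding the address and one edge from the hospital to it, so later feature extraction can take the hospital's location into account.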
It is worth noting that, although each algorithm involved in the present application is an existing algorithm, the complete procedure used to identify social security fraud differs from existing approaches to identifying social security fraud, and the present application overcomes the low accuracy of existing social security fraud identification.
Embodiment 6:
The present application further provides an apparatus for identifying fraudulent behavior.
Referring to FIG. 14, FIG. 14 is a schematic diagram of the functional modules of a first embodiment of the apparatus 100 for identifying fraudulent behavior of the present application.
In this embodiment, the apparatus 100 for identifying fraudulent behavior includes:
an establishing module 10, configured to establish a doctor-patient and drug-diagnosis relationship network based on social security medical visit data, wherein the relationship network includes a plurality of nodes and the nodes are connected by different relationships;
an analysis and extraction module 20, configured to analyze the group medical-seeking behavior of each node in the relationship network to extract the multi-dimensional group medical features corresponding to each node;
an input and identification module 30, configured to input the extracted multi-dimensional group medical features into a preset classification model, so as to identify the fraud rate of each node according to the classification model.
In this embodiment, the relationship network established from the social security medical visit data is illustrated in FIG. 15. As shown in FIG. 15, the relationship network includes a plurality of nodes, such as hospital, doctor, patient, region, disease, and drug item. As can be seen from FIG. 15, the nodes are connected by different relationships: for example, a doctor belongs to (BELONG) a hospital; a doctor diagnoses (DIAGNOSE) a disease; a patient buys (BUY) a drug item; and a patient has (HAS) a disease. Through the relationship network, a patient's medical-seeking behavior can be monitored from all angles.
FIG. 16 is a schematic diagram of a server hosting the Ping An Brain big data platform according to an embodiment of the present application. As shown in FIG. 16, the server 21 of this embodiment includes a processor 210, a memory 211, and computer-readable instructions 212 stored in the memory 211 and executable on the processor 210, for example a program that performs the anti-fraud identification method. When executing the computer-readable instructions 212, the processor 210 implements the steps of the anti-fraud identification method embodiments described above, for example steps 101 to 103 shown in FIG. 1. Alternatively, when executing the computer-readable instructions 212, the processor 210 implements the functions of the modules/units of the apparatus embodiments described above, for example the functions of modules 510 to 550 shown in FIG. 6, of modules 01 to 02 shown in FIG. 9, and of modules 10 to 30 shown in FIG. 14.
By way of example, the computer-readable instructions 212 may be divided into one or more modules/units, which are stored in a computer-readable storage medium, for example the memory 211, and executed by the processor 210 to carry out the present application.
The server 21 may be a computing device such as a cloud server hosting the Ping An Brain big data platform. The server may include, but is not limited to, the processor 210 and the memory 211.

Claims (20)

  1. An anti-fraud identification method, comprising:
    determining a target event;
    extracting target data related to the target event;
    processing the target data using at least two of the following methods: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;
    wherein the method for constructing a decision model comprises:
    acquiring rule template data, and extracting each variable object and each template sample from the rule template data;
    performing cluster analysis on the variable objects to obtain a clustering result;
    matching the clustering result with each template sample according to the rule template data, and taking the matched clustering result as a first feature;
    calculating the black-sample probability of each variable object, and taking the black-sample probabilities of the variable objects as a second feature;
    constructing a decision model from the first feature and the second feature;
    the method for identifying fraud data comprises:
    training on a preset training data set using a preset continuous model training approach to establish a continuous anti-fraud model;
    performing training on the data to be tested based on the continuous anti-fraud model, and identifying the fraud data in the data to be tested;
    the method for identifying fraudulent behavior comprises:
    establishing a doctor-patient and drug-diagnosis relationship network based on social security medical visit data, wherein the relationship network includes a plurality of nodes and the nodes are connected by different relationships;
    analyzing the group medical-seeking behavior of each node in the relationship network to extract the multi-dimensional group medical features corresponding to each node;
    inputting the extracted multi-dimensional group medical features into a preset classification model, so as to identify the fraud rate of each node according to the classification model;
    wherein the target data includes at least two of the data to be tested, the rule template data, and the social security medical visit data.
  2. The anti-fraud identification method according to claim 1, wherein, before the constructing a decision model from the first feature and the second feature, the method for constructing a decision model further comprises:
    mapping each variable object to a predefined label according to a preset algorithm;
    matching the labels with each template sample according to the rule template data, and taking the matched labels as a third feature;
    and wherein the constructing a decision model from the first feature and the second feature specifically comprises:
    constructing a decision model from the first feature, the second feature, and the third feature.
  3. The anti-fraud identification method according to claim 2, wherein the constructing a decision model from the first feature, the second feature, and the third feature comprises:
    establishing an original node;
    acquiring the result type of each template sample according to the rule template data;
    traversing and reading the first feature, the second feature, and the third feature respectively to generate read records;
    calculating the split purity of each read record according to the result types of the template samples, and determining a split point according to the split purity;
    acquiring the feature corresponding to the split point, and establishing a new node.
  4. The anti-fraud identification method according to any one of claims 1 to 3, wherein the performing cluster analysis on the variable objects to obtain a clustering result comprises:
    randomly selecting a plurality of variable objects from the variable objects as first cluster centers, each first cluster center corresponding to one cluster;
    calculating the distance from each variable object to each first cluster center;
    partitioning the variable objects according to the calculation results, assigning each variable object to the cluster corresponding to the nearest first cluster center;
    calculating a second cluster center for each cluster after the partitioning;
    determining whether the distance between the first cluster center and the second cluster center of each cluster is less than a preset threshold; if so, outputting the clusters as the clustering result; if not, replacing the first cluster center of each corresponding cluster with its second cluster center, and continuing with the step of calculating the distance from each variable object to each first cluster center.
  5. The anti-fraud identification method according to claim 1, wherein the continuous anti-fraud model is a direct continuous model;
    and the training on a preset training data set using a preset continuous model training approach to establish a continuous anti-fraud model comprises:
    splitting the preset training data set into a training set and a test set according to a preset ratio;
    retaining the test set, and further splitting the training set into two sub-training sets according to a preset ratio, the two sub-training sets serving respectively as the training set and the test set of the next-layer model;
    repeating the splitting of the training set in sequence until a preset number of times is reached;
    training models on the resulting multi-layer training sets respectively using a preset classical model, and testing on the retained multi-layer test sets, to establish the direct continuous model.
  6. The anti-fraud identification method according to claim 1, wherein the continuous anti-fraud model is an optimized continuous model;
    and the training on a preset training data set using a preset continuous model training approach to establish a continuous anti-fraud model comprises:
    splitting the preset training data set into a training set and a test set according to a preset ratio;
    retaining the test set, and further splitting the training set into two sub-training sets according to a preset ratio, the two sub-training sets serving respectively as the lower-layer training set and the lower-layer test set of the next-layer model;
    training a model on the lower-layer training set and testing it on the lower-layer test set, obtaining positive samples according to the test results while retaining the trained model, and taking the obtained positive samples as a new training set;
    repeating the steps of splitting the training set and testing in sequence until the number of positive samples obtained is zero or the multiple training models have all been established;
    collecting and organizing the established multiple training models to obtain the optimized continuous model.
  7. The anti-fraud identification method according to claim 1, wherein the step of establishing a doctor-patient and drug-diagnosis relationship network based on social security medical visit data comprises:
    performing data processing on the social security medical visit data;
    establishing the doctor-patient and drug-diagnosis relationship network from the processed social security medical visit data.
  8. The anti-fraud identification method according to claim 1, wherein the step of inputting the extracted multi-dimensional group medical features into a preset classification model, so as to identify the fraud rate of each node according to the classification model, comprises:
    calculating, from the multi-dimensional group medical features corresponding to each node, the similarity between the multi-dimensional group medical features of nodes of the same attribute;
    inputting the calculated similarities of the nodes into the preset classification model, so as to calculate the fraud rate of each node according to a fraud detection formula preset in the classification model.
  9. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    determining a target event;
    extracting target data related to the target event;
    processing the target data using at least two of the following methods: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;
    wherein the method for constructing a decision model comprises:
    acquiring rule template data, and extracting each variable object and each template sample from the rule template data;
    performing cluster analysis on the variable objects to obtain a clustering result;
    matching the clustering result with each template sample according to the rule template data, and taking the matched clustering result as a first feature;
    calculating the black-sample probability of each variable object, and taking the black-sample probabilities of the variable objects as a second feature;
    constructing a decision model from the first feature and the second feature;
    the method for identifying fraud data comprises:
    training on a preset training data set using a preset continuous model training approach to establish a continuous anti-fraud model;
    performing training on the data to be tested based on the continuous anti-fraud model, and identifying the fraud data in the data to be tested;
    the method for identifying fraudulent behavior comprises:
    establishing a doctor-patient and drug-diagnosis relationship network based on social security medical visit data, wherein the relationship network includes a plurality of nodes and the nodes are connected by different relationships;
    analyzing the group medical-seeking behavior of each node in the relationship network to extract the multi-dimensional group medical features corresponding to each node;
    inputting the extracted multi-dimensional group medical features into a preset classification model, so as to identify the fraud rate of each node according to the classification model;
    wherein the target data includes at least two of the data to be tested, the rule template data, and the social security medical visit data.
  10. The computer-readable storage medium according to claim 9, wherein, before the constructing a decision model from the first feature and the second feature, the method for constructing a decision model further comprises:
    mapping each variable object to a predefined label according to a preset algorithm;
    matching the labels with each template sample according to the rule template data, and taking the matched labels as a third feature;
    and wherein the constructing a decision model from the first feature and the second feature specifically comprises:
    constructing a decision model from the first feature, the second feature, and the third feature.
  11. The computer-readable storage medium according to claim 10, wherein the constructing of the decision model from the first feature, the second feature, and the third feature comprises:
    Establishing an original node;
    Obtaining the result type of each template sample according to the rule template data;
    Traversing and reading the first feature, the second feature, and the third feature respectively to generate read records;
    Calculating the split purity of each read record according to the result types of the template samples, and determining a split point according to the split purity;
    Obtaining the feature corresponding to the split point and establishing a new node.
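The split-point selection in claim 11 can be illustrated with a short Python sketch. The claim does not name a purity measure, so the sketch assumes Gini impurity and binary equality splits; the names `gini` and `best_split` and the feature keys `first`, `second`, and `third` are hypothetical and are not taken from the application.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of result types (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(samples, labels):
    """Return (feature, value, weighted_impurity) for the purest binary split.

    samples: list of dicts mapping a feature name to its value
    labels:  list of result types, one per sample
    Assumption: a split sends samples with feature == value to one side
    and all remaining samples to the other side.
    """
    best = None
    n = len(samples)
    for feature in sorted(samples[0]):
        for value in sorted({s[feature] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[feature] == value]
            right = [y for s, y in zip(samples, labels) if s[feature] != value]
            if not left or not right:
                continue  # a split must leave samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, value, score)
    return best


# Hypothetical template samples described by three features.
samples = [
    {"first": 0, "second": 0, "third": "a"},
    {"first": 0, "second": 1, "third": "b"},
    {"first": 1, "second": 0, "third": "a"},
    {"first": 1, "second": 1, "third": "b"},
]
labels = ["normal", "normal", "fraud", "fraud"]
split = best_split(samples, labels)
```

A new node would then be created for the feature named in `split`, and the same search would be repeated on each side of the split.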
  12. The computer-readable storage medium according to any one of claims 9 to 11, wherein the performing of cluster analysis on the variable objects to obtain a clustering result comprises:
    Randomly selecting a plurality of variable objects from the variable objects as first cluster centers, each first cluster center corresponding to one cluster;
    Calculating the distance from each variable object to each first cluster center;
    Assigning each variable object, according to the calculation result, to the cluster corresponding to the nearest first cluster center;
    Calculating a second cluster center for each cluster after the assignment;
    Determining whether the distance between the first cluster center and the second cluster center of each cluster is less than a preset threshold; if so, outputting the clusters as the clustering result; if not, replacing the first cluster center of each cluster with the corresponding second cluster center and returning to the step of calculating the distance from each variable object to each first cluster center.
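The clustering loop in claim 12 follows the familiar k-means pattern, and a minimal Python sketch of it is given here. Several details are assumptions, not taken from the application: variable objects are represented as numeric tuples, distance is Euclidean, the second cluster center is the mean of the cluster members, and the function name `cluster_variable_objects` and the `max_iter` safeguard are hypothetical.

```python
import math
import random


def cluster_variable_objects(points, k, threshold=1e-6, seed=0, max_iter=100):
    """Cluster numeric tuples by the iterative procedure of claim 12."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly chosen first cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign every object to the cluster of its nearest first center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Compute the second cluster center (assumed: mean) of each cluster.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        # Stop once every center has moved by less than the preset threshold.
        if all(math.dist(a, b) < threshold for a, b in zip(centers, new_centers)):
            return clusters
        centers = new_centers  # second centers replace the first centers
    return clusters


points = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
result = cluster_variable_objects(points, k=2)
```

The `max_iter` cap is only a guard against non-termination in the sketch; the claim itself repeats until the threshold condition is met.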
  13. A server carrying the Ping An Brain big data platform, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    Determining a target event;
    Extracting target data related to the target event;
    Processing the target data using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;
    The method for constructing a decision model comprises:
    Obtaining rule template data, and extracting each variable object and each template sample from the rule template data;
    Performing cluster analysis on the variable objects to obtain a clustering result;
    Matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as a first feature;
    Calculating the black-sample probability of each variable object, and using the black-sample probabilities of the variable objects as a second feature;
    Constructing a decision model from the first feature and the second feature;
    The method for identifying fraud data comprises:
    Training on a preset training data set using a preset continuous-model training method to establish a continuous anti-fraud model;
    Training on the data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested;
    The method for identifying fraudulent behavior comprises:
    Establishing a doctor-patient and drug-diagnosis relationship network based on social security medical-visit data, wherein the relationship network comprises a plurality of nodes, and the nodes are connected by different relationships;
    Analyzing the group medical-treatment behavior of each node in the relationship network to extract the multi-dimensional group medical-treatment features corresponding to each node;
    Inputting the extracted multi-dimensional group medical-treatment features into a preset classification model to identify the fraud rate of each node according to the classification model;
    Wherein the target data comprises at least two of the data to be tested, the rule template data, and the social security medical-visit data.
  14. The server according to claim 13, wherein, before the constructing of the decision model from the first feature and the second feature, the method for constructing a decision model further comprises:
    Mapping each variable object to a predefined label according to a preset algorithm;
    Matching the labels with each template sample according to the rule template data, and using the matched labels as a third feature;
    The constructing of the decision model from the first feature and the second feature specifically comprises:
    Constructing the decision model from the first feature, the second feature, and the third feature.
  15. The server according to claim 14, wherein the constructing of the decision model from the first feature, the second feature, and the third feature comprises:
    Establishing an original node;
    Obtaining the result type of each template sample according to the rule template data;
    Traversing and reading the first feature, the second feature, and the third feature respectively to generate read records;
    Calculating the split purity of each read record according to the result types of the template samples, and determining a split point according to the split purity;
    Obtaining the feature corresponding to the split point and establishing a new node.
  16. The server according to any one of claims 13 to 15, wherein the performing of cluster analysis on the variable objects to obtain a clustering result comprises:
    Randomly selecting a plurality of variable objects from the variable objects as first cluster centers, each first cluster center corresponding to one cluster;
    Calculating the distance from each variable object to each first cluster center;
    Assigning each variable object, according to the calculation result, to the cluster corresponding to the nearest first cluster center;
    Calculating a second cluster center for each cluster after the assignment;
    Determining whether the distance between the first cluster center and the second cluster center of each cluster is less than a preset threshold; if so, outputting the clusters as the clustering result; if not, replacing the first cluster center of each cluster with the corresponding second cluster center and returning to the step of calculating the distance from each variable object to each first cluster center.
  17. The server according to claim 13, wherein the continuous anti-fraud model is a direct continuous model;
    The training on a preset training data set using a preset continuous-model training method to establish a continuous anti-fraud model comprises:
    Splitting the preset training data set into a training set and a test set according to a preset ratio;
    Retaining the test set, and further splitting the training set according to the preset ratio into two sub-training sets, which serve as the training set and the test set of the next-layer model, respectively;
    Repeating the splitting of the training set until a preset number of times is reached;
    Training models with a preset classical model on each layer's training set, and testing them on the retained test set of each layer, to establish the direct continuous model.
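The layered splitting in claim 17 can be sketched in Python as below. This is only an illustration: the claim does not say how records are ordered or selected, so the sketch assumes a simple positional cut, and the function name `layered_splits` and its parameters `train_ratio` and `layers` are hypothetical. Fitting the "preset classical model" on each layer is left out because the claim does not identify that model.

```python
def layered_splits(data, train_ratio=0.8, layers=3):
    """Return one (training set, test set) pair per layer.

    Each layer keeps its test set and hands its training set to the
    next layer, where it is split again with the same ratio.
    """
    splits = []
    current = list(data)
    for _ in range(layers):
        cut = int(len(current) * train_ratio)
        train, test = current[:cut], current[cut:]
        splits.append((train, test))  # the test set of this layer is retained
        current = train  # the next layer re-splits this training set
    return splits


records = list(range(100))  # stand-in for a preset training data set
splits = layered_splits(records, train_ratio=0.8, layers=3)
```

Each pair in `splits` would then be used to train one model and evaluate it on that layer's retained test set.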
  18. An apparatus for constructing a decision model, comprising:
    An extraction module, configured to obtain rule template data and to extract each variable object and each template sample from the rule template data;
    A clustering module, configured to perform cluster analysis on the variable objects to obtain a clustering result;
    A first feature module, configured to match the clustering result with each template sample according to the rule template data and to use the matched clustering result as a first feature;
    A second feature module, configured to calculate the black-sample probability of each variable object and to use the black-sample probabilities of the variable objects as a second feature;
    A construction module, configured to construct a decision model from the first feature and the second feature.
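The second feature in claim 18 can be illustrated with a short Python sketch. The claim does not define how the black-sample probability is computed; the sketch assumes it is the share of template samples containing a given variable object that are labeled as black (fraudulent) samples. The function name `black_sample_probability` and the sample layout are hypothetical.

```python
from collections import Counter


def black_sample_probability(samples):
    """Map each variable object to its assumed black-sample probability.

    samples: iterable of (variable_objects, is_black) pairs, where
    variable_objects is a collection of hashable objects and is_black
    marks the template sample as a black (fraudulent) sample.
    """
    total = Counter()
    black = Counter()
    for variable_objects, is_black in samples:
        for obj in set(variable_objects):  # count each object once per sample
            total[obj] += 1
            if is_black:
                black[obj] += 1
    return {obj: black[obj] / total[obj] for obj in total}


# Hypothetical template samples built from two variable objects.
template_samples = [
    ({"a", "b"}, True),
    ({"a"}, False),
    ({"b"}, True),
]
probabilities = black_sample_probability(template_samples)
```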
  19. An apparatus for identifying fraud data, comprising:
    A modeling module, configured to train on a preset training data set using a preset continuous-model training method to establish a continuous anti-fraud model;
    An identification module, configured to train on the data to be tested based on the continuous anti-fraud model and to identify fraud data in the data to be tested.
  20. An apparatus for identifying fraudulent behavior, comprising:
    An establishing module, configured to establish a doctor-patient and drug-diagnosis relationship network based on social security medical-visit data, wherein the relationship network comprises a plurality of nodes, and the nodes are connected by different relationships;
    An analysis and extraction module, configured to analyze the group medical-treatment behavior of each node in the relationship network to extract the multi-dimensional group medical-treatment features corresponding to each node;
    An input and identification module, configured to input the extracted multi-dimensional group medical-treatment features into a preset classification model to identify the fraud rate of each node according to the classification model.
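The relationship network and per-node features in claim 20 can be sketched in Python as below. The claim does not specify the record layout or the features, so everything concrete here is an assumption: visit records are `(patient, doctor, drug)` tuples, the features are simple graph degrees, and the names `build_network` and `group_features` are hypothetical. The preset classification model that turns features into a fraud rate is not shown.

```python
from collections import defaultdict


def build_network(visits):
    """Map each node to the set of nodes it is linked to.

    visits: iterable of (patient, doctor, drug) tuples. Node ids are
    prefixed with their type so ids of different types cannot collide.
    """
    edges = defaultdict(set)
    for patient, doctor, drug in visits:
        p, d, m = f"patient:{patient}", f"doctor:{doctor}", f"drug:{drug}"
        for a, b in ((p, d), (d, m), (p, m)):
            edges[a].add(b)
            edges[b].add(a)
    return edges


def group_features(edges):
    """Per-node features: own degree and mean degree of the neighbors."""
    features = {}
    for node, neighbors in edges.items():
        features[node] = {
            "degree": len(neighbors),
            "mean_neighbor_degree": sum(len(edges[n]) for n in neighbors)
            / len(neighbors),
        }
    return features


# Hypothetical social security visit records.
visits = [("p1", "d1", "m1"), ("p2", "d1", "m1"), ("p3", "d1", "m2")]
features = group_features(build_network(visits))
```

The resulting feature dictionaries stand in for the multi-dimensional group features that would be passed to the classification model.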
PCT/CN2018/077230 2017-07-24 2018-02-26 Anti-fraud identification method, storage medium, server carrying ping an brain and device WO2019019630A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710605531.7 2017-07-24
CN201710605531.7A CN107785058A (en) 2017-07-24 2017-07-24 Anti- fraud recognition methods, storage medium and the server for carrying safety brain

Publications (1)

Publication Number Publication Date
WO2019019630A1 (en) 2019-01-31

Family

ID=61437479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077230 WO2019019630A1 (en) 2017-07-24 2018-02-26 Anti-fraud identification method, storage medium, server carrying ping an brain and device

Country Status (2)

Country Link
CN (1) CN107785058A (en)
WO (1) WO2019019630A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021179907A1 (en) * 2020-03-11 2021-09-16 清华大学 Method and apparatus for generating risk assessment model and risk assessment method and apparatus

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428132B (en) * 2018-03-15 2020-12-29 创新先进技术有限公司 Fraud transaction identification method, device, server and storage medium
CN108038701A (en) * 2018-03-20 2018-05-15 杭州恩牛网络技术有限公司 A kind of integrated study is counter to cheat test method and system
CN108734479A (en) * 2018-04-12 2018-11-02 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN108334647A (en) * 2018-04-12 2018-07-27 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN108665270A (en) * 2018-04-17 2018-10-16 平安科技(深圳)有限公司 Data diddling recognition methods, device, computer equipment and storage medium
CN108876166A (en) * 2018-06-27 2018-11-23 平安科技(深圳)有限公司 Financial risk authentication processing method, device, computer equipment and storage medium
CN109003191A (en) * 2018-07-12 2018-12-14 上海金仕达卫宁软件科技有限公司 The anti-fraud template automatic generation method of medical treatment and system based on hierarchical clustering
CN109101562B (en) * 2018-07-13 2023-07-21 中国平安人寿保险股份有限公司 Method, device, computer equipment and storage medium for searching target group
CN109166030A (en) * 2018-08-01 2019-01-08 深圳微言科技有限责任公司 A kind of anti-fraud solution and system
CN109064032A (en) * 2018-08-06 2018-12-21 国网浙江杭州市临安区供电有限公司 The small micro- power honesty risk surveillance managing and control system of power supply station based on enterprise's cloud platform
CN109413031B (en) * 2018-08-31 2022-04-15 深圳壹账通智能科技有限公司 Anti-fraud model construction method, device, equipment and readable storage medium
CN109284371B (en) * 2018-09-03 2023-04-18 平安证券股份有限公司 Anti-fraud method, electronic device, and computer-readable storage medium
CN109242307B (en) * 2018-09-04 2022-02-01 中国光大银行股份有限公司信用卡中心 Anti-fraud policy analysis method, server, electronic device and storage medium
CN109409502A (en) * 2018-09-26 2019-03-01 深圳壹账通智能科技有限公司 Generation method, device, equipment and the storage medium of anti-fraud model
CN109544150A (en) * 2018-10-09 2019-03-29 阿里巴巴集团控股有限公司 A kind of method of generating classification model and device calculate equipment and storage medium
CN109545312B (en) * 2018-10-23 2023-08-08 平安医疗健康管理股份有限公司 Drug store statement risk detection method and device
CN109599153B (en) * 2018-11-14 2021-06-29 金色熊猫有限公司 Medical data tracking method and device, storage medium and electronic equipment
CN109598628B (en) * 2018-11-30 2022-09-20 平安医疗健康管理股份有限公司 Method, device and equipment for identifying medical insurance fraud behaviors and readable storage medium
CN109816397B (en) * 2018-12-03 2021-05-25 北京奇艺世纪科技有限公司 Fraud discrimination method, device and storage medium
CN109919780A (en) * 2019-01-23 2019-06-21 平安科技(深圳)有限公司 Claims Resolution based on figure computing technique is counter to cheat method, apparatus, equipment and storage medium
CN110008349B (en) * 2019-02-01 2020-11-10 创新先进技术有限公司 Computer-implemented method and apparatus for event risk assessment
CN109903053B (en) * 2019-03-01 2020-01-07 成都新希望金融信息有限公司 Anti-fraud method for behavior recognition based on sensor data
CN109948806A (en) * 2019-03-28 2019-06-28 医渡云(北京)技术有限公司 Decision model optimization method, device, storage medium and equipment
CN110263106B (en) * 2019-06-25 2020-02-21 中国人民解放军国防科技大学 Collaborative public opinion fraud detection method and device
CN111047428B (en) * 2019-12-05 2023-08-08 深圳索信达数据技术有限公司 Bank high-risk fraud customer identification method based on small amount of fraud samples
CN111738747A (en) * 2020-06-24 2020-10-02 中诚信征信有限公司 Method and device for anti-fraud decision
CN113837874B (en) * 2021-11-22 2022-04-12 北京芯盾时代科技有限公司 Data identification method and device, storage medium and electronic equipment
CN115422016B (en) * 2022-11-05 2023-01-20 北京淇瑀信息科技有限公司 Data monitoring method and device based on server-side relation network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015002630A2 (en) * 2012-07-24 2015-01-08 Deloitte Development Llc Fraud detection methods and systems
CN104408547A (en) * 2014-10-30 2015-03-11 浙江网新恒天软件有限公司 Data-mining-based detection method for medical insurance fraud behavior
CN105279382A (en) * 2015-11-10 2016-01-27 成都数联易康科技有限公司 Medical insurance abnormal data on-line intelligent detection method
CN106384282A (en) * 2016-06-14 2017-02-08 平安科技(深圳)有限公司 Method and device for building decision-making model
CN107657453A (en) * 2016-07-25 2018-02-02 平安科技(深圳)有限公司 Cheat recognition methods and the device of data
CN107657536A (en) * 2017-02-20 2018-02-02 平安科技(深圳)有限公司 The recognition methods of social security fraud and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682067B (en) * 2016-11-08 2018-05-01 浙江邦盛科技有限公司 A kind of anti-fake monitoring system of machine learning based on transaction data
CN106600423A (en) * 2016-11-18 2017-04-26 云数信息科技(深圳)有限公司 Machine learning-based car insurance data processing method and device and car insurance fraud identification method and device

Also Published As

Publication number Publication date
CN107785058A (en) 2018-03-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18838210

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18838210

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/08/2020)
