CN118333236A

CN118333236A - Enterprise behavior fraud risk prediction method and device and electronic equipment

Info

Publication number: CN118333236A
Application number: CN202410623264.6A
Authority: CN
Inventors: 顾凌云; 张涛; 胡诗卉; 潘峻
Original assignee: Shanghai IceKredit Inc
Current assignee: Shanghai IceKredit Inc
Priority date: 2024-05-20
Filing date: 2024-05-20
Publication date: 2024-07-12

Abstract

The disclosure relates to an enterprise behavior fraud risk prediction method, an enterprise behavior fraud risk prediction device, and electronic equipment. The method comprises: converting the structured data in enterprise historical behavior data into a structured data text language with a uniform format according to a format preset for the structured data; and inputting the structured data text language and the unstructured data in the enterprise historical behavior data into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result, wherein the behavior fraud risk prediction large language model is pre-trained through self-supervised learning and then fine-tuned using a text corpus. The behavior fraud risk prediction large language model can mine the meaning behind the data and overcomes the limitation of predicting risk from a single feature, so that the behavioral fraud risk of an enterprise is measured more comprehensively. Because the structured data is converted directly into language and supplemented with the unstructured text data, both can serve as input to a single model, and two separate models need not be built. The model can also output behavioral risk prediction results, which makes its output easy to understand and use.

Description

Enterprise behavior fraud risk prediction method and device and electronic equipment

Technical Field

The disclosure relates to the technical field of behavior determination, in particular to a method and a device for predicting risk of enterprise behavior fraud and electronic equipment.

Background

When enterprises predict the risk of behavioral fraud, no complete and unified behavioral data is available to support the prediction, so the risk is mainly judged either by experience or by machine learning. When judging by experience, there is not enough enterprise data for quantitative evaluation, so rules are usually set manually according to experience and used to build a fraud risk scorecard for the enterprise. When judging by machine learning, all enterprise information is converted into numerical or other typed data, a machine learning model is built on it, and the model is used to predict the enterprise's behavioral fraud risk. Neither method can accurately predict the behavioral fraud risk of an enterprise.

Disclosure of Invention

In order to solve the technical problem that the enterprise behavior fraud risk cannot be accurately predicted in the related art, the disclosure provides an enterprise behavior fraud risk prediction method, an enterprise behavior fraud risk prediction device and electronic equipment.

In a first aspect of embodiments of the present disclosure, there is provided an enterprise behavioral fraud risk prediction method, the method comprising:

converting the structured data in the enterprise historical behavior data into a structured data text language with a uniform format according to a preset format of the structured data;

And inputting the structured data text language and unstructured data in the enterprise historical behavior data into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result, wherein the behavior fraud risk prediction large language model performs self-pretraining in a self-supervision learning mode, and then performs fine tuning training by using a text corpus.

In one embodiment, the converting the structured data in the enterprise historical behavior data into the structured data text language with the uniform format according to the format preset for the structured data includes:

converting the structured data in the enterprise historical behavior data, according to the format preset for the structured data, into a structured data text language with a uniform format in which the name of the structured data serves as the data header and the key value serves as the data parameter value.

In one embodiment, the converting the structured data in the enterprise historical behavior data into a structured data text language with a uniform format, in which the name of the structured data serves as the data header and the key value serves as the data parameter value, according to the format preset for the structured data includes:

converting the structured data in the enterprise historical behavior data, according to the format preset for the structured data, into a structured data text language with a uniform format in which the name of the structured data serves as the data header, the keyword is determined according to the behavioral fraud risk, and the key value serves as the data parameter value.

In one embodiment, the unstructured data includes at least one of the following dimensions:

the life span of the enterprise, the number of staff members, financial information, the type of enterprise, and related legal events.

In one embodiment, the inputting the structured data text language and the unstructured data in the enterprise historical behavior data into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result includes:

inputting the structured data text language and the unstructured data in the enterprise historical behavior data into the behavior fraud risk prediction large language model for fusion, and obtaining a behavioral fraud risk score value for each dimension and a behavioral risk prediction result for each behavioral fraud risk score value, both output by the behavior fraud risk prediction large language model;

and collating the behavioral fraud risk score value and the corresponding behavioral risk prediction result for each dimension to obtain the enterprise behavioral fraud risk prediction result, wherein the behavioral fraud risk score value is positively correlated with the enterprise behavioral fraud risk.

In one embodiment, the inputting the structured data text language and the unstructured data in the enterprise historical behavior data into the behavior fraud risk prediction large language model for fusion, and obtaining a behavioral fraud risk score value for each dimension and a behavioral risk prediction result for each behavioral fraud risk score value output by the behavior fraud risk prediction large language model, includes:

inputting the structured data text language and the unstructured data in the enterprise historical behavior data into the behavior fraud risk prediction large language model for fusion to obtain the behavioral fraud risk score value for each dimension output by the model, and normalizing each behavioral fraud risk score value;

and collating the normalized behavioral fraud risk score values output by the behavior fraud risk prediction large language model and the behavioral risk prediction result for each behavioral fraud risk score value.

In a second aspect of embodiments of the present disclosure, there is provided an enterprise behavioral fraud risk prediction apparatus, the apparatus comprising:

The conversion module is configured to convert the structured data in the enterprise historical behavior data into a structured data text language with a uniform format according to a format preset for the structured data;

The fusion module is configured to input the structured data text language and unstructured data in the enterprise historical behavior data into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result, wherein the behavior fraud risk prediction large language model performs self pre-training in a self-supervision learning mode, and then performs fine-tuning training by using a text corpus.

In one embodiment, the conversion module is configured to:

In one embodiment, the fusion module is configured to:

In a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:

A processor;

A memory for storing processor-executable instructions;

Wherein the processor is configured to execute executable instructions in the memory to implement the method of any one of the first aspects.

The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:

The structured data in the enterprise historical behavior data is converted into a structured data text language with a uniform format according to a format preset for the structured data; the structured data text language and the unstructured data in the enterprise historical behavior data are then input into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result, wherein the behavior fraud risk prediction large language model is pre-trained through self-supervised learning and then fine-tuned using a text corpus. The behavior fraud risk prediction large language model can mine the meaning behind the data and overcomes the limitation of predicting risk from a single feature, so that the behavioral fraud risk of an enterprise is measured more comprehensively. Because the structured data is converted directly into language and supplemented with the unstructured text data, both can serve as input to a single model, and two separate models need not be built. The model can also output behavioral risk prediction results, which makes its output easy to understand and use.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flowchart illustrating a method for predicting risk of fraud in an enterprise, according to an example embodiment.

FIG. 2 is a flowchart illustrating one implementation of step S12 of FIG. 1, according to an exemplary embodiment.

FIG. 3 is a flowchart illustrating one implementation of step S121 of FIG. 2, according to an exemplary embodiment.

FIG. 4 is a block diagram illustrating an enterprise behavioral fraud risk prediction apparatus, according to an example embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

Before the enterprise behavior fraud risk prediction method, device, and electronic equipment provided by the present disclosure are introduced, the prior art in the related technical field is described. Judging enterprise behavioral fraud risk by experience and judging it by machine learning both involve structured data and unstructured data. Structured data is numerical or categorical information, mainly an enterprise's internal financial information and the like. Numerical data can be used for rule application or model training without changing its data type. Unstructured data, by contrast, has an irregular or incomplete data structure and no predefined data type. In actual business, much of the information that reflects the fraud risk of small and micro enterprises in fraud risk assessment is expressed as long text, for example publicly queryable risk behavior data.

Further, for unstructured data processing, there are mainly two methods:

I. Converting the data directly into quantifiable indicators, such as "whether there has been legal litigation in the last 1 year", and then carrying out subsequent model training. The statistical scope of a field may be widened according to the actual business, e.g., "number of legal litigations in the last 1 year".

II. Training a natural language processing (NLP) model that converts the text into a numerical score. Taking TF-IDF as an example, all text data is used as the basic corpus, the word frequencies within each individual text and across the whole corpus are counted, and a model is built whose output is numerical.
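To make the TF-IDF example concrete, the following is a minimal Python sketch of the weighting, assuming the plain formulation tf × log(N / df) over pre-tokenized documents. The function name `tf_idf` and the toy documents are illustrative assumptions, and production libraries typically use smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency within the document times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (assumed for illustration only).
docs = [
    "lawsuit lost wages unpaid".split(),
    "lawsuit settled".split(),
    "annual report filed".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

Terms that occur in many documents, such as "lawsuit" here, receive a lower weight than terms that are specific to one document, and the output is purely numerical, which is the property the passage above relies on.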

Therefore, to build a scorecard, the unstructured text must first be processed and then fused with the structured data, after which a model is built again or corresponding rules are formulated. However, when unstructured data (long text) is converted directly into quantifiable indicators, the processing is too simplistic to capture the real risk. Public unstructured data on the network is converted into quantifiable variables such as "number of legal enforcement actions in the last 1 year". Only the count is recorded, not the content, so important information is lost. Specifically, the document type, the court level, the judgment result, the case filing date, and the time elapsed since then cannot be obtained from the count alone. As a result, a single variable cannot capture the complete information.
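The information loss described above can be illustrated with a short Python sketch. The `LegalRecord` fields, the sample cases, and the reference date are all hypothetical; the point is only that two quite different case histories collapse to the same count.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LegalRecord:
    document_type: str
    court_level: str
    outcome: str
    filed_on: date


def lawsuits_in_last_year(records: list[LegalRecord], today: date) -> int:
    """Collapse the records into a single count, as a scorecard variable would."""
    return sum(1 for r in records if 0 <= (today - r.filed_on).days <= 365)


# Two hypothetical enterprises with very different legal histories.
case_a = [LegalRecord("judgment", "district court", "lost", date(2024, 1, 10))]
case_b = [LegalRecord("mediation", "high court", "won", date(2023, 9, 2))]
today = date(2024, 5, 20)

# Both enterprises receive the same feature value although the cases differ.
count_a = lawsuits_in_last_year(case_a, today)
count_b = lawsuits_in_last_year(case_b, today)
print(count_a, count_b)
```

The document type, court level, and outcome never reach the model in this representation, which is the limitation the paragraph attributes to count-based indicators.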

Further, public data on the network is large in volume and varied in type, and it cannot be covered in full. For example, small and micro enterprises operate in different market types and face different risks. Existing scorecards for small and micro enterprises make only a simple distinction by the industry in which the customer operates; for example, an enterprise may be classified as service industry even though it mainly serves the real estate industry. A single industry label therefore cannot measure the full picture.

Further, if too many features are derived in pursuit of full coverage, the model effect is difficult to guarantee and resources are wasted. For example, creating a separate field for each court level produces excessively high-dimensional, sparse data (mostly zeros), and more feature screening is then needed to preserve the model effect. This places high demands on personnel and wastes computing resources.
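The sparsity problem can be seen in a small Python sketch that one-hot encodes two categorical attributes of a legal record. The category lists `COURT_LEVELS` and `DOCUMENT_TYPES` are assumptions chosen for illustration; real taxonomies would be considerably larger and the effect correspondingly stronger.

```python
# Hypothetical category lists; real taxonomies are much larger.
COURT_LEVELS = ["basic", "intermediate", "high", "supreme"]
DOCUMENT_TYPES = ["judgment", "ruling", "mediation", "enforcement"]


def one_hot(value: str, categories: list[str]) -> list[int]:
    """Return a 0/1 vector with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]


def encode(court_level: str, document_type: str) -> list[int]:
    """Concatenate the one-hot vectors of both attributes into one row."""
    return one_hot(court_level, COURT_LEVELS) + one_hot(document_type, DOCUMENT_TYPES)


row = encode("basic", "judgment")
# Fraction of entries in the row that are zero.
sparsity = row.count(0) / len(row)
print(len(row), sparsity)
```

Each additional attribute adds as many columns as it has categories while contributing only a single non-zero entry per row, so the share of zeros grows with the number of categories.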

When a natural language processing (NLP) model is built, the unstructured text also needs to be processed. First, text processing with a natural language model places high demands on computing and human resources. The first step in a natural language model is to encode the text of the corpus and convert it into a form the computer can understand. To ensure that the computer can recognize new text that appears later, and to ensure that the model generalizes, the corpus should be as large as possible, which demands considerable resources from the enterprise. Subsequent model building also requires more specialized technicians. The resource requirements on the enterprise are therefore high.

Second, the score given by a natural language processing (NLP) model is poorly interpretable. The model is complex, so its results are difficult to explain to business personnel in actual use, which hinders subsequent adoption. Moreover, when little data is available, only an expert scorecard can be used and no other reference exists. Consequently, the behavioral fraud risk of the enterprise cannot be accurately predicted.

FIG. 1 is a flowchart illustrating a method for predicting risk of fraud in an enterprise, according to an example embodiment. As shown in fig. 1, the method includes the following steps.

In step S11, the structured data in the enterprise historical behavioral data is converted into a structured data text language in a unified format according to a format preset for the structured data.

Structured data refers to data having a fixed structure or format, such as data in a relational database. Such data typically has fixed fields and attributes and can be easily queried and analyzed.

Unstructured data refers to data with an irregular structure or no fixed format, such as text, pictures, audio, and video. Such data is not as easy to analyze and query directly as structured data.

In the disclosed embodiments, a unified format is determined that can clearly represent the key information in the structured data while facilitating subsequent processing and analysis. This format may be a fixed text template or a text representation based on some encoding scheme (e.g., JSON or XML). The preset format can be set in the device based on human experience.

In the embodiment of the disclosure, the structured data to be converted is extracted from the database. The extracted data is cleaned to remove irrelevant information, abnormal values, and erroneous data, ensuring the accuracy and consistency of the data. The cleaned data is then mapped into the preset text template, with the value of each field filled into its corresponding position, and the corresponding structured data text language is generated from the mapping result.

Once the structured data has been converted into a structured data text language, all of it is represented in one unified text format, which facilitates subsequent data processing and analysis. Data in text form is easier to fuse with unstructured data, because large language models typically process text. The converted text is also easier for a large language model to understand, which helps when debugging and interpreting model predictions.
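The clean-then-map procedure of step S11 can be sketched in Python as follows. This is only an illustration under stated assumptions: the template wording, the field names `years` and `staff`, and the cleaning rule (drop missing values and negative numbers) are hypothetical and are not specified by the disclosure.

```python
# Hypothetical fixed text template with one slot per required field.
TEMPLATE = "The enterprise has operated for {years} years and employs {staff} people."


def clean(record: dict) -> dict:
    """Drop fields that are missing or clearly invalid (negative numbers)."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        if isinstance(value, (int, float)) and value < 0:
            continue
        cleaned[key] = value
    return cleaned


def render(record: dict) -> str:
    """Fill the template; raises KeyError if a required field was removed."""
    return TEMPLATE.format(**clean(record))


sentence = render({"years": 8, "staff": 5, "note": None})
print(sentence)
```

Raising an error when a required field fails cleaning is one possible design choice; an alternative would be to emit an explicit "unknown" value so that the model still sees the field.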

In step S12, the structured data text language and unstructured data in the enterprise historical behavior data are input into a large language model for predicting the risk of behavioral fraud for fusion, so as to obtain a prediction result of the risk of behavioral fraud of the enterprise.

The behavioral fraud risk prediction large language model is obtained by pre-training through self-supervised learning and then fine-tuning with a text corpus.

The behavioral fraud risk prediction large language model (LLM) is a deep-learning-based natural language processing model, typically containing billions or even trillions of parameters. It is trained on a large amount of enterprise text data to understand and generate natural language text.

Self-supervised learning enables the behavioral fraud risk prediction large language model to learn by using the structure or features of the enterprise's structured and unstructured data themselves as supervisory signals. In self-supervised learning, the model may attempt to predict certain portions of the data. Fine-tuning transfers a large language model that has been trained on a large amount of text data to the behavioral fraud prediction task and further trains it with enterprise behavior data so that it adapts better to that task.
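One common way to "predict certain portions of the data" is token masking, where hidden tokens serve as their own labels. The Python sketch below builds such a training example; it is an assumption that masking is the objective used, since the disclosure does not name one, and the function name, `[MASK]` symbol, and mask ratio are illustrative.

```python
import random


def make_masked_example(tokens: list[str], mask_ratio: float, rng: random.Random):
    """Hide a random subset of tokens; the hidden tokens become the labels."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # label comes from the data itself
        masked[pos] = "[MASK]"       # model input with the token hidden
    return masked, targets


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = "the enterprise lost a lawsuit over unpaid wages".split()
masked, targets = make_masked_example(tokens, 0.25, rng)
print(masked, targets)
```

No human annotation is involved: the supervisory signal is derived entirely from the text, which is what distinguishes this stage from the later fine-tuning on task-specific data.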

In the embodiment of the disclosure, data fusion is particularly critical in enterprise behavior fraud risk prediction. This is because fraud may be hidden in a field of the structured data, may be in a text segment of the unstructured data, or both. By fusing the two data, the model can capture more fraud signals, thereby improving the accuracy of prediction.

In the embodiment of the disclosure, the structured data text language and the unstructured data are first preprocessed, for example by removing irrelevant information, cleaning the text, and standardizing it, to ensure the quality of the data input into the model. The structured data text language and the unstructured data are then fused into a single input sequence. This may require a specific fusion strategy, such as embedding the structured data into the unstructured text or combining the two through a specific encoding. The fused data is input into the large language model, which uses its parameters and structure to analyze and understand the input in depth and outputs a fraud risk prediction result. This result is typically a probability value or a class label indicating the likelihood that a particular business activity is fraudulent. The prediction result provides an important reference for enterprise decision making and helps enterprises discover and respond to potential fraud risks in time.
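A very simple fusion strategy is to concatenate both sources into one labelled text sequence. The Python sketch below shows this under stated assumptions: the section markers `[STRUCTURED]` and `[UNSTRUCTURED]`, the function name `build_input`, and the whitespace normalization are all illustrative choices, not the fusion strategy of the disclosure.

```python
def build_input(structured_text: str, unstructured_texts: list[str]) -> str:
    """Concatenate both sources into one labelled input string."""
    # Normalize whitespace in each passage and drop empty passages.
    passages = [" ".join(t.split()) for t in unstructured_texts]
    passages = [p for p in passages if p]
    parts = ["[STRUCTURED]", structured_text.strip(), "[UNSTRUCTURED]"]
    parts.extend(f"- {p}" for p in passages)
    return "\n".join(parts)


prompt = build_input(
    "years in business is 8; staff count is 5",
    ["  Lost a lawsuit\n over a contract dispute. ", ""],
)
print(prompt)
```

Keeping the two sources in clearly marked sections lets a single model read both, which matches the stated goal of avoiding two separate models.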

According to the above technical scheme, the structured data in the enterprise historical behavior data is converted into a structured data text language with a uniform format according to a format preset for the structured data; the structured data text language and the unstructured data in the enterprise historical behavior data are then input into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result, wherein the behavior fraud risk prediction large language model is pre-trained through self-supervised learning and then fine-tuned using a text corpus. The behavior fraud risk prediction large language model can mine the meaning behind the data and overcomes the limitation of predicting risk from a single feature, so that the behavioral fraud risk of an enterprise is measured more comprehensively. Because the structured data is converted directly into language and supplemented with the unstructured text data, both can serve as input to a single model, and two separate models need not be built. The model can also output behavioral risk prediction results, which makes its output easy to understand and use.

For example, the structured data may be converted using the format "column name" + "is" + "specific value" to obtain the structured data text language, e.g., "the number of years in business is 8 years".
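The "column name" + "is" + "specific value" format can be sketched in a few lines of Python. The optional `units` mapping and the "; " separator between fields are assumptions added for illustration.

```python
from typing import Optional


def to_sentence(column: str, value: object, unit: str = "") -> str:
    """Render one field as '<column name> is <value> [unit]'."""
    suffix = f" {unit}" if unit else ""
    return f"{column} is {value}{suffix}"


def record_to_text(record: dict, units: Optional[dict] = None) -> str:
    """Render every field of a record and join the results with '; '."""
    units = units or {}
    return "; ".join(
        to_sentence(column, value, units.get(column, ""))
        for column, value in record.items()
    )


text = record_to_text(
    {"number of years in business": 8, "number of staff": 5},
    units={"number of years in business": "years"},
)
print(text)
```

Because every field follows the same pattern, the resulting text has a uniform format regardless of which table the values came from.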

In one embodiment, referring to fig. 2, in step S12, the inputting the structured data text language and unstructured data in the enterprise historical behavior data into a large language model for predicting behavioral fraud risk for fusion to obtain an enterprise behavioral fraud risk prediction result includes:

In step S121, the structured data text language and unstructured data in the enterprise historical behavior data are input into a behavioral fraud risk prediction large language model to be fused, so as to obtain a behavioral fraud risk scoring value for each dimension and a behavioral fraud risk prediction result of each behavioral fraud risk scoring value output by the behavioral fraud risk prediction large language model;

The behavioral risk prediction results refer to the specific evidence or reasons that the large language model uses to support its fraud risk scores, and typically include specific words, phrases, or patterns in the text.

The structured data is first converted to text language so that it can be understood and processed by a large language model. At the same time, unstructured data in the enterprise historical behavioral data is also collected. Together, these data are input into a behavioral fraud risk prediction large language model. The large language model performs deep analysis and understanding on the input data through the internal neural network structure and parameters. It attempts to identify patterns, keywords or phrases that are related to fraud risk and calculates fraud risk scoring values for each dimension based on these features.

These scoring values reflect the model's estimate of the likelihood of fraud in each dimension, while behavioral risk prediction results provide specific evidence and reasons for the model to derive these scoring values.
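If the model is prompted to answer in a machine-readable form, the per-dimension scores and their supporting evidence can be read into a small data structure. The Python sketch below assumes, purely for illustration, that the model returns a JSON array of objects with `dimension`, `score`, and `evidence` keys; the disclosure does not prescribe an output format.

```python
import json
from dataclasses import dataclass


@dataclass
class DimensionResult:
    dimension: str   # e.g. "related legal events"
    score: float     # higher means higher fraud risk
    evidence: str    # text the model cites to justify the score


def parse_model_output(raw: str) -> list[DimensionResult]:
    """Parse a JSON array of {dimension, score, evidence} objects."""
    return [
        DimensionResult(item["dimension"], float(item["score"]), item["evidence"])
        for item in json.loads(raw)
    ]


# Hypothetical model response; real output depends on the prompt and model.
raw_output = json.dumps([
    {"dimension": "related legal events", "score": 8, "evidence": "lost a lawsuit"},
    {"dimension": "life span", "score": 3, "evidence": "in business for 8 years"},
])
results = parse_model_output(raw_output)
print(results)
```

Keeping the evidence next to each score preserves the link between a number and its justification, which is what makes the output explainable to business personnel.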

In step S122, the behavioral fraud risk score value and the corresponding behavioral risk prediction result for each dimension are collated to obtain the enterprise behavioral fraud risk prediction result.

Wherein the behavioral fraud risk score is positively correlated with the business behavioral fraud risk.

Wherein, the positive correlation of the behavioral fraud risk score value and the enterprise behavioral fraud risk can be understood that the greater the behavioral fraud risk score value, the higher the enterprise behavioral fraud risk; the smaller the behavioral fraud risk score value, the lower the enterprise behavioral fraud risk.

First, the fraud risk score values for each dimension may be aggregated and organized, which may involve sorting, categorizing, or further processing the score values so that the fraud risk in each dimension can be understood more intuitively. At the same time, the behavioral risk prediction results corresponding to each score value are also organized, which helps in understanding how the model arrived at the score values and how reliable they are.

Finally, the enterprise behavioral fraud risk prediction result is generated from the collated data. The result may be a comprehensive fraud risk score or a detailed report including the score value for each dimension, the behavioral risk prediction results, suggested measures, and so on. These results provide an important reference for enterprise decisions and help enterprises discover and respond to potential fraud risks in time.
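The collation in step S122 can be sketched as ranking the dimensions by score and assembling a report. In this Python sketch the overall score is taken as the plain mean of the dimension scores, which is an assumption made for illustration; the disclosure does not state how a comprehensive score is computed. The sample dimensions and scores are likewise hypothetical.

```python
def collate(results: list[tuple[str, float, str]]) -> dict:
    """Rank dimensions by score (highest risk first) and build a report."""
    if not results:
        raise ValueError("at least one dimension result is required")
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    # Assumption: the overall score is the plain mean of the dimension scores.
    overall = sum(score for _, score, _ in ranked) / len(ranked)
    return {
        "overall_score": overall,
        "dimensions": [
            {"dimension": d, "score": s, "evidence": e} for d, s, e in ranked
        ],
    }


report = collate([
    ("life span", 3.0, "in business for 8 years"),
    ("related legal events", 8.0, "lost a lawsuit"),
    ("financial information", 7.0, "relatively low profit margin"),
])
print(report["overall_score"], report["dimensions"][0]["dimension"])
```

Sorting with the highest score first follows from the positive correlation between score and risk: the dimensions at the top of the report are the ones that most need attention.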

In one embodiment, referring to fig. 3, in step S121, the step of inputting unstructured data in the structured data text language and the enterprise historical behavior data into a behavioral fraud risk prediction large language model to be fused, to obtain a behavioral fraud risk score value for each dimension and a behavioral fraud risk prediction result of each behavioral fraud risk score value output by the behavioral fraud risk prediction large language model includes:

in step S1211, the structured data text language and unstructured data in the enterprise historical behavior data are input into a behavioral fraud risk prediction large language model to be fused, so as to obtain behavioral fraud risk scoring values which are output by the behavioral fraud risk prediction large language model and are specific to each dimension, and each behavioral fraud risk scoring value is normalized;

Wherein the normalization process may normalize the behavioral fraud risk score value to between, for example, 0-10 or 0-100.

In step S1212, the normalized behavioral fraud risk score values output by the behavioral fraud risk prediction large language model and the behavioral risk prediction result for each behavioral fraud risk score value are collated.

A behavioral risk prediction result is determined for each dimension according to the behavioral risk score value, the structured data text language, and the characteristics of the unstructured data in the enterprise historical behavior data, and the enterprise behavioral fraud risk prediction result is determined from the per-dimension results.
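The normalization in step S1211 maps raw score values onto a common range such as 0-10 or 0-100. The Python sketch below uses min-max scaling, which is one possible choice and an assumption here, since the disclosure does not name a normalization method; the raw scores are hypothetical.

```python
def normalize(scores: dict[str, float], low: float = 0.0, high: float = 10.0) -> dict[str, float]:
    """Min-max scale the scores into the closed range [low, high]."""
    smallest, largest = min(scores.values()), max(scores.values())
    if largest == smallest:
        # All scores are equal, so place them at the midpoint of the range.
        midpoint = (low + high) / 2
        return {name: midpoint for name in scores}
    scale = (high - low) / (largest - smallest)
    return {name: low + (value - smallest) * scale for name, value in scores.items()}


raw = {"life span": 1.0, "financial information": 3.0, "related legal events": 5.0}
on_ten = normalize(raw)                 # range 0-10
on_hundred = normalize(raw, 0.0, 100.0)  # range 0-100
print(on_ten, on_hundred)
```

Min-max scaling preserves the ordering of the scores, so the positive correlation between score value and fraud risk is unchanged by the normalization.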

The final enterprise behavioral fraud risk prediction result may take the following form:

Based on the information provided, the enterprise may be scored as follows:

In business for 8 years: the enterprise has existed for some time, which indicates that it has a certain stability and credibility in the market.

5 staff members: the number of staff is small, but this also means that the operating cost of the enterprise is relatively low.

2022 revenue of 2 million and actual profit of 500,000: this indicates that the enterprise is fairly profitable, but its profit margin is relatively low.

Service industry enterprise providing financial planning: this indicates that the business field of the enterprise is relatively narrow, and a certain market risk may exist.

Legal event involving 12 months of employee wages: this indicates that the enterprise is at some legal risk and may face legal litigation from employees.

Related legal event: one lost lawsuit involving an amount of 500,000: this indicates that the enterprise carries a certain legal risk and may face legal litigation from clients.

Based on the above factors, the enterprise is scored 7 points, indicating that the enterprise is at some risk of behavioral fraud.

The embodiments of the disclosure address the analysis of unstructured data, reduce the loss of text information, and remove the restriction on input data types. For unstructured data, the large language model has a natural advantage in using and judging text. It can understand the meaning behind the enterprise's data, overcomes the limitation of single features, and measures the actual situation of the enterprise more comprehensively. On the input side, the structured data is converted directly into language and supplemented with the unstructured text data, so both serve as input to a single model and two models need not be built. The large language model draws on a large number of public language corpora. In subsequent use, only the packaged model is needed, which lowers the skill requirements for staff and the cost to the enterprise. The output of the large language model also improves interpretability: although the model itself is a complex natural language model, it can output the basis for its judgment, which can be provided to business personnel and is easy to understand and use. In addition, because the large language model is trained on public data from the internet, timeliness is improved.

The embodiment of the disclosure also provides an enterprise behavioral fraud risk prediction apparatus, which includes:

a conversion module 410 configured to convert structured data in the enterprise historical behavioral data into a structured data text language in a unified format according to a format preset for the structured data;

The fusion module 420 is configured to input the structured data text language and unstructured data in the enterprise historical behavior data into a behavior fraud risk prediction large language model for fusion to obtain an enterprise behavior fraud risk prediction result, wherein the behavior fraud risk prediction large language model performs self pre-training in a self-supervision learning mode, and then performs fine-tuning training by using a text corpus.
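The division of the apparatus into a conversion module and a fusion module can be sketched as two small Python classes. Everything here is a hypothetical illustration: the class and method names, the simple text formats, and in particular `stub_model`, which merely stands in for the fine-tuned large language model and performs no prediction.

```python
from typing import Callable


class ConversionModule:
    """Turns a structured record into uniform text (cf. module 410)."""

    def convert(self, record: dict) -> str:
        return "; ".join(f"{name} is {value}" for name, value in record.items())


class FusionModule:
    """Feeds structured text plus unstructured text to a model (cf. module 420)."""

    def __init__(self, model: Callable[[str], str]):
        self.model = model

    def predict(self, structured_text: str, unstructured: list[str]) -> str:
        prompt = "\n".join([structured_text, *unstructured])
        return self.model(prompt)


def stub_model(prompt: str) -> str:
    # Placeholder: a real system would call the fine-tuned LLM here.
    return f"received {len(prompt.splitlines())} lines"


structured = ConversionModule().convert({"number of staff": 5})
answer = FusionModule(stub_model).predict(structured, ["lost a lawsuit", "filed annual report"])
print(answer)
```

Passing the model in as a callable keeps the two modules independent of any particular model implementation, so the packaged model can be swapped without changing the conversion logic.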

In one embodiment, the conversion module 410 is configured to:

In one embodiment, the fusion module 420 is configured to:

collate the normalized behavioral fraud risk score values output by the behavioral fraud risk prediction large language model and the behavioral risk prediction result for each behavioral fraud risk score value.

The embodiment of the disclosure also provides an electronic device, including:

A processor;

A memory for storing processor-executable instructions;

wherein the processor is configured to execute executable instructions in the memory to implement the enterprise behavioral fraud risk prediction method of any of the preceding embodiments.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.

It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof.

Claims

1. A method for predicting risk of fraud in an enterprise, the method comprising:

2. The method according to claim 1, wherein the converting the structured data in the enterprise historical behavioral data into a structured data text language in a unified format according to a format preset for the structured data, comprises:

3. The method according to claim 2, wherein the converting the structured data in the enterprise historical behavior data into a structured data text language in a unified format with the name of the structured data as a data header and the key as a data parameter according to a format preset for the structured data, comprises:

4. A method according to claim 3, wherein the unstructured data comprises at least one of the following dimensions:

5. The method according to claim 1, wherein the inputting the structured data text language and unstructured data in the enterprise historical behavior data into a large language model for predicting behavioral fraud risk for fusion to obtain an enterprise behavioral fraud risk prediction result includes:

6. The method according to claim 5, wherein the inputting unstructured data in the structured data text language and the enterprise historical behavior data into a behavioral fraud risk prediction large language model for fusion, to obtain a behavioral fraud risk score value for each dimension and a behavioral fraud risk prediction result output by the behavioral fraud risk prediction large language model, includes:

7. An enterprise fraud risk prediction apparatus, the apparatus comprising:

8. The apparatus of claim 7, wherein the conversion module is configured to:

9. The apparatus of claim 8, wherein the conversion module is configured to:

10. An electronic device, comprising:

A processor;

A memory for storing processor-executable instructions;

wherein the processor is configured to execute executable instructions in the memory to implement the method of any of claims 1-6.