CN117273451A

CN117273451A - Enterprise risk information processing method, device, equipment and storage medium

Info

Publication number: CN117273451A
Application number: CN202311268401.0A
Authority: CN
Inventors: 胡慧丽; 林苏燕; 李展; 顾丹铭
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2023-12-22

Abstract

The application provides an enterprise risk information processing method, device, equipment and storage medium, and relates to the field of finance or other related fields. The method comprises: acquiring basic attribute information, user behavior logs, and buried point information data associated with the risk of a target enterprise; preprocessing the basic attribute information to obtain basic attribute features; constructing multi-dimensional time sequence statistical features according to the user behavior logs and the buried point information data, wherein the multi-dimensional time sequence statistical features represent the matching relationship between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query; obtaining a nearest neighbor set of K neighbors according to the basic attribute features and the multi-dimensional time sequence statistical features, wherein the K neighbors represent the K suspected matching relationships, among the matching relationships, whose distance to the target enterprise meets a preset condition; and performing risk identification on the target enterprise according to the K suspected matching relationships in the nearest neighbor set. Enterprise risk can thus be identified dynamically and promptly when enterprise information changes, effectively improving the risk management and control of a financial institution.

Description

Enterprise risk information processing method, device, equipment and storage medium

Technical Field

The present disclosure relates to the financial field or other related fields, and in particular, to a method, an apparatus, a device, and a storage medium for processing risk information of an enterprise.

Background

Enterprises face various risks in the course of operation. To keep their own operations stable, financial institutions need to assess enterprise risks in advance and identify the degree of risk.

Currently, enterprise risk information is generally identified and integrated by exactly matching the enterprise name and enterprise certificates (unified social credit code, registration number, organization code) against the risk information.

However, enterprise registration information is updated infrequently, while enterprises may be renamed in the course of actual operation. The prior art cannot promptly identify the risk information of a renamed enterprise, so risk prediction errors occur easily.

Disclosure of Invention

The application provides an enterprise risk information processing method, device, equipment and storage medium, to solve the technical problem that, when enterprise information changes, the risk information of a renamed enterprise cannot be identified in time.

In a first aspect, the present application provides an enterprise risk information processing method, including:

acquiring basic attribute information, user behavior logs, and buried point information data associated with the risk of a target enterprise;

preprocessing the basic attribute information to obtain basic attribute features;

constructing multi-dimensional time sequence statistical features according to the user behavior logs and the buried point information data, wherein the multi-dimensional time sequence statistical features represent the matching relationship between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query;

obtaining a nearest neighbor set according to the basic attribute features and the multi-dimensional time sequence statistical features, wherein the nearest neighbor set comprises K neighbors, K is a positive integer, and the K neighbors represent the K suspected matching relationships, among the matching relationships, whose distance to the target enterprise meets a preset condition;

and performing risk identification on the target enterprise according to the K suspected matching relationships in the nearest neighbor set.

In a second aspect, the present application provides an enterprise risk information processing apparatus, including:

a data acquisition module, configured to acquire basic attribute information, user behavior logs, and buried point information data associated with the risk of a target enterprise;

an information preprocessing module, configured to preprocess the basic attribute information to obtain basic attribute features;

a feature construction module, configured to construct multi-dimensional time sequence statistical features according to the user behavior logs and the buried point information data, wherein the multi-dimensional time sequence statistical features represent the matching relationship between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query;

a set acquisition module, configured to obtain a nearest neighbor set according to the basic attribute features and the multi-dimensional time sequence statistical features, wherein the nearest neighbor set comprises K neighbors, K is a positive integer, and the K neighbors represent the K suspected matching relationships, among the matching relationships, whose distance to the target enterprise meets a preset condition;

and a risk identification module, configured to perform risk identification on the target enterprise according to the K suspected matching relationships in the nearest neighbor set.

In a third aspect, the present application provides an electronic device, comprising: a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the method as described above.

In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for performing a method as described above when executed by a processor.

According to the enterprise risk information processing method, device, equipment and storage medium provided by the application, the basic attribute features and the multi-dimensional time sequence statistical features are combined to compute the k nearest neighbors between the matching relationship (the enterprise a user fuzzy-searches for versus the enterprise the user clicks to query) and the exact matching relationship currently obtained through business registration information. The suspected matching relationship is thereby located, so that enterprise risk can be identified dynamically and the risk management and control of a financial institution can be improved promptly and efficiently.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic diagram of enterprise business license information provided in an embodiment of the present application;

FIG. 2 is a flow chart of an enterprise risk information processing method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of data preprocessing according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of basic attribute features according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of multi-dimensional time sequence statistical features provided in an embodiment of the present application;

FIG. 6 is an overall flowchart of enterprise risk information identification provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of screening suspected matching relationships according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of an enterprise risk information processing apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.

It should be noted that the method, the device, the equipment and the storage medium for processing the enterprise risk information provided by the application can be used in the technical field of finance, and can also be used in any field except the technical field of finance.

FIG. 1 is a schematic diagram of enterprise business license information provided in an embodiment of the present application. As shown in FIG. 1, enterprise business license information is the information printed on the face of an enterprise's business license, and generally includes the enterprise name, unified social credit code, enterprise type, legal representative, business scope, residence, date of establishment, and registered capital. When a financial institution (e.g., a bank) identifies risk information of an enterprise, it usually does so by exactly matching the enterprise name and enterprise certificates (unified social credit code, registration number, organization code) in the enterprise registration information against the corresponding information in the risk information. The enterprise certificate is usually the business license information, but that information is updated infrequently (typically only under pressure from a financial institution). When an enterprise is renamed, the current risk identification schemes of financial institutions cannot promptly identify the risk information of the renamed enterprise, which leaves a loophole in which the substantive risk of the enterprise cannot be controlled. In addition, given massive amounts of unstructured risk information, it is difficult for financial institutions to integrate all kinds of risk information comprehensively.

To address the above problems, the embodiments of the application provide an enterprise risk information processing method, device, equipment and storage medium, which use big data analysis and machine learning technology to identify enterprise risk dynamically, integrate enterprise risk information effectively, and improve risk management and control promptly and efficiently.

The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 2 is a flowchart of an enterprise risk information processing method provided in an embodiment of the present application, where the method may be applied to risk information identification of each enterprise user by a financial institution. As shown in fig. 2, the method specifically may include the following steps:

Step S201, acquiring basic attribute information, user behavior logs, and buried point information data associated with the risk of the target enterprise.

In this embodiment, the basic attribute information is information associated with the risks an enterprise may carry, and can be used to infer whether the enterprise has a relevant risk. For example, the basic attribute information includes enterprise business license information, specifically the enterprise name, unified social credit code, enterprise address, legal representative, and the like. The user behavior log may record a user's recent operations in the application software, such as the user's search queries in the last month and the user's clicks in the last month. Buried point information data may refer to user-related information collected in an application or on a website, such as a user's click operations on a page (e.g., clicking a button) or the user's dwell time on a page (generally, the longer the dwell time, the more interested the user is in the content of the page). Specifically, the user behavior log and the buried point information data can be obtained from each business line, product line, channel, service version, and customer group of the financial institution as its products are used.

In this embodiment, the certificate photos provided by enterprises often have their own layout or format, so complete business license information can be obtained only after performing text recognition on the certificate photos and then extracting and organizing the text. The complete business license information obtained in this way can be used as the basic attribute information. The user behavior log and the buried point information data need to be acquired from the application side.

Illustratively, in other embodiments, the basic attribute information includes at least one of: enterprise business license information, financing information, intellectual property information, external institution evaluation information, and third-party provider information.

The enterprise business license information may include the enterprise name, unified social credit code, enterprise type, legal representative, business scope, residence, date of establishment, and the like. The financing information includes issued bonds and similar information. The intellectual property information includes trademarks and patents owned by the enterprise. The external institution evaluation information includes authoritative data from institutions other than the financial institution. The third-party provider information includes telecommunication fraud information, complaint information, dishonesty records, penalty information, operating risk, and the like.

Step S202, preprocessing basic attribute information to obtain basic attribute characteristics.

In this embodiment, the preprocessing may include operations such as data cleaning, information standardization, and data splicing. After these operations, the basic attribute information is vectorized to obtain a feature vector, which serves as the basic attribute feature.

For example, take the enterprise name obtained after preprocessing the basic attribute information. The preprocessed enterprise name may be mapped to a low-dimensional vector (for example, 64 bits), and the 64-bit Hamming distance to the enterprise name in the user behavior log and buried point information data is then computed to obtain the enterprise name similarity, which serves as one of the basic attribute features.

Step S203, constructing multi-dimensional time sequence statistical features according to the user behavior log and the buried point information data, wherein the multi-dimensional time sequence statistical features represent the matching relationship between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query.

In this embodiment, the user behavior log and the buried point information data may be divided into different time periods, for example the user's behavior log and buried point information data from the last 90 days, and the time sequence statistical features of each time period are calculated from that data. Based on the user behavior logs and buried point information data of each business line, product line, channel, service version, and customer group of the financial institution as its products are used, multi-dimensional time sequence statistical features can be constructed, such as the number of searches by a user in each time period, the user's click rate in each time period, the user's dwell time on a page, the user's access depth, and the user's number of repeated searches. These features describe the matching relationship between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query.
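The name comparison described above can be sketched as follows. The text does not say how a name is mapped to 64 bits, so this sketch assumes a SimHash-style fingerprint over character bigrams; the functions `fingerprint64`, `hamming_distance`, and `name_similarity` are hypothetical names introduced only for illustration.

```python
import hashlib


def fingerprint64(name: str, n: int = 2) -> int:
    """Map a name to a 64-bit SimHash-style fingerprint over character n-grams (assumed scheme)."""
    grams = [name[i:i + n] for i in range(max(len(name) - n + 1, 1))]
    weights = [0] * 64
    for gram in grams:
        # Stable 64-bit hash of the n-gram.
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set when the n-grams voted 1 more often than 0.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def name_similarity(name_a: str, name_b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two 64-bit fingerprints are identical."""
    distance = hamming_distance(fingerprint64(name_a), fingerprint64(name_b))
    return 1.0 - distance / 64
```

Names that share most of their bigrams tend to produce fingerprints that differ in only a few bits, which is why a small Hamming distance can stand in for name similarity here.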
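As a minimal sketch of per-period statistics, the following computes a search count and a click rate over several look-back windows for one pair of fuzzy-searched and click-queried enterprises. The record layout, the window lengths, and the function name `window_features` are assumptions made for illustration, not details taken from the application.

```python
from datetime import datetime, timedelta


def window_features(events, now, windows=(10, 20, 30, 60, 90)):
    """Search count and click rate per look-back window (in days).

    `events` is a list of dicts with a "time" (datetime) and a "type"
    ("search" or "click"); this layout is hypothetical.
    """
    features = {}
    for days in windows:
        start = now - timedelta(days=days)
        recent = [e for e in events if start <= e["time"] <= now]
        searches = sum(1 for e in recent if e["type"] == "search")
        clicks = sum(1 for e in recent if e["type"] == "click")
        features[f"search_count_{days}d"] = searches
        features[f"click_rate_{days}d"] = clicks / searches if searches else 0.0
    return features


now = datetime(2023, 9, 27)
events = [
    {"time": now - timedelta(days=5), "type": "search"},
    {"time": now - timedelta(days=5), "type": "click"},
    {"time": now - timedelta(days=45), "type": "search"},
]
features = window_features(events, now)
```

Each key of `features` becomes one dimension of the time sequence feature vector for that pair.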

In other embodiments, the multi-dimensional time sequence statistical features derived from the user behavior log and the buried point information data include: the number of searches for a target correspondence; the average number of searches for the target correspondence within a preset time period; the click rate of the target correspondence; the average click rate of the target correspondence; the number of times the user's dwell time on a page falls within a first target time range interval; the user access depth; the user average access depth; the number of repeated searches for an enterprise within a second target time range interval; and the page jump rate. The target correspondence represents the correspondence between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query.

The number of searches for the target correspondence may be counted over the last 10 days, 20 days, 1 month, 60 days, and 90 days. The preset time period may be the last 10 days, 20 days, 1 month, 60 days, or 90 days; for example, the average number of searches for the target correspondence in the preset time period may be the average over the last 10 days. The click rate and the average click rate of the target correspondence may likewise be divided by time period, for example over the last 10 days, 20 days, 1 month, 60 days, and 90 days. The first target time range interval may be 0 to 5 seconds, 5 to 12 seconds, 12 to 30 seconds, or 30 seconds to 1 minute. The user access depth and the user average access depth may also be divided by time period, for example over the last 1 day, 10 days, 20 days, 1 month, 60 days, and 90 days. The second target time range interval may be within 1 minute, within 2 minutes, or within 5 minutes. The page jump rate may be divided by time period, for example over the last 1 day, 3 days, 5 days, 10 days, 20 days, and 1 month.

Step S204, obtaining a nearest neighbor set according to the basic attribute features and the multi-dimensional time sequence statistical features, wherein the nearest neighbor set comprises K neighbors, K is a positive integer, and the K neighbors represent the K suspected matching relationships, among the matching relationships, whose distance to the target enterprise meets a preset condition.
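The dwell-time feature can be illustrated by counting how many page visits fall into each interval. The text does not say whether interval boundaries are inclusive, so this sketch assumes half-open intervals (for example, 5 seconds belongs to the 5 to 12 second interval); the names `BUCKET_EDGES`, `BUCKET_NAMES`, and `dwell_time_histogram` are hypothetical.

```python
from bisect import bisect_right

BUCKET_EDGES = [0, 5, 12, 30, 60]  # interval boundaries in seconds
BUCKET_NAMES = ["0-5s", "5-12s", "12-30s", "30-60s"]


def dwell_time_histogram(dwell_seconds):
    """Count page visits per dwell-time interval; visits of 60 s or more are ignored."""
    counts = {name: 0 for name in BUCKET_NAMES}
    for seconds in dwell_seconds:
        if 0 <= seconds < 60:
            index = bisect_right(BUCKET_EDGES, seconds) - 1
            counts[BUCKET_NAMES[index]] += 1
    return counts


histogram = dwell_time_histogram([1.5, 4.9, 5.0, 11, 29.9, 45, 75])
```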

Step S205, performing risk identification on the target enterprise according to K suspected matching relations in the nearest neighbor set.

In this embodiment, a KNN (k-nearest neighbor) model is constructed by combining the two major kinds of features (i.e., the basic attribute features and the multi-dimensional time sequence statistical features). The model computes the k nearest neighbors between the matching relationship (a fuzzy-searched enterprise versus a click-queried enterprise) and the exact matching relationship currently obtained through business registration information, thereby locating the suspected matching relationship and dynamically identifying enterprise risk.

The KNN model is a commonly used machine learning classification algorithm. It calculates the distance between a sample and every sample in the training set, finds the K nearest training samples, and uses the most frequent category among those K samples as the prediction result.

According to the embodiment of the application, an enterprise information integration model is constructed using big data analysis and machine learning technology. Information such as the enterprise name, unified social credit code, registration number, organization code, taxpayer identification number, legal representative, enterprise registration address, contact phone, business scope, and issued bonds is parsed from massive amounts of unstructured risk information through word segmentation. Based on this information, basic attribute features are constructed, such as registration address similarity, enterprise name similarity, legal representative similarity, and phone similarity. In addition, based on the user behavior logs and buried point information data of each business line, product line, channel, service version, and customer group as products are used, multi-dimensional time sequence statistical features are constructed, such as the number of searches by a user in each period, the user's click rate in each period, the user's dwell time on a page, the user's access depth, and the user's number of repeated searches, which describe the matching relationship between a fuzzy-searched enterprise and a click-queried enterprise. A k-nearest-neighbor model is then constructed by combining the two major kinds of features; it computes the k nearest neighbors between that matching relationship and the exact matching relationship currently obtained through business registration information and locates the suspected matching relationship, thereby dynamically identifying enterprise customer risk.
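The generic KNN procedure just described can be sketched in a few lines. Euclidean distance and the toy labels below are assumptions chosen for illustration; `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs; Euclidean distance is
    used here purely for illustration.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


train = [
    ((0.0, 0.0), "no_match"),
    ((0.1, 0.2), "no_match"),
    ((1.0, 1.0), "match"),
    ((0.9, 1.1), "match"),
    ((1.2, 0.8), "match"),
]
prediction = knn_predict(train, (1.0, 0.9), k=3)
```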

In some embodiments, the step S204 may be specifically implemented by the following steps: constructing a feature vector of each sample according to the basic attribute features and the multidimensional time sequence statistical features; according to cosine similarity of the feature vectors, determining the distance between each matching relation and the target enterprise; and acquiring a nearest neighbor set according to the distance between each matching relation and the target enterprise.

In this embodiment, a feature vector is constructed for each sample from the input basic attribute features and time sequence statistical features. Cosine similarity is then used to measure the distance between each correspondence (a user fuzzy-searching for one enterprise and clicking to query another) and the enterprises in the business registration information, generating multiple k-nearest-neighbor sets. The multi-dimensional time sequence statistical features represent the correspondence between the enterprise a user fuzzy-searches for and the enterprise the user clicks to query.

In this embodiment, cosine similarity is calculated over the basic attribute features and the multi-dimensional time sequence statistical features. Based on this similarity, the enterprises in the business registration information that best match the correspondence (i.e., multiple k-nearest-neighbor sets) can be found, and the suspected matching relationship can then be located through these sets, so that enterprise risk is identified dynamically.

Cosine similarity is a measure of the similarity between two vectors, evaluated as the cosine of the angle between them. Its value ranges from -1 to 1: 1 indicates that the two vectors point in the same direction, 0 indicates that they are orthogonal (no similarity), and -1 indicates that they point in opposite directions.

According to the embodiment of the application, cosine similarity is used to measure the distance between each correspondence (a user fuzzy-searching for one enterprise and clicking to query another) and the enterprises in the business registration information, generating multiple k-nearest-neighbor sets. For enterprises whose information has changed, for example through renaming, the renamed enterprise can be identified accurately and its risk information grasped accurately, which avoids risk identification loopholes in scenarios such as enterprise renaming and improves the operational safety of financial institutions.
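A minimal sketch of retrieving a nearest neighbor set by cosine distance is given below. The feature values and identifiers such as `pair_A` are invented, and treating distance as one minus cosine similarity is an assumption; the application only states that the distance is determined from cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def nearest_neighbor_set(target_vec, candidates, k=2):
    """Return the ids of the k candidates with the smallest cosine distance to the target."""
    scored = [
        (1.0 - cosine_similarity(target_vec, vec), cid)
        for cid, vec in candidates.items()
    ]
    scored.sort()
    return [cid for _, cid in scored[:k]]


target = [1.0, 0.8, 0.1]
candidates = {
    "pair_A": [0.9, 0.9, 0.0],
    "pair_B": [0.0, 0.1, 1.0],
    "pair_C": [1.0, 0.7, 0.2],
}
neighbors = nearest_neighbor_set(target, candidates, k=2)
```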
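For reference, the standard definition of cosine similarity for two n-dimensional vectors is given below. The symbols u, v, and d are introduced here and do not appear in the application; taking the distance as one minus the similarity is a common convention and an assumption in this context.

```latex
\cos\theta
  = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
  = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}}\;\sqrt{\sum_{i=1}^{n} v_i^{2}}},
\qquad
d(\mathbf{u},\mathbf{v}) = 1 - \cos\theta
```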

In some embodiments, the step S205 may be specifically implemented by the following steps: obtaining a prediction score for each suspected matching relationship in the nearest neighbor set; integrating the prediction scores of the suspected matching relationships as the prediction space required by the multi-classifier of a gradient boosting decision tree algorithm, and screening out the suspected matching relationship with the highest prediction score as the target matching relationship; integrating the information of the target enterprise according to the target matching relationship to obtain an information-integrated target enterprise; and performing risk identification on the information-integrated target enterprise.

In this embodiment, to minimize overfitting of the decision tree algorithm, a gradient boosting decision tree (GBDT) algorithm is introduced. The prediction scores of the multiple k-nearest-neighbor sets are integrated as the prediction space required by the multi-classifier of the GBDT algorithm, and the matching relationship with the highest score is finally screened out, so that enterprise information integration is continuously improved. The core idea of the GBDT algorithm is to train a series of weak learners (decision trees) and combine them into a strong learner, with each new weak learner fitted to the errors left by those before it. This process iterates until a preset number of iterations is reached or a stop condition is met.

The suspected matching relationship indicates whether a suspected association exists between the enterprise a user fuzzy-searches for, the enterprise the user clicks to query, and an enterprise in the business registration information. For example, enterprise A has actually been renamed enterprise B, but the name in the business registration information has not yet been updated from A to B. In this case, because of the renaming, the user may fuzzy-search for enterprise A and, based on their own judgment, finally click to query enterprise B. By screening for the suspected matching relationship with the highest prediction score, the final target enterprise can be determined more accurately.

According to the embodiment of the application, the GBDT algorithm is introduced, the prediction scores of the multiple k-nearest-neighbor sets are integrated as the prediction space required by the multi-classifier of the GBDT algorithm, and the matching relationship with the highest score is screened out. This minimizes overfitting of the decision tree algorithm and further improves the accuracy of risk identification.
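The boosting idea and the final screening step can be sketched as follows. This is a simplified stand-in rather than the multi-classifier described here: it boosts one-split regression stumps under squared loss and then picks the suspected matching relationship with the highest score. The training rows, feature meanings, and function names (`fit_stump`, `fit_gbdt`, `score`) are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(xs[0])):
        for t in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= t]
            right = [r for x, r in zip(xs, residuals) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None


def fit_gbdt(xs, ys, rounds=20, lr=0.5):
    """Boosting with squared loss: each new stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if x[j] <= t else rm) for x, p in zip(xs, preds)]
    return base, lr, stumps


def score(model, x):
    """Prediction score of one feature vector under the boosted model."""
    base, lr, stumps = model
    return base + sum(lr * (lm if x[j] <= t else rm) for j, t, lm, rm in stumps)


# Toy rows: [name similarity, click rate]; label 1.0 marks a confirmed match.
xs = [[0.9, 0.8], [0.8, 0.3], [0.85, 0.6], [0.2, 0.7], [0.3, 0.2], [0.1, 0.5]]
ys = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
model = fit_gbdt(xs, ys)

# Suspected matching relationships taken from a nearest neighbor set.
suspects = {"pair_A": [0.88, 0.75], "pair_B": [0.25, 0.15], "pair_C": [0.28, 0.9]}
target_match = max(suspects, key=lambda name: score(model, suspects[name]))
```

In the toy data the label depends only on the first feature, so the result is easy to check by hand; a production system would more likely rely on an established gradient boosting library.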

In some embodiments, when preprocessing the basic attribute information, the following steps may be specifically implemented: the basic attribute information is segmented through a word segmentation algorithm of the Chinese language model N-Gram, and information to be processed is obtained through extraction, wherein the information to be processed comprises first information to be processed and second information to be processed; data cleaning is carried out on the content in the first information to be processed; and carrying out standardization processing on the format of the second information to be processed according to a preset standard, wherein the standardization processing comprises at least one of address standardization and contact phone standardization.

Illustratively, the information to be processed includes at least one of business name, uniform social credit code, registration number, organization code, tax identifier, legal representative, business registration address, contact phone, and business scope. The first information to be processed can be enterprise name, operating range and legal representative name, and the second information to be processed can be unified social credit code, registration number, organization code, tax identifier, enterprise registration address and contact phone.

In this embodiment, the basic attribute information is usually in unstructured form, and by using word segmentation technology, information such as enterprise name, unified social credit code, registration number, organization code, tax administration identifier, legal representatives, enterprise registration addresses, contact phones, operation scope and the like can be effectively resolved from massive unstructured basic attribute information associated with enterprise risks. The base attribute information may also include bonds issued by the corporation. In addition, N-Gram is an algorithm based on a statistical language model, and is to perform sliding window operation with the size of N on the content in the text according to bytes, so as to form a byte fragment sequence with the length of N. Each byte segment is called a gram, statistics is carried out on the occurrence frequency of all the grams, filtering is carried out according to a preset threshold value, a key gram list, namely a vector feature space of the text, is formed, and each gram in the list is a feature vector dimension.
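The sliding-window and frequency-threshold steps can be sketched as follows. For readability this sketch slides over characters rather than raw bytes, which is an assumption; the threshold value and the helper names `ngrams`, `build_gram_vocabulary`, and `vectorize` are likewise hypothetical.

```python
from collections import Counter


def ngrams(text: str, n: int = 2):
    """Slide a window of size n over the text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_gram_vocabulary(texts, n=2, min_count=2):
    """Keep grams whose total frequency reaches the threshold; each kept gram is one dimension."""
    counts = Counter(g for t in texts for g in ngrams(t, n))
    return sorted(g for g, c in counts.items() if c >= min_count)


def vectorize(text, vocabulary, n=2):
    """Represent a text as gram counts over the vocabulary."""
    counts = Counter(ngrams(text, n))
    return [counts[g] for g in vocabulary]


texts = ["ABC TRADING CO", "ABC HOLDING CO"]
vocab = build_gram_vocabulary(texts, n=2, min_count=2)
vector = vectorize("ABC TRADING CO", vocab)
```

Only bigrams shared by both sample names survive the threshold, so `vocab` acts as the key gram list and `vector` is the corresponding feature vector for the first name.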

Fig. 3 is a schematic flow chart of data preprocessing provided in an embodiment of the present application. As shown in Fig. 3, the preprocessing includes the following steps: step S301, data cleaning; step S302, standardization processing; step S303, splicing the various data information.

In this embodiment, data cleaning is mainly applied to the enterprise name, business scope, and legal representative name. Standardization processing mainly standardizes the unified social credit code, registration number, organization code, and taxpayer identification number according to the relevant standards (such as national standards), and also covers address standardization and contact phone standardization. After data cleaning and standardization are complete, the resulting data can be spliced together to form an integrated enterprise information file.
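As a rough illustration of steps S302 and S303, the sketch below standardizes a contact phone and splices cleaned and standardized fields into one record. The application does not specify the target phone format, so the rule used here (keep ASCII digits only and drop a leading country code 86 or 0086) is purely an assumption, as are the names `normalize_phone` and `build_profile` and the field keys.

```python
import re


def normalize_phone(raw: str) -> str:
    """Hypothetical phone standardization: digits only, country code removed."""
    digits = re.sub(r"[^0-9]", "", raw)
    if digits.startswith("0086"):
        digits = digits[4:]
    elif digits.startswith("86") and len(digits) > 11:
        digits = digits[2:]
    return digits


def build_profile(cleaned: dict, standardized: dict) -> dict:
    """Splice cleaned and standardized fields into one enterprise record."""
    return {**cleaned, **standardized}


profile = build_profile(
    {"name": "EXAMPLE CO"},
    {"phone": normalize_phone("+86 138-0013-8000")},
)
```

A production rule set would follow whatever preset standard the institution adopts; the point of the sketch is only that each field is brought to a single format before the fields are merged.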

Information such as the enterprise name, unified social credit code, registration number, organization code, legal representative, enterprise registration address, contact phone, business scope, issued bonds, owned trademarks, and owned patents can be extracted from the enterprise's basic registration information, financing information, and intellectual property information. The word segmentation algorithm based on the N-Gram model can also be used to extract this information from court judgment documents, enterprise annual reports, announcements, and news articles.

Further, data cleaning of the enterprise name, business scope, and legal representative name may be implemented by the following steps: deleting the spaces in the first information to be processed; converting target symbols in the first information to be processed from full-width to half-width, wherein the target symbols comprise at least one of punctuation marks, digits, and English letters; and converting lower-case English letters in the first information to be processed to upper case.
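The three cleaning steps can be sketched in a few lines. The function name `clean_field` is an assumption for this example. The full-width conversion relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts.

```python
def clean_field(value: str) -> str:
    """Remove whitespace, convert full-width symbols to half-width, upper-case ASCII letters."""
    out = []
    for ch in value:
        if ch.isspace():  # also covers the full-width space U+3000
            continue
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:  # full-width punctuation, digits, letters
            ch = chr(code - 0xFEE0)
        if "a" <= ch <= "z":  # only lower-case English letters are changed
            ch = ch.upper()
        out.append(ch)
    return "".join(out)


cleaned = clean_field("\uff41\uff42\uff43\u3000Co \uff08\uff11\uff09")
```

Chinese characters pass through unchanged, so the same function can be applied to enterprise names, business scopes, and legal representative names alike.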

In this embodiment, the enterprise name, business scope, and legal representative name are extracted from the enterprise's basic registration information. Because that information may contain invalid characters (such as spaces) or inconsistent character formats (such as mixed upper and lower case), data cleaning helps ensure that the extracted basic attribute information is accurate and complete.

According to this embodiment of the application, information extraction and preprocessing are carried out using natural language processing technology, which helps ensure the accuracy and completeness of the basic attribute information associated with an enterprise and addresses the problem that existing frameworks for acquiring and integrating risk information are immature and that system capability is not yet sufficient to support the digital transformation of risk control. Preprocessing also allows enterprise risk information to be integrated faster and more accurately, mitigating the difficulty of fusing internal and external data that arises from numerous data sources with differing standards.

In some embodiments, the basic attribute features may be obtained as follows: acquiring enterprise name similarity, enterprise address similarity, enterprise business scope similarity, whether the enterprise representatives are consistent, whether the contact phones are consistent, and whether the enterprise registration information is consistent as first features; acquiring whether the virtual products issued by the target enterprise are consistent as a second feature; acquiring whether the product patterns owned by the enterprise are consistent and whether the owned product technologies are consistent as third features; and taking the first features, the second feature, and the third features as the basic attribute features.

In this embodiment, the basic attribute features fall into three types: the first feature (which may be expressed as an enterprise basic feature), the second feature (which may be expressed as a financing feature), and the third feature (which may be expressed as an intellectual property feature). Similarity or consistency may be determined by comparison. Taking the enterprise name similarity, enterprise address similarity, and enterprise business scope similarity as an example: the enterprise name, enterprise address, and enterprise business scope are each mapped to a low-dimensional vector by a locality-sensitive hashing algorithm; the Hamming distance is then computed between the low-dimensional vectors mapped from the standardized enterprise name, enterprise address, and enterprise business scope in the user behavior log and buried point information data and the low-dimensional vectors mapped from the enterprise name, enterprise address, and enterprise business scope of the target enterprise, yielding the enterprise name similarity, enterprise address similarity, and enterprise business scope similarity.

By way of example, Fig. 4 is a schematic diagram of the basic attribute features provided in an embodiment of the present application. As shown in Fig. 4, the basic attribute features include, for example, enterprise name similarity, enterprise address similarity, enterprise business scope similarity, enterprise representative similarity, and the like. For the enterprise name similarity, enterprise address similarity, and enterprise business scope similarity, a locality-sensitive hashing algorithm may be used to map the standardized enterprise name, enterprise address, and enterprise business scope to 64 bits each; the 64-bit Hamming distance between each item in the user behavior log and buried point information data and the corresponding item in the industrial and commercial registration information is then compared to obtain the three similarities.
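One way to obtain a 64-bit locality-sensitive fingerprint and compare it by Hamming distance is SimHash, sketched here. The application names only "a locality-sensitive hashing algorithm", so the choice of SimHash, the use of character bigrams as tokens, the MD5-based token hash, and the function names are all assumptions made for illustration.

```python
import hashlib


def simhash64(text: str, n: int = 2) -> int:
    """64-bit SimHash over character n-grams (an assumed LSH variant)."""
    weights = [0] * 64
    tokens = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    for tok in tokens:
        # Take the first 8 bytes of MD5 as a stable 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def similarity(a: str, b: str) -> float:
    """Map the 64-bit Hamming distance to a similarity in [0, 1]."""
    return 1.0 - hamming(simhash64(a), simhash64(b)) / 64


name_similarity = similarity("EXAMPLE TRADING CO LTD", "EXAMPLE TRADING COMPANY LTD")
```

Texts that share most of their bigrams tend to produce fingerprints that differ in few bits, which is the property the Hamming comparison relies on.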

Fig. 5 is a schematic diagram of the multi-dimensional time sequence statistical features provided in an embodiment of the present application. As shown in Fig. 5, the multi-dimensional time sequence statistical features include at least the number of searches for a correspondence, the click rate of the correspondence, the dwell time of the user on a page, the user access depth, and the number of repeated searches. The correspondence refers to the matching relationship between the enterprise a user finds by fuzzy search and the enterprise the user then clicks to query.
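A minimal sketch of how such statistics might be aggregated per correspondence is given here. The event layout (a tuple of search query, clicked enterprise or `None`, dwell time in seconds, and access depth), the function name, and the exact definition of each statistic are assumptions for this example; the application does not define them at this level of detail.

```python
from collections import defaultdict


def correspondence_features(events):
    """Aggregate per-(query, clicked enterprise) statistics from log events."""
    searches = defaultdict(int)  # query -> number of searches
    stats = defaultdict(lambda: {"clicks": 0, "dwell": 0.0, "depth": 0})
    for query, clicked, dwell, depth in events:
        searches[query] += 1
        if clicked is not None:
            s = stats[(query, clicked)]
            s["clicks"] += 1
            s["dwell"] += dwell
            s["depth"] += depth
    features = {}
    for (query, clicked), s in stats.items():
        features[(query, clicked)] = {
            "search_count": searches[query],
            "click_rate": s["clicks"] / searches[query],
            "avg_dwell": s["dwell"] / s["clicks"],
            "avg_depth": s["depth"] / s["clicks"],
            "repeat_searches": searches[query] - 1,
        }
    return features


events = [
    ("acme", "Acme Ltd", 30.0, 3),
    ("acme", "Acme Ltd", 10.0, 1),
    ("acme", None, 0.0, 0),
    ("acme", "Acme Inc", 5.0, 2),
]
features = correspondence_features(events)
```

In practice the same aggregation would be computed over time windows so that the features form a time sequence rather than a single snapshot.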

According to this embodiment of the application, a KNN model is built by combining the basic attribute features and the multi-dimensional time sequence statistical features of the enterprise. The k nearest neighbors can then be computed between each matching relationship (an enterprise found by fuzzy search and the enterprise subsequently clicked and queried) and the exact matching relationship in the current business registration information, so that suspected matching relationships are located. In this way, the risk of an enterprise client is identified dynamically, enterprise risk information is integrated effectively, and risk management and control becomes more timely and efficient.

Fig. 6 is an overall flowchart of enterprise risk information identification provided in an embodiment of the present application. As shown in Fig. 6, the flow includes the following steps: step S601, data preparation and data sample screening; step S602, data preprocessing; step S603, establishing a multi-dimensional feature system; step S604, constructing a KNN model.

In this embodiment, step S601 is a data preparation stage, mainly labeling data samples. Step S602 is a data preprocessing stage, mainly cleaning and standardizing data, and word segmentation is performed on unstructured text. Step S603 is feature engineering, and a multidimensional feature system is established. Step S604 is to construct a KNN model, and establish a suspected matching relationship.

Further, Fig. 7 is a schematic diagram of screening suspected matching relationships provided in an embodiment of the present application. As shown in Fig. 7, the screening includes the following steps: step S701, constructing feature vectors; step S702, calculating the metric distance; step S703, generating a plurality of K-nearest-neighbor sets; step S704, integrating with multiple classifiers; step S705, recommending the N matching relationships with the highest scores.

In this embodiment, the basic attribute features and the time sequence statistical features are taken as input, and a feature vector is constructed for each sample. A cosine similarity measure is used to compute the distance between each correspondence (an enterprise found by fuzzy search and the enterprise subsequently clicked and queried) and the enterprise in the business registration information, generating a plurality of k-nearest-neighbor sets. In addition, to minimize overfitting of the decision tree algorithm, a gradient boosting decision tree (GBDT) algorithm is introduced: the prediction scores of the plurality of k-nearest-neighbor sets are integrated as the prediction space required by the multiple classifiers of the GBDT algorithm, and the matching relationships with the highest scores are finally screened out, so that enterprise information integration is continuously improved.
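The distance and neighbor-selection steps (S702 and S703) can be sketched as follows. This covers only the cosine-based k-nearest-neighbor search and not the GBDT integration; the function names and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def k_nearest(target, candidates, k):
    """Return the names of the k candidates closest to the target vector."""
    ranked = sorted(candidates.items(), key=lambda kv: cosine_distance(target, kv[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical feature vectors for three candidate matching relationships.
candidates = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = k_nearest([1.0, 0.0], candidates, k=2)
```

Each feature vector would in practice concatenate the basic attribute features with the time sequence statistical features of one candidate matching relationship.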

The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Fig. 8 is a schematic structural diagram of an enterprise risk information processing apparatus according to an embodiment of the present application. As shown in Fig. 8, the enterprise risk information processing apparatus 800 includes a data acquisition module 810, an information preprocessing module 820, a feature construction module 830, a set acquisition module 840, and a risk identification module 850. The data acquisition module 810 is configured to acquire basic attribute information, user behavior logs, and buried point information data associated with the target enterprise risk. The information preprocessing module 820 is configured to preprocess the basic attribute information to obtain basic attribute features. The feature construction module 830 is configured to construct multi-dimensional time sequence statistical features according to the user behavior log and the buried point information data, where the multi-dimensional time sequence statistical features characterize the matching relationship between the enterprise a user finds by fuzzy search and the enterprise the user clicks to query. The set acquisition module 840 is configured to obtain a nearest neighbor set according to the basic attribute features and the multi-dimensional time sequence statistical features, where the nearest neighbor set includes K neighbors, K is a positive integer, and the K neighbors characterize the K suspected matching relationships, among the matching relationships, whose distance to the target enterprise satisfies a preset condition. The risk identification module 850 is configured to perform risk identification on the target enterprise according to the K suspected matching relationships in the nearest neighbor set.

Optionally, the user behavior log and the buried point information data include: the number of searches for a target correspondence, the average number of searches for the target correspondence within a preset time period, the click rate of the target correspondence, the average click rate of the target correspondence, the number of times the user's dwell time on a page falls within a first target time range, the user access depth, the average user access depth, the number of repeated searches for the searched enterprise within a second target time range, and the page jump rate, where the target correspondence characterizes the correspondence between the enterprise a user finds by fuzzy search and the enterprise the user clicks to query.

Optionally, the set acquisition module may specifically be configured to: constructing a feature vector of each sample according to the basic attribute features and the multidimensional time sequence statistical features; according to cosine similarity of the feature vectors, determining the distance between each matching relation and the target enterprise; and acquiring a nearest neighbor set according to the distance between each matching relation and the target enterprise.

Optionally, the risk identification module may specifically be configured to: obtain a prediction score for each suspected matching relationship in the nearest neighbor set; integrate the prediction scores of the suspected matching relationships as the prediction space required by the multiple classifiers of a gradient boosting decision tree algorithm, and screen out the suspected matching relationship with the highest prediction score as the target matching relationship; integrate the information of the target enterprise according to the target matching relationship to obtain the information-integrated target enterprise; and perform risk identification on the information-integrated target enterprise.

Optionally, the information preprocessing module may specifically be configured to: segment the basic attribute information by an N-Gram word segmentation algorithm (a statistical language model applied to Chinese text) and extract information to be processed, the information to be processed comprising first information to be processed and second information to be processed; perform data cleaning on the content of the first information to be processed; and standardize the format of the second information to be processed according to a preset standard, wherein the standardization comprises at least one of address standardization and contact phone standardization.

Optionally, the information preprocessing module may specifically be configured to: delete the spaces in the first information to be processed; convert target symbols in the first information to be processed from full-width to half-width, wherein the target symbols comprise at least one of punctuation marks, digits, and English letters; and convert lower-case English letters in the first information to be processed to upper case.

Optionally, the information preprocessing module may specifically be configured to: acquire enterprise name similarity, enterprise address similarity, enterprise business scope similarity, whether the enterprise representatives are consistent, whether the contact phones are consistent, and whether the enterprise registration information is consistent as first features; acquire whether the virtual products issued by the target enterprise are consistent as a second feature; acquire whether the product patterns owned by the enterprise are consistent and whether the owned product technologies are consistent as third features; and take the first features, the second feature, and the third features as the basic attribute features.

Optionally, the information preprocessing module may specifically be configured to: map the enterprise name, enterprise address, and enterprise business scope to low-dimensional vectors by a locality-sensitive hashing algorithm; and compute the Hamming distance between the low-dimensional vectors mapped from the standardized enterprise name, enterprise address, and enterprise business scope in the user behavior log and buried point information data and the low-dimensional vectors mapped from the enterprise name, enterprise address, and enterprise business scope of the target enterprise, to obtain the enterprise name similarity, enterprise address similarity, and enterprise business scope similarity.

The apparatus provided in this embodiment of the present application may be used to perform the method of the foregoing embodiments; its implementation principle and technical effects are similar and are not repeated here.

It should be understood that the division of the above apparatus into modules is merely a division of logical functions; in an actual implementation the modules may be fully or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software invoked by a processing element and others in hardware. For example, the data acquisition module may be a separately provided processing element, may be integrated in a chip of the above apparatus, or may be stored in a memory of the above apparatus in the form of program code that a processing element of the apparatus invokes to execute the functions of the module. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processing element or by instructions in the form of software.

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 9, the electronic device 900 includes at least one processor 901, a memory 902, a bus 903, and a communication interface 904. The processor 901, the communication interface 904, and the memory 902 communicate with one another via the bus 903. The communication interface 904 is used to communicate with other devices, and includes a communication interface for data transmission, a display interface or an operation interface for human-computer interaction, and the like. The processor 901 is configured to execute computer-executable instructions stored in the memory 902, and may specifically perform the relevant steps of the methods described in the above embodiments.

The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs. The memory is used for storing computer-executable instructions. The memory may comprise high-speed RAM and may also comprise non-volatile memory, such as at least one disk memory.

The present embodiment also provides a computer-readable storage medium having stored therein computer instructions which, when executed by at least one processor of an electronic device, perform the methods provided by the various embodiments described above.

In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it; in a formula, the character "/" indicates a "division" relationship between them. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or plural.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the above embodiments may be combined in any way. For brevity, not all possible combinations are described, but every such combination should be considered within the scope of this description.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. An enterprise risk information processing method, which is characterized by comprising the following steps:

2. The method of claim 1, wherein the user behavior log and the buried point information data comprise: the number of searches for a target correspondence, the average number of searches for the target correspondence within a preset time period, the click rate of the target correspondence, the average click rate of the target correspondence, the number of times the user's dwell time on a page falls within a first target time range, the user access depth, the average user access depth, the number of repeated searches for the searched enterprise within a second target time range, and the page jump rate, wherein the target correspondence characterizes the correspondence between the enterprise a user finds by fuzzy search and the enterprise the user clicks to query.

3. The method of claim 1, wherein the obtaining the nearest neighbor set from the base attribute features and the multi-dimensional timing statistics comprises:

constructing a feature vector of each sample according to the basic attribute features and the multidimensional time sequence statistical features;

according to cosine similarity of the feature vectors, determining the distance between each matching relation and the target enterprise;

and acquiring a nearest neighbor set according to the distance between each matching relation and the target enterprise.

4. The method of claim 1, wherein the performing risk identification on the target enterprise according to K suspected matching relationships in the nearest neighbor set comprises:

obtaining a prediction score of each suspected matching relationship in the nearest neighbor set;

integrating the prediction scores of the suspected matching relationships as the prediction space required by the multiple classifiers of a gradient boosting decision tree algorithm, and screening out the suspected matching relationship with the highest prediction score as a target matching relationship;

integrating the information of the target enterprise according to the target matching relationship to obtain an information-integrated target enterprise;

and performing risk identification on the information-integrated target enterprise.

5. The method of claim 1, wherein the preprocessing the base attribute information comprises:

segmenting the basic attribute information by an N-Gram word segmentation algorithm, N-Gram being a statistical language model applied to Chinese text, and extracting information to be processed, the information to be processed comprising first information to be processed and second information to be processed;

performing data cleaning on the content in the first information to be processed;

and carrying out standardization processing on the format of the second information to be processed according to a preset standard, wherein the standardization processing comprises at least one of address standardization and contact phone standardization.

6. The method of claim 5, wherein the data cleansing the content of the first information to be processed comprises:

deleting the spaces in the first information to be processed;

converting target symbols in the first information to be processed from full-width to half-width, wherein the target symbols comprise at least one of punctuation marks, digits, and English letters;

and converting lower-case English letters in the first information to be processed to upper case.

7. The method of claim 1, wherein the obtaining basic attribute information associated with the target enterprise risk comprises:

acquiring enterprise name similarity, enterprise address similarity, enterprise business scope similarity, whether the enterprise representatives are consistent, whether the contact phones are consistent, and whether the enterprise registration information is consistent as first features;

obtaining whether virtual products issued by target enterprises are consistent or not as second characteristics;

acquiring whether product patterns owned by enterprises are consistent and whether owned product technologies are consistent as a third characteristic;

and taking the first feature, the second feature and the third feature as the basic attribute features.

8. The method of claim 7, wherein the obtaining the business name similarity, the business address similarity, the business scope similarity comprises:

mapping the enterprise name, the enterprise address, and the enterprise business scope to low-dimensional vectors by a locality-sensitive hashing algorithm;

and computing the Hamming distance between the low-dimensional vectors mapped from the standardized enterprise name, enterprise address, and enterprise business scope in the user behavior log and buried point information data and the low-dimensional vectors mapped from the enterprise name, enterprise address, and enterprise business scope of the target enterprise, to obtain the enterprise name similarity, the enterprise address similarity, and the enterprise business scope similarity.

9. An enterprise risk information processing apparatus, comprising:

10. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;

the memory stores computer-executable instructions;

the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 8.

11. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1 to 8.