CN115358481A - Early warning and identification method, system and device for enterprise ex-situ migration - Google Patents


Info

Publication number
CN115358481A
Authority
CN
China
Prior art keywords
enterprise
feature
data set
early warning
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211083680.9A
Other languages
Chinese (zh)
Inventor
谢国城
陈业强
徐少强
桂进军
曾庆发
廖小文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Eshore Technology Co Ltd
Original Assignee
Guangdong Eshore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Eshore Technology Co Ltd
Priority to CN202211083680.9A
Publication of CN115358481A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Abstract

The invention discloses an early warning and identification method for enterprise out-migration, belonging to the technical field of enterprise risk control in big data and AI. The method can objectively and accurately predict and evaluate enterprise out-migration in quantitative terms, and features high accuracy, good robustness and excellent overall generalization. The method comprises the following steps: using a feature collector to acquire, according to specified requirements, qualifying enterprise industrial and commercial information data, operator data and index data related to enterprise support policies, and matching, splicing, cleaning and standardizing the data to obtain an enterprise feature data set; using a feature selector to select important feature items from all feature items of the enterprise feature data set to generate an important feature item subset; using a feature fusion device to fuse the important feature item subset into a fused feature data set; and randomly dividing the fused feature data set into an enterprise feature data set for training and an enterprise feature data set for testing. The invention also discloses a system and a device for the early warning and identification.

Description

Early warning and identification method, system and device for enterprise out-migration
Technical Field
The invention relates to the technical field of enterprise risk control in big data and AI, and in particular to a method, a system and a device for early warning and identification of enterprise out-migration.
Background
With the development of the social economy and institutional reform, enterprises relocate out of a region in consideration of factors such as business positioning, rising costs, environment and policy. Although enterprise out-migration is normal market behavior, for a local government it reduces fiscal and tax revenue, hurts employment, and affects the stable and healthy development of the regional economy and hence the growth of local GDP; in high-tech industries in particular, it affects the transformation and upgrading of regional industry. At present, local governments learn of enterprise out-migration with a certain lag, so they cannot anticipate it or propose targeted countermeasures. Local governments therefore urgently need a method for effectively identifying enterprises at risk of out-migration.
Traditional machine learning methods are currently used in the industry to mine potential out-migrating enterprises, but they mainly extract enterprise industrial and commercial data, preprocess it simply, and obtain a mining model with a single conventional machine learning algorithm and training mode, for example logistic regression or random forest, which finally outputs the probability that an enterprise is about to leave. The resulting prediction performance is not ideal. As a result, the relevant government departments cannot quickly and accurately identify in advance the enterprises likely to move out, and cannot prepare countermeasures or formulate scientific and reasonable policies ahead of time.
Therefore, an early warning and identification method for enterprise out-migration urgently needs to be designed so as to give local governments more accurate advance judgment.
Disclosure of Invention
The first object of the invention is to provide an early warning and identification method for enterprise out-migration, which can objectively and accurately predict and evaluate enterprise out-migration in quantitative terms and features high accuracy, good robustness and excellent overall generalization.
The second object of the invention is to provide an early warning and identification system for enterprise out-migration, which can effectively support early warning and identification of enterprise out-migration risk. The third object of the invention is to provide a device for implementing the early warning and identification system.
The first technical scheme adopted by the invention is as follows:
An early warning and identification method for enterprise out-migration comprises the following steps:
(1) Using a feature collector to acquire, according to specified requirements, qualifying enterprise industrial and commercial information data, operator data of the enterprise, and index data related to enterprise support policies at the enterprise's location; matching, splicing, cleaning and standardizing the data to obtain an enterprise feature data set; and matching existing examples of out-migrated enterprises against the enterprise feature data set;
(2) Using a feature selector to select important feature items from all feature items of the enterprise feature data set to generate an important feature item subset;
(3) Using a feature fusion device to fuse the important feature item subset into a fused feature data set;
(4) Randomly dividing the fused feature data set into an enterprise feature data set for training and an enterprise feature data set for testing;
(5) Inputting the enterprise feature data set for training into an algorithm strategy module to obtain an early warning identification model, inputting the enterprise feature data set for testing into the early warning identification model to obtain the set of out-migration probabilities of the test enterprises, and calculating the evaluation value of the obtained early warning identification model;
(6) Selecting an early warning identification model with a good evaluation value from those obtained by training, and inputting the feature data of the enterprise requiring early warning identification into the selected model to obtain an output value, which is the out-migration probability of the enterprise.
Further, the step (1) comprises the following steps:
(1.1) starting the feature collector, calling a multi-source data interface, and acquiring from the enterprise industrial and commercial information data dimension index data of enterprises nationwide, such as enterprise code, enterprise name, unified social credit code, enterprise type, latest industry type of the enterprise, registered capital, paid-in capital, operating duration, business scope, number of employees covered by social insurance, number of outbound investments, and A-level taxpayer rating;
(1.2) acquiring from the operator data dimension index data of existing enterprise customers, such as certificate number, number of fixed-line telephones installed, number of broadband lines installed, number of fixed-line telephones installed last month, number of broadband lines installed last month, number of newly installed fixed-line telephones, number of newly installed broadband lines, number of fixed-line telephones removed, number of broadband lines removed, number of fixed-line telephones relocated, and number of broadband lines relocated;
(1.3) acquiring from a policy data source the index data related to enterprise support policies at the enterprise's location, matching the industrial and commercial enterprise portrait data with the operator enterprise portrait data by unified social credit code, and matching the index data related to enterprise support policies by enterprise location, to obtain a spliced original feature data set;
(1.4) cleaning and standardizing the spliced original feature data set to obtain the enterprise feature data set;
(1.5) matching existing examples of out-migrated enterprises against the enterprise feature data set, marking successfully matched enterprises as positive examples and unmatched enterprises as negative examples.
Further, the step (1) further comprises the following steps:
(1.6) using a feature generator to perform, on the original feature data set, time processing based on manual experience, data binning, combined classification processing, and automated feature generation by calling the Featuretools module;
(1.7) using a feature extractor to perform one-hot encoding, LDA processing, cosine similarity processing or neural network processing on the original feature data set.
Further, the step (2) comprises the following steps:
(2.1) selecting an optimal feature item from all feature items of the enterprise feature data set as a selected set;
(2.2) selecting an optimal feature item from the feature items in the enterprise feature data set except the selected set, adding the feature item into the selected set, and calculating the gain value of the selected set at the moment;
(2.3) evaluating the gain value of the selected set, and if the gain value of the selected set is not the maximum, repeating the step (2.2);
(2.4) combining the data in the enterprise feature data set corresponding to each feature item in the selected set to generate the important feature item subset.
Further, in step (2.2), the gain value is calculated according to the formula:
$$\mathrm{Gain}(A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v)$$
where $\mathrm{Gain}(A)$ is the gain value of the selected set, $\mathrm{Ent}(D)$ is the information entropy of the enterprise feature data set, $D$ is the enterprise feature data set, $V$ is the number of subsets of the enterprise feature data set, and $D^v$ is the $v$-th subset of the enterprise feature data set;
the formula of the information entropy $\mathrm{Ent}(D)$ is:
$$\mathrm{Ent}(D) = -\sum_{k=1}^{n} p_k \log_2 p_k$$
where $p_k$ is the proportion of samples of the $k$-th class in the enterprise feature data set, and $n$ is the number of sample classes in the enterprise feature data set.
Further, in step (5), the evaluation value of the early warning identification model is calculated as:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where F1 is the evaluation value, Precision is the proportion of true positive samples among the samples predicted as positive by the early warning identification model, and Recall is the proportion of true positive samples among the samples that are actually positive.
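The F1 evaluation value described above can be computed directly from predicted and actual labels. The following is a minimal sketch, assuming binary labels where 1 marks an out-migrated enterprise; the function name and the sample labels are hypothetical and not taken from the patent.

```python
def f1_score(y_true, y_pred):
    """Return F1 = 2 * Precision * Recall / (Precision + Recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when it is undefined
    return 2 * precision * recall / (precision + recall)


# Hypothetical test labels and model predictions.
actual = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
score = f1_score(actual, predicted)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so Precision and Recall are both 2/3.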
Further, in step (6), the output value ranges from 0 to 1: the enterprise out-migration risk is low when the output value is less than 0.25, medium when it is greater than or equal to 0.25 and less than 0.5, high when it is greater than or equal to 0.5 and less than 0.75, and extremely high when it is greater than or equal to 0.75 and less than 1.
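The banding of the output probability can be sketched as a simple threshold lookup. This is an illustration only: the function name is hypothetical, the label "high" for the 0.5 to 0.75 band is treated as an assumption, and assigning an output of exactly 1 to the top band is also an assumption, since the text leaves that endpoint open.

```python
def risk_level(probability):
    """Map an out-migration probability in [0, 1] to a risk band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < 0.25:
        return "low"
    if probability < 0.5:
        return "medium"
    if probability < 0.75:
        return "high"  # assumption: name of the third band
    return "extremely high"  # assumption: includes probability == 1.0


levels = [risk_level(p) for p in (0.1, 0.25, 0.5, 0.75)]
```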
The second technical scheme adopted by the invention is as follows:
a system for early warning identification of enterprise migrations, comprising:
a feature collector: used for acquiring, cleaning and standardizing the enterprise industrial and commercial information data requiring early warning identification, the operator's enterprise portrait data, and the index data related to enterprise support policies at the enterprise's location;
a feature selector: used for selecting important features from the features output by the feature collector;
a feature fusion device: used for fusing the features output by the feature collector and the features output by the feature selector;
an algorithm strategy module: used for training a plurality of weak learners and fusing their output results to construct the early warning identification model;
the output end of the feature collector is connected to the input end of the feature selector, the output end of the feature selector is connected to the input end of the feature fusion device, and the output end of the feature fusion device is connected to the input end of the algorithm strategy module.
Further, the system also comprises:
a feature generator: used for creating new data features from the data output by the feature collector by means of model label prediction;
a feature extractor: used for extracting features from the text data in the data output by the feature generator;
the output end of the feature collector is connected to the input end of the feature generator and the input end of the feature extractor respectively, and the output ends of the feature generator and the feature extractor are both connected to the input end of the feature selector.
The third technical scheme adopted by the invention is as follows:
an apparatus for early warning of enterprise migration, comprising a memory storing a computer program and a processor, wherein the processor implements the steps of the method according to any one of claims 1-7 when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects: high accuracy, good robustness and excellent overall generalization performance.
1. In the early warning and identification method for enterprise out-migration, features suited to the early warning identification model are screened from complex enterprise samples by correlation analysis, an objective method, which avoids the subjectivity and limitations of manual screening and gives high accuracy, good robustness and excellent overall generalization performance;
by providing the feature selector, the most distinctive information is obtained from multiple related original feature sets and the redundant information arising from correlation between different feature sets is eliminated, improving model performance; feature diversity is also further improved, so the system remains robust even when a system component is not called or is missing;
by providing the feature fusion device, the mapped sample set retains good separability while the computation time for gradient descent to find the optimal solution is shortened, so that subsequent computation improves model precision and reduces precision loss;
by using model fusion to predict enterprise out-migration, accuracy is higher, model robustness is better and overall generalization performance is better;
the feature collector, feature generator, feature extractor, feature fusion device and algorithm strategy module are system components that each work independently and run without interfering with one another.
2. In the early warning and identification system for enterprise out-migration, the feature collector, feature selector, feature fusion device and algorithm strategy module cooperate with one another to form a complete, portable system that can effectively support early warning and identification of enterprise out-migration risk.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of the present system;
FIG. 2 is a schematic structural diagram of a feature collector in the system;
FIG. 3 is a schematic diagram of the structure of a feature generator in the present system;
FIG. 4 is a schematic diagram of the structure of the feature extractor in the present system;
FIG. 5 is a schematic diagram of the structure of the feature selector in the present system;
FIG. 6 is a schematic diagram of the structure of the feature fuser in the present system;
FIG. 7 is a schematic diagram of the structure of the algorithm strategy module in the present system.
Detailed Description
The technical solution of the present invention will be further described in detail with reference to the following embodiments, but the present invention is not limited thereto.
The early warning and identification method for enterprise out-migration of the invention comprises the following steps:
(1) Using a feature collector to acquire, according to specified requirements, qualifying enterprise industrial and commercial information data, operator data of the enterprise, and index data related to enterprise support policies at the enterprise's location; matching, splicing, cleaning and standardizing the data to obtain an enterprise feature data set; and matching existing examples of out-migrated enterprises against the enterprise feature data set.
Wherein, the step (1) comprises the following steps:
(1.1) Starting the feature collector, calling a multi-source data interface, and acquiring from the enterprise industrial and commercial information data dimension index data of enterprises nationwide, such as enterprise code, enterprise name, unified social credit code, enterprise type, latest industry type of the enterprise, registered capital, paid-in capital, operating duration, business scope, number of employees covered by social insurance, number of outbound investments, and A-level taxpayer rating.
(1.2) Acquiring from the operator data dimension index data of existing enterprise customers, such as certificate number, number of fixed-line telephones installed, number of broadband lines installed, number of fixed-line telephones installed last month, number of broadband lines installed last month, number of newly installed fixed-line telephones, number of newly installed broadband lines, number of fixed-line telephones removed, number of broadband lines removed, number of fixed-line telephones relocated, and number of broadband lines relocated.
(1.3) Acquiring from the policy data source the index data related to enterprise support policies at the enterprise's location, matching the industrial and commercial enterprise portrait data with the operator enterprise portrait data by unified social credit code, and matching the index data related to enterprise support policies by enterprise location, to obtain a spliced original feature data set.
(1.4) Cleaning and standardizing the spliced original feature data set to obtain the enterprise feature data set. The feature preprocessing interface of the feature collector is called to handle missing values, outliers, discrete features and numerical features in the spliced original feature data set. Missing-value handling includes but is not limited to mean filling, median filling and sample deletion; outlier handling, applied after data deviating from the distribution is detected, includes but is not limited to mean filling, median filling and sample deletion; discrete-feature handling includes but is not limited to label encoding, one-hot encoding and mean encoding; numerical-feature handling includes but is not limited to interval scaling, binarization and normalization.
(1.5) Matching existing examples of out-migrated enterprises against the enterprise feature data set, marking successfully matched enterprises as positive examples and unmatched enterprises as negative examples.
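Steps (1.3) to (1.5) can be sketched as a join on the unified social credit code, followed by cleaning and labeling. The sketch below is illustrative only: all field names, sample records and function names are hypothetical, enterprises missing from the operator data are dropped (an assumed inner join), and median filling with interval (min-max) scaling is just one of the cleaning options the text lists.

```python
from statistics import median


def build_feature_set(business, operator, policy_by_location, relocated_codes):
    """Join the three sources by credit code and location, then label rows."""
    rows = []
    for code, biz in business.items():
        if code not in operator:  # assumption: keep only matched enterprises
            continue
        row = {"credit_code": code, **biz, **operator[code]}
        row.update(policy_by_location.get(biz["location"], {}))
        # Known out-migrated enterprises are positive examples (1), others 0.
        row["label"] = 1 if code in relocated_codes else 0
        rows.append(row)
    return rows


def fill_median_and_scale(rows, column):
    """Fill missing values with the median, then scale the column to [0, 1]."""
    known = [r[column] for r in rows if r.get(column) is not None]
    fill = median(known)
    for r in rows:
        if r.get(column) is None:
            r[column] = fill
    low = min(r[column] for r in rows)
    span = max(r[column] for r in rows) - low
    for r in rows:
        r[column] = (r[column] - low) / span if span else 0.0
    return rows


# Hypothetical sample records keyed by unified social credit code.
business = {
    "A": {"location": "GZ", "registered_capital": 100.0},
    "B": {"location": "SZ", "registered_capital": None},
    "C": {"location": "GZ", "registered_capital": 300.0},
}
operator = {"A": {"broadband_removed": 0}, "B": {"broadband_removed": 2},
            "C": {"broadband_removed": 1}}
policy = {"GZ": {"policy_count": 5}, "SZ": {"policy_count": 2}}

rows = build_feature_set(business, operator, policy, relocated_codes={"B"})
rows = fill_median_and_scale(rows, "registered_capital")
```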
Further, the method also comprises the following steps:
(1.6) Using a feature generator to perform, on the original feature data set, time processing based on manual experience, data binning, combined classification processing, and automated feature generation by calling the Featuretools module.
When the feature generator is called, a parameter specifies whether the manual-experience feature generation mode or the automated feature generation mode is used. Time processing based on manual experience classifies times to generate features such as weekday or weekend, or morning, afternoon or evening. Data binning performs binning analysis on selected feature data to generate corresponding categorical features. For example, for the registered capital and operating duration fields of an enterprise, the minimum observed value (lower edge), the 25% quantile (Q1), the median, the 75% quantile (Q3) and the maximum observed value are computed, and the value range is divided according to the data distribution into the intervals [minimum observed value, 25% quantile], [25% quantile, median], [median, 75% quantile] and [75% quantile, maximum observed value], which correspond to categories 1, 2, 3 and 4 respectively. Combined classification processing performs feature crossing between fields, generating data features by operations including but not limited to addition, subtraction, multiplication, division and weighting. The automated Featuretools module automatically generates new features through transformation and aggregation operations according to the relationships between data.
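The quartile binning described above can be sketched with the Python standard library. The function name and sample values are hypothetical; because the intervals in the text share their end points, the sketch assumes that a value equal to a cut point falls into the lower category.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_bins(values):
    """Map each value to category 1-4 by the quartile interval it falls in."""
    # Q1, median and Q3; the inclusive method treats the data as a population
    # that includes its own minimum and maximum.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    cuts = [q1, med, q3]
    # Assumption: a value equal to a cut point goes to the lower category.
    return [bisect_left(cuts, v) + 1 for v in values]


# Hypothetical registered-capital values.
registered_capital = [1, 2, 3, 4, 5, 6, 7, 8, 9]
categories = quartile_bins(registered_capital)
```

For these nine values the cut points are 3, 5 and 7, so the values 1 to 3 land in category 1 and the values 8 and 9 in category 4.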
(1.7) Using a feature extractor to perform one-hot encoding, LDA processing, cosine similarity processing or neural network processing on the original feature data set.
The feature extractor encapsulates an LDA topic model for LDA processing, which extracts the topic type of the text of the enterprise support policies at each enterprise's location; the type is then one-hot encoded to obtain the enterprise support policy label. The LDA topic model assumes that a topic can be represented by a distribution over words and a document by a distribution over topics. The LDA generative process is as follows: sample the topic distribution $\theta_i$ of document $i$ from the Dirichlet distribution with parameter $\alpha$; sample the topic $z_{ij}$ of the $j$-th word of document $i$ from the multinomial distribution $\theta_i$; sample the word distribution $\phi_{z_{ij}}$ of topic $z_{ij}$ from the Dirichlet distribution with parameter $\beta$; and finally sample the word $w_{ij}$ from the multinomial word distribution $\phi_{z_{ij}}$. The support policies of the enterprise's location obtained in this way are matched against the type of the enterprise; on a match, the enterprise policy label is assigned the corresponding positive probability value, and otherwise it is assigned 0.
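The LDA generative process above can be simulated in a few lines. This sketch covers only the generative direction (sampling a document), not the inference that recovers topics from policy text; the vocabulary, hyperparameter values and function names are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, beta, vocab, rng):
    """Sample theta, then a topic z and a word w for every word position."""
    n_topics = len(alpha)
    # Word distribution phi of each topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, rng) for _ in range(n_topics)]
    # Topic distribution theta of this document, drawn from Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]  # topic of the word
        w = rng.choices(vocab, weights=phi[z])[0]           # the word itself
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical policy vocabulary and symmetric hyperparameters.
vocab = ["subsidy", "tax", "talent", "rent", "loan"]
rng = random.Random(0)
theta, topics, words = generate_document(
    n_words=6, alpha=[0.5, 0.5], beta=[0.1] * len(vocab), vocab=vocab, rng=rng
)
```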
The feature extractor encapsulates a cosine similarity method for cosine similarity processing, which computes the correlation between the words in an enterprise's business category and the high-frequency words of the enterprise support policies at the enterprise's location, and uses it as a new feature. The correlation is measured as a text similarity based on the Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where $A$ is the set of words in the enterprise's business category and $B$ is the set of high-frequency words of the enterprise support policies at the enterprise's location. In general, word-level approximate computation can be used for short texts, and word vectors based on natural language understanding can also be used as the measure.
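The Jaccard similarity between the two word sets is the size of their intersection divided by the size of their union. A minimal sketch follows, assuming the texts have already been segmented into words; the function name and sample word lists are hypothetical.

```python
def jaccard_similarity(words_a, words_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    # Assumption: two empty sets are given similarity 0.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical business-category words and policy high-frequency words.
business_words = ["software", "development", "services"]
policy_words = ["software", "subsidy", "services", "tax"]
similarity = jaccard_similarity(business_words, policy_words)
```

The two sets share 2 words out of 5 distinct words in total, giving a similarity of 0.4.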
(2) Using a feature selector to select important feature items from all feature items of the enterprise feature data set to generate an important feature item subset.
This step comprises the following sub-steps:
(2.1) Selecting an optimal feature item from all feature items of the enterprise feature data set as the selected set.
(2.2) Selecting an optimal feature item from the feature items of the enterprise feature data set outside the selected set, adding it to the selected set, and calculating the gain value of the selected set at that point.
(2.3) Evaluating the gain value of the selected set, and repeating step (2.2) if the gain value of the selected set is not the maximum.
(2.4) Combining the data in the enterprise feature data set corresponding to each feature item in the selected set to generate the important feature item subset.
Given a feature set $\{a_1, a_2, \ldots, a_n\}$, an optimal single feature item is first selected, for example $\{a_2\}$, as the first-round selected set. One more feature item is then added to it to construct candidate subsets containing two feature items, for example $\{a_2, a_4\}$, and the optimal two-feature subset is taken as the second-round selected set, and so on until no better feature subset can be found. Under this search strategy, for a data set $D$, assume that the proportion of samples of the $i$-th class in $D$ is $p_i$ ($i = 1, 2, \ldots, n$); the information entropy is defined as:
$$\mathrm{Ent}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
For an attribute subset $A$, assume that $D$ is partitioned into $V$ subsets $\{D^1, D^2, \ldots, D^V\}$ according to the values taken on $A$, the samples in each subset taking the same value on $A$; the information gain of attribute subset $A$ can then be calculated as:
$$\mathrm{Gain}(A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v)$$
The larger the information gain $\mathrm{Gain}(A)$, the more information useful for classification the feature subset $A$ contains. For each candidate feature subset, its information gain is therefore computed on the training data set $D$ and used as the evaluation criterion. Combining this feature subset search mechanism with the subset evaluation mechanism is the basic principle of the feature selector.
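The forward search with information gain as the evaluation criterion can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: features are discrete, rows are dictionaries, the search stops as soon as adding a feature no longer strictly increases the gain, and all names and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Ent(D) = -sum(p_k * log2(p_k)) over the classes present in labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, subset):
    """Gain(A): entropy reduction after partitioning D by the values on A."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in subset)].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def forward_select(rows, labels, features):
    """Greedily add the feature that raises the gain most; stop when none does."""
    selected, best_gain = [], 0.0
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        gain, feature = max(
            (info_gain(rows, labels, selected + [f]), f) for f in candidates
        )
        if gain <= best_gain + 1e-12:  # no strict improvement: stop searching
            break
        selected.append(feature)
        best_gain = gain
    return selected, best_gain


# Hypothetical discrete features; the label equals feature "a".
rows = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 0},
]
labels = [0, 0, 1, 1]
selected, gain = forward_select(rows, labels, ["a", "b", "c"])
```

In this toy data feature "a" alone removes all uncertainty (gain of 1 bit), so the search stops after the first round.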
The feature selector encapsulates calls to various feature selection methods, including but not limited to the filter method, the wrapper method and the embedded method.
S21. The filter method first performs feature selection on the data set, and the selected data is then input to train the model. The Relief (Relevant Features) method designs a "relevance statistic" to measure the importance of features. The statistic is a vector in which each component corresponds to one initial feature, and the importance of a feature subset is determined by the sum of the components of the relevance statistic corresponding to the features in the subset. Given a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, for each instance $x_i$, Relief first searches the samples of the same class as $x_i$ for its nearest neighbor $x_{i,nh}$, called the "near-hit", and then searches the samples of a different class from $x_i$ for its nearest neighbor $x_{i,nm}$, called the "near-miss". The component of the relevance statistic corresponding to attribute $j$ is
$$\delta^j = \sum_i \left( -\mathrm{diff}(x_i^j, x_{i,nh}^j)^2 + \mathrm{diff}(x_i^j, x_{i,nm}^j)^2 \right)$$
where $x_a^j$ denotes the value of sample $x_a$ on attribute $j$, and $\mathrm{diff}(x_a^j, x_b^j)$ depends on the type of attribute $j$: if attribute $j$ is discrete, then $\mathrm{diff}(x_a^j, x_b^j) = 0$ when $x_a^j = x_b^j$ and 1 otherwise; if attribute $j$ is continuous, then $\mathrm{diff}(x_a^j, x_b^j) = |x_a^j - x_b^j|$, where $x_a^j$ and $x_b^j$ have been normalized to the interval $[0, 1]$.
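The Relief relevance statistic can be sketched for continuous attributes as below. The sketch assumes all attributes are already normalized to [0, 1], uses squared Euclidean distance to find the near-hit and near-miss, and skips instances that lack either neighbor; the function name and the sample data are hypothetical.

```python
def relief_scores(X, y):
    """Return one relevance component per attribute (continuous case)."""
    n_features = len(X[0])
    scores = [0.0] * n_features

    def distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for k, (x, t) in enumerate(zip(X, y)) if k != i and t == yi]
        misses = [x for x, t in zip(X, y) if t != yi]
        if not hits or not misses:
            continue  # assumption: skip instances without both neighbors
        near_hit = min(hits, key=lambda x: distance(xi, x))
        near_miss = min(misses, key=lambda x: distance(xi, x))
        for j in range(n_features):
            # diff is |x_a - x_b| for continuous attributes in [0, 1].
            scores[j] += -(xi[j] - near_hit[j]) ** 2 + (xi[j] - near_miss[j]) ** 2
    return scores


# Hypothetical data: attribute 0 separates the classes, attribute 1 does not.
X = [[0.0, 0.1], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
scores = relief_scores(X, y)
```

An attribute whose values are close within a class and far apart between classes gets a large positive component, so attribute 0 scores well above attribute 1 here.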
S22. The wrapper method directly uses the performance of the learner that will finally be used as the evaluation criterion for feature subsets; that is, wrapper feature selection aims to select the feature subset most beneficial to the performance of the given learner. The wrapper feature selection method of the invention performs subset search with a random strategy under the Las Vegas method framework, and uses the error of the final classifier as the feature subset evaluation criterion. The Las Vegas algorithm searches the feature subsets randomly, and a stopping-condition control parameter is set to avoid the search running for a long time without stopping, as would happen if that parameter were too large.
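A Las Vegas style wrapper search can be sketched as follows. Several choices here are assumptions made for illustration rather than details from the patent: the learner is a 1-nearest-neighbor classifier, its error is estimated by leave-one-out, ties in error are broken in favor of smaller subsets, and `max_stall` is the stopping-condition control parameter (the number of consecutive random subsets allowed without improvement).

```python
import random


def loo_1nn_error(X, y, subset):
    """Leave-one-out error of a 1-nearest-neighbor classifier on `subset`."""
    errors = 0
    for i, xi in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((xi[j] - X[k][j]) ** 2 for j in subset),
        )
        errors += y[nearest] != y[i]
    return errors / len(X)


def las_vegas_wrapper(X, y, max_stall, rng):
    """Randomly sample feature subsets, keeping the best one seen so far."""
    n_features = len(X[0])
    best_subset = list(range(n_features))
    best_error = loo_1nn_error(X, y, best_subset)
    stall = 0
    while stall < max_stall:  # stop after max_stall tries without improvement
        size = rng.randint(1, n_features)
        candidate = sorted(rng.sample(range(n_features), size))
        error = loo_1nn_error(X, y, candidate)
        better = error < best_error
        smaller = error == best_error and len(candidate) < len(best_subset)
        if better or smaller:
            best_subset, best_error, stall = candidate, error, 0
        else:
            stall += 1
    return best_subset, best_error


# Hypothetical data: attribute 0 separates the classes, attribute 1 does not.
X = [[0.0, 0.1], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
subset, error = las_vegas_wrapper(X, y, max_stall=20, rng=random.Random(0))
```

Because the search is random, the returned subset can vary with the seed; a larger `max_stall` makes finding a small subset more likely at the cost of a longer run.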
S23, the embedded method fuses the feature selection process and the learner training process so that both are completed in the same optimization process; that is, feature selection is performed automatically during learner training. For a given data set
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$
where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, in one embodiment of the regression model of the present invention the optimization objective is
$$\min_w \sum_{i=1}^{m} (y_i - w^{\mathsf T} x_i)^2 + \lambda \lVert w \rVert_1$$
where the regularization parameter $\lambda > 0$ and $\lVert w \rVert_1$ is the $L_1$ norm of $w$. The $L_1$ norm adopted by the invention yields a sparse solution more easily than the $L_2$ norm; that is, the resulting $w$ has fewer non-zero components.
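To illustrate how an L1-regularized least-squares objective drives some weights to exactly zero, the sketch below minimizes the sum of squared errors plus λ times the L1 norm by coordinate descent. The solver choice, the function names, the absence of an intercept term and the toy data are all assumptions made for illustration; the text does not specify how the objective is optimized.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||_1 one coordinate at a time."""
    m, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0
            for i in range(m):
                # Residual of sample i when attribute j is left out of the prediction.
                pred_without_j = sum(w[k] * X[i][k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            z = sum(X[i][j] ** 2 for i in range(m))
            # Closed-form minimizer of z*w_j^2 - 2*rho*w_j + lam*|w_j|.
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w


# Toy data (hypothetical): y depends only on attribute 0; attribute 1 is irrelevant.
X = [[1.0, 1.0], [2.0, -1.0], [-1.0, 1.0], [-2.0, -1.0]]
y = [2.0, 4.0, -2.0, -4.0]
w = lasso_coordinate_descent(X, y, lam=1.0)
```

The weight of the irrelevant attribute comes out as exactly zero, so the attributes with non-zero weights form the selected feature subset.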
(3) Fuse the important feature item subsets into a fused feature data set by using the feature fusion device.
The feature fusion device encapsulates the add fusion method and the concat fusion method. The add fusion method is a parallel strategy that combines two feature vectors into one complex vector: for input features $x_1$ and $x_2$, $z = x_1 + i x_2$, where $i$ is the imaginary unit. The concat fusion method directly concatenates two features: if the input features $x$ and $y$ have dimensions $p$ and $q$, the output feature $z$ has dimension $p + q$.
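The two fusion methods can be sketched in a few lines of Python. The function names are illustrative, and rejecting inputs of unequal length in the add method is an assumption, since the text does not say how a length mismatch is handled.

```python
def add_fusion(x1, x2):
    """Parallel strategy: combine two equal-length vectors into z = x1 + i*x2."""
    if len(x1) != len(x2):
        raise ValueError("add fusion expects vectors of equal length (assumption)")
    return [complex(a, b) for a, b in zip(x1, x2)]


def concat_fusion(x, y):
    """Serial strategy: join a p-dimensional and a q-dimensional vector into p + q dimensions."""
    return list(x) + list(y)


z_add = add_fusion([1.0, 2.0], [3.0, 4.0])
z_cat = concat_fusion([1.0, 2.0, 3.0], [4.0, 5.0])
```

The add result keeps the original dimension but holds complex values, while the concat result grows to p + q real-valued dimensions.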
Further, the feature fusion device may also encapsulate mid-level fusion and back-end fusion in addition to front-end fusion.
(4) Randomly divide the fused feature data set into an enterprise feature data set for training and an enterprise feature data set for testing, where the ratio of the number of data items in the training set to that in the testing set is between 1:1 and 4:1.
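A random split at a chosen training-to-testing ratio might look like the following sketch. The function name, the 4:1 default and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_dataset(rows, train_parts=4, test_parts=1, seed=42):
    """Shuffle the rows and split them train_parts:test_parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]


train_rows, test_rows = split_dataset(range(100))
```

Fixing the seed makes the split reproducible, which helps when several candidate models are compared on the same testing set.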
(5) Input the training enterprise feature data set into the algorithm strategy module to obtain an early warning identification model, input the testing enterprise feature data set into the early warning identification model to obtain the outward migration probability of each testing enterprise, and calculate the evaluation value of the obtained early warning identification model.
The algorithm strategy module trains a plurality of base learning models using K-fold cross validation and then fuses the outputs of the base learning models; the fused result is the predicted enterprise outward migration probability. Specifically, the algorithm strategy module divides the received feature data set into K equal parts, preferably K = 10. Each weak learner is trained on K-1 parts and the remaining part is used as the test set; the prediction results of all the weak learners then form the training set that is input to the fusion model. The weak learner may be a random forest model, a decision tree model, a support vector machine model, or a deep neural network model. In the early warning identification method, weak learners such as logistic regression, LightGBM, random forest and neural network models are each trained on K-1 parts, traversing the K folds in turn; logistic regression is selected in the fusion learning layer, and the prediction results of the weak learner models trained over the K folds are used as the input of the second-layer fusion model. In general, K is a positive integer with K ≥ 3.
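The two-layer scheme above can be sketched in plain Python as follows. Everything named here is a hypothetical stand-in: `TinyLogReg` is a minimal logistic regression used both as the base learners and as the fusion model, whereas a real system would plug in models such as LightGBM or a random forest as base learners. Refitting the base learners on all data before inference is also an assumption, as the text does not describe that step.

```python
import math
import random


class TinyLogReg:
    """Minimal logistic regression trained by batch gradient descent (stand-in learner)."""

    def __init__(self, lr=0.5, epochs=300):
        self.lr, self.epochs = lr, epochs

    def _prob(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(self.epochs):
            grad_w, grad_b = [0.0] * len(self.w), 0.0
            for xi, yi in zip(X, y):
                err = self._prob(xi) - yi
                grad_b += err
                for j, v in enumerate(xi):
                    grad_w[j] += err * v
            self.w = [w - self.lr * g / len(X) for w, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / len(X)
        return self

    def predict_proba(self, X):
        return [self._prob(x) for x in X]


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def fit_stacking(X, y, base_factories, k=10):
    """Layer 1: out-of-fold predictions per base learner. Layer 2: logistic regression."""
    oof = [[0.0] * len(base_factories) for _ in X]
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        for m, make in enumerate(base_factories):
            model = make().fit(X_tr, y_tr)
            for i, p in zip(fold, model.predict_proba([X[i] for i in fold])):
                oof[i][m] = p
    meta = TinyLogReg().fit(oof, y)
    bases = [make().fit(X, y) for make in base_factories]  # refit on all data (assumption)
    return bases, meta


def predict_stacking(bases, meta, X):
    level1 = list(zip(*(b.predict_proba(X) for b in bases)))
    return meta.predict_proba(level1)


# Toy data (hypothetical): one attribute, positive label when the value is in the upper half.
X = [[i / 39] for i in range(40)]
y = [int(i >= 20) for i in range(40)]
factories = [lambda: TinyLogReg(lr=0.5, epochs=200), lambda: TinyLogReg(lr=0.1, epochs=50)]
bases, meta = fit_stacking(X, y, factories, k=10)
probs = predict_stacking(bases, meta, [[0.0], [1.0]])
```

Because every training row receives its first-layer prediction from a model that never saw that row, the fusion model is fitted on predictions that resemble what it will see at inference time.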
The calculation formula of the evaluation value of the early warning identification model is as follows:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where F1 is the evaluation value, the harmonic mean of Precision and Recall; Precision is the proportion of true positive samples among the samples that the early warning identification model predicts to be positive, and Recall is the proportion of true positive samples among the samples that are actually positive.
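As a small worked example of the evaluation value, the sketch below computes Precision, Recall and F1 from binary labels. The function name and the convention of returning 0 when a denominator is zero are assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks an enterprise that migrated outward."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two true positives, one false positive, one false negative.
f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Here Precision and Recall are both 2/3, so their harmonic mean F1 is also 2/3.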
(6) Select an early warning identification model with a good evaluation value, and input the enterprise feature data requiring early warning identification into the selected model to obtain an output value, namely the enterprise outward migration probability.
The output value ranges from 0 to 1. When the output value is less than 0.25, the enterprise outward migration risk is low; when it is greater than or equal to 0.25 and less than 0.5, the risk is medium; when it is greater than or equal to 0.5 and less than 0.75, the risk is medium-high; and when it is greater than or equal to 0.75 and less than 1, the risk is very high.
Output value example results are as follows:
| Enterprise code | Probability of outward migration | Probability of no outward migration | Label |
| --- | --- | --- | --- |
| 10**21 | 0.121 | 0.879 | Low outward migration risk |
| 10**32 | 0.622 | 0.378 | Medium-high outward migration risk |
| 10**13 | 0.11 | 0.89 | Low outward migration risk |
| 10**44 | 0.21 | 0.79 | Low outward migration risk |
| 10**15 | 0.961 | 0.039 | Very high outward migration risk |
| 10**75 | 0.101 | 0.899 | Low outward migration risk |
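The mapping from output value to risk label can be sketched as a simple threshold function. The function name and label strings are illustrative; the sketch assumes the four bands are low, medium, medium-high and very high, as the example labels suggest, and it treats an output of exactly 1 as very high, which the text leaves unspecified.

```python
def risk_label(probability):
    """Map an outward migration probability in [0, 1] to a risk band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < 0.25:
        return "low"
    if probability < 0.5:
        return "medium"
    if probability < 0.75:
        return "medium-high"
    return "very high"  # includes exactly 1.0 (assumption)


labels = [risk_label(p) for p in (0.121, 0.622, 0.11, 0.21, 0.961, 0.101)]
```

Applied to the six example probabilities, the function yields four low labels, one medium-high label and one very high label.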
According to the method for early warning and identifying enterprise outward migration, features suitable for the early warning identification model are screened out of complex enterprise samples objectively by correlation analysis, which avoids the subjectivity and limitations of manual screening and yields high accuracy, good robustness and excellent overall generalization performance;
by providing the feature selector, the most discriminative information is obtained from the related original feature sets and the redundant information produced by correlation among different feature sets is eliminated, which improves model performance; feature diversity can be further improved, and the system remains robust even when a system component is not invoked or is missing;
by providing the feature fusion device, the mapped sample set remains well separable while the computation time for finding the optimal solution by gradient descent is shortened, so that subsequent computation improves model precision and reduces precision loss;
by using a model fusion method to predict enterprise outward migration, the accuracy is higher, the model robustness is better, and the overall generalization performance is better;
the feature collector, the feature generator, the feature extractor, the feature fusion device and the algorithm strategy module are system components that work independently and run without interfering with one another.
Referring to fig. 1 to 7, the system for early warning and identifying the enterprise migration of the present invention includes:
a feature collector: used for acquiring, cleaning and standardizing the enterprise industrial and commercial information data requiring early warning identification, the operators' enterprise portrait data, and the index data related to enterprise support policies at the enterprises' locations;
a feature selector: used for selecting important features from the features output by the feature collector; it is constructed to pick out the important features, eliminate the curse of dimensionality, improve model training efficiency and reduce the risk of overfitting.
A feature fusion device: used for fusing the features output by the feature collector and the features output by the feature selector; it is constructed to make multiple features complement one another so that the early warning identification model obtains more robust and more accurate identification results.
An algorithm strategy module: used for training a plurality of weak learners and constructing the early warning identification model by fusing the output results of the weak learners; it is constructed to improve the accuracy and precision of the algorithm at the model level.
The output end of the feature collector is connected with the input end of the feature selector, the output end of the feature selector is connected with the input end of the feature fusion device, and the output end of the feature fusion device is connected with the input end of the algorithm strategy module.
In the system for early warning and identifying enterprise outward migration, the feature collector, the feature selector, the feature fusion device and the algorithm strategy module cooperate with one another to form a complete, portable system that can effectively support early warning and identification of enterprise outward migration risk.
Further, the system also includes:
a feature generator: used for creating new data features from the data output by the feature collector by means of model label prediction; it can further process the feature data set output by the feature collector.
A feature extractor: used for extracting features from the text data in the spliced original feature data set.
The output end of the feature collector is connected with the input end of the feature generator and the input end of the feature extractor respectively, and the output end of the feature generator and the output end of the feature extractor are both connected with the input end of the feature selector.
The device for early warning and identifying the enterprise migration comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the method.
The above description is only exemplary of the invention, and any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention should be considered within the scope of the present invention.

Claims (10)

1. A method for early warning and identifying the enterprise external migration is characterized by comprising the following steps:
(1) Acquiring qualified enterprise and business information data, operator data of the enterprise and enterprise supporting policy related index data of the location of the enterprise according to specified requirements by using a feature collector, matching, splicing, cleaning and standardizing to obtain an enterprise feature data set, and matching the existing ex-situ enterprise example with the enterprise feature data set;
(2) Selecting important feature items from all feature items of the enterprise feature data set by using a feature selector to generate an important feature item subset;
(3) Fusing the important feature item subsets by using a feature fusion device to obtain a fusion feature data set;
(4) Randomly dividing the fusion characteristic data set into an enterprise characteristic data set for training and an enterprise characteristic data set for testing;
(5) Inputting the training enterprise characteristic data set into an algorithm strategy module to obtain an early warning identification model, inputting the testing enterprise characteristic data set into the early warning identification model to obtain an external transition probability set of each testing enterprise, and calculating an evaluation value of the obtained early warning identification model;
(6) And selecting the early warning identification model with good evaluation value obtained by training, and inputting the enterprise characteristic data needing early warning identification into the selected early warning identification model to obtain an output value, namely the enterprise migratory probability.
2. The method for early warning and identifying the enterprise outside migration according to claim 1, wherein the step (1) comprises the following steps:
(1.1) starting a characteristic collector, calling a multi-source data interface, and acquiring enterprise codes, enterprise names, unified social credit codes, enterprise types, latest industry types of enterprises, registered capital, real payment capital, operation duration, operation range, social security number, external investment times, tax payment A-level and other dimensional index data of national enterprises from enterprise business information data;
(1.2) acquiring dimension index data such as certificate numbers, fixed telephone installation numbers, broadband installation numbers, last-month fixed telephone installation numbers, last-month broadband installation numbers, new fixed telephone installation numbers, new broadband installation numbers, new fixed telephone dismantling numbers, broadband dismantling numbers, fixed telephone moving numbers, broadband moving numbers and the like of enterprise storage clients from operator data;
(1.3) acquiring index data related to enterprise support policies of the locations of the enterprises from a policy data source, matching the enterprise portrait data of the enterprises and the enterprise portrait data of the operators according to social unified credit codes, and matching the index data related to the enterprise support policies of the locations of the enterprises according to the locations of the enterprises to obtain a spliced original feature data set;
(1.4) cleaning and standardizing the spliced original characteristic data set to obtain an enterprise characteristic data set;
and (1.5) matching the existing ex-transit enterprise examples with the enterprise characteristic data set, wherein the successfully matched enterprises are marked as positive examples, and the unsuccessfully matched enterprises are marked as negative examples.
3. The method for early warning and identifying the enterprise outside migration according to claim 2, wherein the step (1) further comprises the following steps:
(1.6) carrying out time processing based on manual experience, data binning processing, combined classification processing and Feature Tools module calling based on automatic integration on the original Feature data set by using a Feature generator;
and (1.7) carrying out One-Hot Encoding processing, LDA processing, cosine similarity processing or neural network processing on the original feature data set by using a feature extractor.
4. The method for early warning identification of enterprise relocation according to claim 1, wherein the step (2) comprises the following steps:
(2.1) selecting an optimal feature item from all feature items of the enterprise feature data set as a selected set;
(2.2) selecting an optimal feature item from the feature items in the enterprise feature data set except the selected set, adding the feature item into the selected set, and calculating the gain value of the selected set at the moment;
(2.3) evaluating the gain value of the selected set, and if the gain value of the selected set is not the maximum, repeating the step (2.2);
and (2.4) combining the data corresponding to each feature item in the enterprise feature data set and the selected set to generate an important feature item subset.
5. The method for early warning and identifying the enterprise outside migration according to claim 4, wherein in the step (2.2), the gain value is calculated according to the formula:
$$Gain(A) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$
wherein Gain(A) is the gain value of the selected set, Ent(D) is the information entropy of the enterprise feature data set, D is the enterprise feature data set, V is the number of subsets of the enterprise feature data set, and $D^v$ is the v-th subset of the enterprise feature data set;
the formula of the information entropy Ent(D) is:
$$Ent(D) = -\sum_{k=1}^{n} p_k \log_2 p_k$$
wherein $p_k$ is the proportion of samples of the k-th class in the enterprise feature data set, and n is the number of sample classes in the enterprise feature data set.
6. The method for early warning and identifying the enterprise outside the enterprise according to claim 1, wherein in the step (5), the calculation formula of the evaluation value of the early warning and identifying model is:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
wherein F1 is the evaluation value, Precision is the proportion of true positive samples among the samples that the early warning identification model predicts to be positive, and Recall is the proportion of true positive samples among the samples that are actually positive.
7. The method for early warning and identifying the enterprise migration according to claim 1, wherein in the step (6), the output value ranges from 0 to 1; when the output value is less than 0.25, the enterprise migration risk is low risk; when the output value is greater than or equal to 0.25 and less than 0.5, the enterprise migration risk is medium risk; when the output value is greater than or equal to 0.5 and less than 0.75, the enterprise migration risk is medium-high risk; and when the output value is greater than or equal to 0.75 and less than 1, the enterprise migration risk is very high risk.
8. The system for early warning identification of enterprise relocation as claimed in claim 1, comprising:
a feature collector: used for acquiring, cleaning and standardizing the enterprise industrial and commercial information data requiring early warning identification, the operators' enterprise portrait data, and the index data related to enterprise support policies at the enterprises' locations;
a feature selector: used for selecting important features from the features output by the feature collector;
a feature fusion device: used for fusing the features output by the feature collector and the features output by the feature selector;
an algorithm strategy module: used for training a plurality of weak learners and constructing an early warning identification model by fusing the output results of the weak learners;
the output end of the feature collector is connected with the input end of the feature selector, the output end of the feature selector is connected with the input end of the feature fusion device, and the output end of the feature fusion device is connected with the input end of the algorithm strategy module.
9. The system for early warning identification of enterprise relocation as claimed in claim 8, further comprising:
a feature generator: used for creating new data features from the data output by the feature collector by means of model label prediction;
a feature extractor: used for extracting features from the text data in the data output by the feature generator;
the output end of the feature collector is connected with the input end of the feature generator and the input end of the feature extractor respectively, and the output end of the feature generator and the output end of the feature extractor are both connected with the input end of the feature selector.
10. An apparatus for early warning of enterprise migration according to claims 8-9, comprising a memory and a processor, wherein the memory stores a computer program, and wherein the processor implements the steps of the method according to any of claims 1-7 when executing the computer program.
CN202211083680.9A 2022-09-06 2022-09-06 Early warning and identification method, system and device for enterprise ex-situ migration Pending CN115358481A (en)

Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211083680.9A | 2022-09-06 | 2022-09-06 | Early warning and identification method, system and device for enterprise ex-situ migration |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211083680.9A | 2022-09-06 | 2022-09-06 | Early warning and identification method, system and device for enterprise ex-situ migration |

Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN115358481A (en) | 2022-11-18 |

Family

ID=84007281

Family Applications (1)

| Application Number | Status | Publication | Priority Date | Filing Date | Title |
| --- | --- | --- | --- | --- | --- |
| CN202211083680.9A | Pending | CN115358481A (en) | 2022-09-06 | 2022-09-06 | Early warning and identification method, system and device for enterprise ex-situ migration |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN115358481A (en) |

Cited By (2)

* Cited by examiner, † Cited by third party
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN115660796A (en) * | 2022-12-09 | 2023-01-31 | 北京中科闻歌科技股份有限公司 | Tax fund management method, device, equipment and storage medium for migration risk enterprise |
| CN116739395A (en) * | 2023-08-15 | 2023-09-12 | 浙江同信企业征信服务有限公司 | Enterprise outward migration prediction method, device, equipment and storage medium |

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination