CN115271442A

CN115271442A - Modeling method and system for evaluating enterprise growth based on natural language

Info

Publication number: CN115271442A
Application number: CN202210896607.7A
Authority: CN
Inventors: 杨献祥; 聂志华; 程光剑; 徐杰
Original assignee: Jiangxi Intelligent Industry Technology Innovation Research Institute
Current assignee: Jiangxi Intelligent Industry Technology Innovation Research Institute
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2022-11-01

Abstract

The invention discloses a modeling method and a system for evaluating enterprise growth based on natural language. The method comprises the following steps: constructing an enterprise index system, wherein the index system comprises a structured index system and index data extracted from text contents through natural language processing technology; analyzing the index data through technologies such as 5sigma and binning, determining abnormal values in the index data, and refining the abnormal values; carrying out weight calculation on the index data after normalization processing; screening the indexes according to the calculated weight result and the correlation degree between the indexes; carrying out weight calculation on the screened indexes again, and calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights; and normalizing the value of the weighted sum, wherein the normalized result is the growth score of the enterprise. The beneficial effect of this application is that it addresses the poor data quality, insufficient data dimensionality and low adaptability of current methods for evaluating the growth of small and medium-sized enterprises.

Description

Modeling method and system for evaluating enterprise growth based on natural language

Technical Field

The invention relates to the technical field of data processing, in particular to a modeling method and a modeling system for evaluating enterprise growth based on natural language.

Background

The growth of enterprises is an important criterion by which investors estimate the potential of a project, and high growth attracts the attention of a large number of investors. In the broader market environment, many innovative, entrepreneurial and growth-oriented enterprises are newly established every year, and the financial market provides financing platforms and channels for them. The growth of these companies has important research and reference value for development planning, for the investment decisions of investors, and for the orderly operation of financial markets.

At present, a method adopted by enterprise growth assessment, such as patent CN 113450009A, is mainly to establish an enterprise growth assessment system, establish average values of enterprise growth scores of different industries and different scales, and analyze relevant dimensional characteristics; performing financial valuation analysis on the enterprise according to financial data of the enterprise to obtain an expected valuation of the enterprise; and summarizing the enterprise growth evaluation and financial analysis results in a preset format to generate an enterprise growth evaluation report.

However, this method relies on a comprehensive calculation over financial indexes and is verified by hypothesis testing from probability theory. In practice, the financial data of most enterprises is difficult to obtain, especially for small and medium-sized enterprises: their financial data is coarse and their structured data has few dimensions, so the data quality is poor, the data dimensionality is insufficient, and the resulting enterprise growth assessment has low applicability.

Disclosure of Invention

Based on this, an object of the present invention is to provide a modeling method and system for evaluating enterprise growth based on natural language, which address the poor data quality, insufficient data dimensionality, and low adaptability of current methods for evaluating the growth of small and medium-sized enterprises.

In a first aspect, the present application provides a modeling method for evaluating enterprise growth based on natural language, the method including the steps of:

S11, constructing an enterprise index system, wherein the index system comprises a structured index system and index data extracted from text contents through natural language processing technology;

S12, analyzing the index data through technologies such as 5sigma and binning, determining abnormal values in the index data, and refining the abnormal values;

S13, carrying out weight calculation on the index data after normalization processing;

S14, screening the indexes according to the calculated weight result and the correlation degree among the indexes;

S15, carrying out weight calculation on the screened indexes again, and calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights;

S16, normalizing the value of the weighted sum, wherein the normalized result is the growth score of the enterprise;

wherein, in step S12, the step of refining the abnormal value includes:

S121, judging whether the abnormal value has corresponding index data within 3 years;

S122, if corresponding index data exists, filling the abnormal value with the average value of the last 3 years;

S123, if corresponding index data does not exist, deleting the abnormal value;

S124, normalizing the index data through MAX-MIN.

The modeling method for evaluating enterprise growth based on natural language has the following advantages: index data is extracted from text contents (enterprise news, business scope and registered capital information) through natural language processing technology, and financial and non-financial indexes are combined; the parameter indexes of the various growth properties of an enterprise are quantified through weight calculation, screening and normalization, and the growth score of the enterprise is then obtained. When the growth score of a small or medium-sized enterprise is calculated by this method, the dependence on financial data is lower, the data quality requirement is lower, and the reliability of the calculated growth score is higher.

Preferably, in the modeling method for evaluating enterprise growth based on natural language according to the present application, the step of extracting the index data extracted from the text content by using the natural language processing technology includes:

processing the labeled enterprise news information through a natural language processing technology to generate word vectors, wherein the labels comprise positive, negative, and neutral;

performing tendency classification prediction on news information by using a BERT model and combining an LSTM algorithm;

and integrating the predicted result and the training data, and counting the positive news percentage of each enterprise in a preset time period to serve as a first modeling index.

Preferably, in the modeling method for evaluating enterprise growth based on natural language according to the present application, the step of extracting the index data extracted from the text content by using the natural language processing technology further includes:

segmenting the business scope of the enterprise into words through a natural language processing technology and converting the words into word vectors;

counting the number of business scope items of the enterprise and the number of business scope changes, and taking the number of business scope items and the number of business scope changes as a second modeling index;

and taking the rate of increase or decrease of the registered capital of the enterprise after a registered capital change as a third modeling index.

Preferably, in the modeling method for evaluating enterprise growth based on natural language according to the present application, the step of screening the indicators according to the calculated weight result and the correlation between the indicators includes:

eliminating indexes with weight values smaller than a preset value according to the calculated weight result to obtain primary screening indexes;

calculating the correlation degree between every two primary screening indexes through a Pearson correlation coefficient;

and if the correlation degree is greater than the preset value, eliminating the index with smaller weight value to obtain a secondary screening index.

Preferably, in the modeling method for evaluating enterprise growth based on natural language according to the present application, after the step of calculating a weighted sum of the numerical values of the selected indexes and the corresponding weights, the method further includes:

verifying the growth score of the enterprise according to the calculation result of the regression algorithm;

if the enterprise growth score is within a preset range, executing step S16;

and if the growth score of the enterprise is out of the preset range, repeatedly screening indexes by a recursive feature elimination method until the growth score of the enterprise is in the preset range.

Preferably, in the modeling method for evaluating enterprise growth based on natural language according to the present application, after the step of normalizing the weighted sum value, the method further includes:

dividing enterprises into growth stages, namely a growth stage, a maturity stage and a decline stage, according to different scoring combination intervals;

training and evaluating on the growth scores of enterprises through a random forest algorithm, an XGBoost algorithm and an SVM algorithm, and determining the optimal score thresholds for dividing the growth stages according to the precision, recall, F1 value and AUC value;

and determining the growth stage of the enterprise according to the growth score.

Preferably, in the modeling method for evaluating enterprise growth based on natural language according to the present application, the step of training and evaluating on the growth scores of enterprises through the random forest algorithm, the XGBoost algorithm and the SVM algorithm, and determining the optimal score thresholds for dividing the growth stages according to the precision, recall, F1 value and AUC value, includes:

screening out the scoring combinations that meet preset requirements according to the precision, recall, F1 value and AUC value of the random forest algorithm and the XGBoost algorithm; the preset requirements are as follows: the precision is not less than 0.85, the recall is not less than 0.85, the F1 value is not less than 0.85, and the AUC value is not less than 0.9;

arranging the scoring combinations that meet the preset requirements in descending order of F1 value, and verifying them in sequence through the SVM algorithm;

and when any scoring combination meets the requirement of the verification, taking the current scoring combination as a scoring threshold value divided into the growth stages.
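The selection procedure above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the dictionary layout of a candidate, the representation of a scoring combination as a tuple of thresholds, and the `verify` callback (standing in for the SVM verification, whose acceptance criterion the text does not specify) are all assumptions made for the example.

```python
def select_threshold(candidates, verify):
    """Pick the first scoring combination that passes verification.

    candidates: list of dicts with keys 'thresholds', 'precision',
        'recall', 'f1' and 'auc' (hypothetical layout).
    verify: callable taking a scoring combination and returning True when
        the SVM-based check accepts it (hypothetical stand-in).
    """
    # Keep only combinations that meet the preset requirements from the text.
    passing = [
        c for c in candidates
        if c["precision"] >= 0.85 and c["recall"] >= 0.85
        and c["f1"] >= 0.85 and c["auc"] >= 0.9
    ]
    # Verify in descending order of F1 value and stop at the first success.
    for cand in sorted(passing, key=lambda c: c["f1"], reverse=True):
        if verify(cand["thresholds"]):
            return cand["thresholds"]
    return None  # no combination passed verification
```

Returning `None` when nothing passes is a design choice of this sketch; the text does not say what happens in that case.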

In a second aspect, the present application provides a modeling system for evaluating enterprise growth, the system comprising:

a system construction module: used for constructing an enterprise index system, wherein the index system comprises a structured index system and index data extracted from text contents through natural language processing technology;

a data sorting module: used for analyzing the index data through technologies such as 5sigma and binning, determining abnormal values in the index data, and refining the abnormal values;

wherein the data sorting module includes:

an abnormal value query unit: used for judging whether the abnormal value has corresponding index data within 3 years;

a filling unit: used for filling the abnormal value with the average value of the last 3 years when corresponding index data exists;

a deletion unit: used for deleting the abnormal value when corresponding index data does not exist;

a normalization unit: used for normalizing the index data through MAX-MIN;

a first calculation module: used for carrying out weight calculation on the index data after the normalization processing;

an index screening module: used for screening the indexes according to the calculated weight result and the correlation degree between the indexes;

a second calculation module: used for carrying out weight calculation on the screened indexes again and calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights;

a scoring module: used for normalizing the value of the weighted sum, wherein the normalized result is the growth score of the enterprise.

Preferably, the system building module specifically includes:

an information processing unit: used for processing the labeled enterprise news information through a natural language processing technology to generate word vectors, wherein the labels comprise positive, negative, and neutral;

a classification prediction unit: used for carrying out tendency classification prediction on news information through a BERT model in combination with an LSTM algorithm;

a first index modeling unit: used for integrating the predicted result and the training data, and counting the proportion of positive news of each enterprise in a preset time period to serve as a first modeling index.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of a modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for extracting index data from text contents by using a natural language processing technique in the modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for screening indexes according to the calculated weight result and the correlation between the indexes in the modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for verifying the enterprise growth score in the modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention;

FIG. 5 is a flowchart of a recursive feature elimination method in the modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention;

FIG. 6 is a flowchart of a modeling method for evaluating enterprise growth based on natural language according to a second embodiment of the present invention;

FIG. 7 is an ROC graph of a multi-classification model in the enterprise growth assessment modeling method according to the second embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a modeling system for evaluating enterprise growth according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by one of ordinary skill in the art that the embodiments described herein may be combined with other embodiments without conflict.

Unless otherwise defined, technical or scientific terms used herein have the meaning commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the" and similar referents used in describing the invention do not denote a limitation of quantity and may indicate either the singular or the plural. The terms "including," "comprising," "having," and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" means two or more. "And/or" describes an association between associated objects and covers three cases; for example, "A and/or B" may indicate that A exists alone, that A and B exist simultaneously, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it. The terms "first," "second," "third," and the like merely distinguish similar objects and do not denote a particular ordering of the objects.

The method adopted by enterprise growth evaluation at present mainly comprises the steps of establishing an enterprise growth evaluation system, establishing enterprise growth scoring mean values of different industries and different scales, and analyzing related dimensional characteristics; then, performing financial valuation analysis on the enterprise according to financial data of the enterprise to obtain an expected valuation of the enterprise; and summarizing the enterprise growth evaluation and financial analysis results in a preset format to generate an enterprise growth evaluation report.

Therefore, the invention provides a modeling method and a system for evaluating the growth of enterprises based on natural language, so as to improve the reliability of the growth scoring for small and medium-sized enterprises.

Referring to fig. 1, a modeling method for evaluating enterprise growth performance based on natural language according to a first embodiment of the present invention includes the following steps:

Step S11, constructing an enterprise index system.

The index system comprises a structured index system and index data extracted from text contents through natural language processing technology. The text contents are specifically text data of small and medium-sized enterprises, such as enterprise news, business scope and registered capital information, and can be collected from the World Wide Web through web crawler technology.

Step S12, analyzing the index data through technologies such as 5sigma and binning, determining abnormal values in the index data, and refining the abnormal values.

Specifically, the step of refining the abnormal value includes:

Step S121, judging whether the abnormal value has corresponding index data within 3 years.

Step S122, if corresponding index data exists, filling the abnormal value with the average value of the last 3 years.

Step S123, if corresponding index data does not exist, deleting the abnormal value.

It can be understood that null values may arise because the acquired data is incomplete; filling a null value with the enterprise's average value of that index over the last three years minimizes its influence.

Step S124, normalizing the index data through MAX-MIN.

In the embodiment of the invention, the purpose of refining the index data is to handle abnormal data in the acquired index data, such as null values and outliers. In addition, different indexes differ greatly in scale: growth rates, for example, generally lie between 0 and 1, whereas the absolute values of some other indexes may be very large, so that indexes with large values would dominate the result, which is unfair to the other indexes. Normalizing the index data during preprocessing balances the influence of the different indexes.
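As an illustration of steps S121 to S124, the following minimal sketch flags outliers with a k-sigma rule, fills a flagged value from the preceding three years, and applies MAX-MIN normalization. The function names, the per-year dictionary layout, and the choice of the three years before the affected year as the window are assumptions of the example; binning is not shown.

```python
from statistics import mean, pstdev


def sigma_outliers(values, k):
    """Return the positions of values further than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a constant column has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


def refill(history, year, window=3):
    """Mean of the same index over the `window` years before `year`.

    history: dict mapping year -> value for one enterprise and one index
        (hypothetical layout). Returns None when no data exists in the
        window, in which case the abnormal value is deleted (S123).
    """
    recent = [history[y] for y in range(year - window, year) if y in history]
    return mean(recent) if recent else None


def min_max(values):
    """MAX-MIN normalization to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]
```

Note that in a small sample no value can lie many standard deviations from the mean, so the multiplier `k` has to be chosen with the sample size in mind.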

Step S13, performing weight calculation on the index data after the normalization processing.

In the embodiment of the present invention, the weights of the normalized index data are calculated by a combined objective weighting method (the CRITIC-entropy weight method). This method determines the weights mathematically from the relationships among the preprocessed index data; the result does not depend on subjective human judgment and has a strong mathematical basis. By way of example and not limitation, the combined objective weighting method is only a preferred calculation method in the embodiment of the present invention, and the application does not specifically limit the weight calculation method for the preprocessed index data.
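A minimal sketch of one way to compute such combined weights is given below. The entropy weight and CRITIC formulas are the commonly used textbook forms; how the two weight vectors are combined is not stated in the text, so the arithmetic mean used in `combined_weights` is purely an assumption, as are all function names. Each column holds the normalized values of one index across enterprises.

```python
import math
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def _normalize(raw):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else [1.0 / len(raw)] * len(raw)


def entropy_weights(cols):
    """Entropy weight method: indexes with more dispersed values weigh more."""
    n = len(cols[0])
    divergence = []
    for col in cols:
        total = sum(col)
        probs = [v / total for v in col] if total > 0 else [1.0 / n] * n
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergence.append(1.0 - entropy)
    return _normalize(divergence)


def critic_weights(cols):
    """CRITIC: standard deviation times conflict with the other indexes."""
    info = []
    for j, col in enumerate(cols):
        conflict = sum(1.0 - pearson(col, other)
                       for k, other in enumerate(cols) if k != j)
        info.append(pstdev(col) * conflict)
    return _normalize(info)


def combined_weights(cols):
    """Assumed combination rule: arithmetic mean of the two weight vectors."""
    return [(e + c) / 2 for e, c in zip(entropy_weights(cols), critic_weights(cols))]
```

Other combination rules, such as a normalized product of the two vectors, would fit the description equally well.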

Step S14, screening the indexes according to the calculated weight result and the correlation degree between the indexes.

By analyzing the index weights, indexes with a low weight value and high similarity are screened out, so that the remaining indexes are more representative.

Step S15, carrying out weight calculation on the screened indexes again, and calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights.

Step S16, normalizing the value of the weighted sum, wherein the normalized result is the growth score of the enterprise.
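Steps S15 and S16 can be illustrated with the short sketch below. The text does not say which normalization is applied to the weighted sum; MAX-MIN scaling across enterprises is assumed here, and the function name and data layout are hypothetical.

```python
def growth_scores(rows, weights):
    """Weighted sum per enterprise, then MAX-MIN scaling to [0, 1].

    rows: one list of normalized index values per enterprise.
    weights: one weight per screened index, in the same order.
    """
    raw = [sum(w * v for w, v in zip(weights, row)) for row in rows]
    lo, hi = min(raw), max(raw)
    # Assumed normalization: the best enterprise scores 1.0, the worst 0.0.
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in raw]
```

With this choice the score is relative to the set of enterprises being evaluated, so adding or removing enterprises can change every score.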

In summary, the modeling method for evaluating enterprise growth based on natural language provided by the invention extracts index data from text contents (enterprise news, business scope and registered capital information) through natural language processing technology, cleans the data by handling abnormal values and filling missing values, performs weight calculation on the index data, and screens out indexes with high similarity and low weight, so that the retained indexes reliably support the evaluation of enterprise growth. Calculating the weighted sum of the values of the screened indexes and their corresponding weights and normalizing the result quantifies the growth of the enterprise. In this way, the influence of abnormal index data on the final result is removed, the weighted-sum and normalization scoring makes the calculated enterprise growth score more accurate, and extracting index data through natural language processing technology facilitates the evaluation of small and medium-sized enterprises, so that the growth evaluation result is more reliable.

Preferably, please refer to fig. 2, which is a flowchart illustrating a method for extracting index data extracted from text content by using a natural language processing technique in a modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention, the method specifically includes the following steps:

Step S21, processing the labeled enterprise news information through a natural language processing technology to generate word vectors.

The labels include positive, negative, and neutral.

Step S22, performing tendency classification prediction on the news information by using a BERT model in combination with an LSTM algorithm.

The BERT model is an unsupervised NLP pre-training model whose structure is the encoder part of the Transformer; each block mainly comprises multi-head self-attention, layer normalization (Norm), residual connections and a feed-forward network. Applying it to a specific task mainly comprises two stages: model pre-training and model fine-tuning. In the pre-training stage, because the number of model parameters is huge, usually tens or hundreds of millions, a large amount of training data and a long training time are needed; the model was developed and pre-trained by Google, so only the source data set needs to be crawled or obtained. In the fine-tuning stage, the model is fine-tuned for the specific task.

The LSTM (Long Short-Term Memory) algorithm is a deep learning method and a specific form of RNN (recurrent neural network). The LSTM adds an input gate, a forget gate and an output gate that change the weight of the self-loop, so that the weight given to index data at different time steps can change dynamically while the model parameters stay fixed, which alleviates the problem of vanishing or exploding gradients.

Step S23, integrating the predicted result and the training data, and counting the proportion of positive news of each enterprise in a preset time period to serve as a first modeling index.

Step S24, segmenting the business scope of the enterprise into words through a natural language processing technology and converting the words into word vectors.

Step S25, counting the number of business scope items of the enterprise and the number of business scope changes, and taking the number of business scope items and the number of business scope changes as a second modeling index.

Step S26, taking the rate of increase or decrease of the registered capital of the enterprise after a registered capital change as a third modeling index.
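Once the news items have been classified and the business scope has been segmented, the three text-derived indexes reduce to simple counts and ratios. The sketch below assumes that the labels and the segmented scope items are already available as plain Python lists; the classification and word segmentation themselves are not shown, and all function names are hypothetical.

```python
from collections import Counter


def positive_news_ratio(labels):
    """First modeling index: share of news items labeled 'positive'.

    labels: predicted or annotated labels for one enterprise in the
    chosen period. Returns None when the enterprise has no news.
    """
    if not labels:
        return None
    return Counter(labels)["positive"] / len(labels)


def scope_counts(scope_history):
    """Second modeling index: (number of scope items, number of scope changes).

    scope_history: chronological list of business scopes, each a list of
    already segmented scope items (assumed input format).
    """
    changes = sum(
        1 for prev, cur in zip(scope_history, scope_history[1:])
        if set(prev) != set(cur)
    )
    return len(set(scope_history[-1])), changes


def capital_change_rate(before, after):
    """Third modeling index: relative change in registered capital.

    Positive for an increase, negative for a decrease; assumes before > 0.
    """
    return (after - before) / before
```

For example, an enterprise whose registered capital rises from 100 to 150 has a change rate of 0.5.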

By way of example and not limitation, in the embodiment of the present invention, besides the first modeling index, the second modeling index, and the third modeling index, financial and non-financial indexes of the enterprise are used as modeling indexes, and the financial and non-financial indexes can be obtained from data such as a balance sheet, a profit sheet, a cash flow sheet, and a yearbook file disclosed by the enterprise.

In conclusion, the modeling indexes are obtained through the above method, and the structured indexes and the modeling indexes extracted from text information are combined into one index system. This effectively addresses the insufficient data dimensionality and low data reliability of small and medium-sized enterprises, and markedly improves the reliability of the data sources used to evaluate their growth.

Preferably, please refer to fig. 3, which is a flowchart illustrating a method for screening indexes according to a calculated weight result and a correlation between the indexes in a modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention, the method specifically includes the following steps:

Step S31, eliminating indexes with weight values smaller than a preset value according to the calculated weight results to obtain primary screening indexes.

In the embodiment of the invention, indexes with extremely small weight values are eliminated by judging whether the weight value calculated for each index is smaller than the preset value, so that the primary screening indexes are obtained.

Step S32, calculating the correlation degree between every two primary screening indexes through the Pearson correlation coefficient.

Step S33, if the correlation degree is larger than a preset value, eliminating the index with the smaller weight value to obtain secondary screening indexes.

It can be understood that, to prevent highly similar indexes from distorting the finally calculated growth score, the embodiment of the present invention calculates the correlation between every two indexes with the Pearson correlation coefficient. For example, if the correlation between index 1 (weight value 0.056) and index 2 (weight value 0.034) is 0.68 and the preset correlation threshold is 0.6, the correlation between index 1 and index 2 is judged to be strong; the weight values of the two indexes are then compared, index 1 with the larger weight value is retained, and index 2 with the smaller weight value is eliminated. As another example, if the correlation between index 1 (weight value 0.056) and index 3 (weight value 0.052) is 0.12, both indexes are retained without further processing.
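The two-stage screening of steps S31 to S33 can be sketched as below. The greedy order (visiting indexes from the largest weight down, and comparing each only with indexes already kept) is one possible reading of "eliminate the index with the smaller weight" and is an assumption, as are the function names and the dictionary layout. The correlation is compared as a signed value, following the text.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def screen(columns, weights, min_weight, max_corr):
    """Return the names of the indexes that survive both screening passes.

    columns: dict mapping index name -> list of values (hypothetical layout).
    weights: dict mapping index name -> weight from the first weight calculation.
    """
    # Primary screening: drop indexes whose weight is below the preset value.
    primary = [name for name in columns if weights[name] >= min_weight]
    # Secondary screening: visit heavier indexes first so that, within a
    # strongly correlated pair, the index with the larger weight survives.
    primary.sort(key=lambda name: weights[name], reverse=True)
    kept = []
    for name in primary:
        if all(pearson(columns[name], columns[k]) <= max_corr for k in kept):
            kept.append(name)
    return kept
```

The result is ordered by descending weight, which is convenient when the weights are recalculated on the screened indexes.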

Preferably, please refer to fig. 4, which is a flowchart illustrating a method for verifying an enterprise growth rating score in a modeling method for evaluating enterprise growth based on natural language according to an embodiment of the present invention, the method specifically includes the following steps:

Step S41, verifying the growth score of the enterprise according to the calculation result of a regression algorithm.

In the embodiment of the present invention, the regression algorithm may be either a random forest regression algorithm or a ridge regression algorithm for result verification, which is not specifically limited in the present application.

Step S42, if the growth score of the enterprise is within the preset range, executing step S16.

Step S43, if the growth score of the enterprise is outside the preset range, repeatedly screening the indexes by a recursive feature elimination method until the growth score of the enterprise is within the preset range.

Specific examples include: the indexes are first analyzed and screened according to the actual business. For example, most companies in the automobile and automobile distribution industry have outlets, whereas enterprises in the food and beverage industry do not, so the outlet ratio index is included when modeling automobile and automobile distribution enterprises and omitted for food and beverage enterprises. The model is then trained and verified using a regression algorithm, and if the verification result does not meet the following requirements, the indexes are screened again by the recursive feature elimination method:

MAE ≤ 0.2, MSE ≤ 0.15, RMSE ≤ 0.15, R² > 0.9

MAE (mean absolute error):

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - f(x_i)\right|$$

wherein y_i indicates the true value, f(x_i) indicates the predicted value, and m is the number of samples. The mean absolute error is the average of the absolute differences between the true values and the predicted values.

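MSE (mean square error):

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - f(x_i)\right)^2$$

wherein y_i indicates the true value, f(x_i) indicates the predicted value, and m is the number of samples. The mean square error is the average of the squared differences between the true values and the predicted values.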

RMSE (root mean square error):

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - f(x_i)\right)^2}$$

The root mean square error is the square root of the mean square error. Its main function is to bring the error to the same order of magnitude as the sample target variable, so that the data are described better.

R² (coefficient of determination):

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - f(x_i)\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}$$

wherein y_i indicates the true value, f(x_i) indicates the predicted value, and \bar{y} is the average of the true values.

The coefficient of determination is a standardized evaluation measure whose value generally lies between 0 and 1; the closer the value is to 1, the better the model, and the closer it is to 0, the worse the model. A value equal to 0 means the model performs no better than predicting the average of the true values for every sample. Negative values are also possible, and a negative value indicates that the model performs very poorly.
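The four verification metrics can be computed together in a few lines. The sketch below uses the standard definitions of MAE, MSE, RMSE and R²; the function name `regression_metrics` is hypothetical and the code assumes the true values are not all identical (otherwise the R² denominator is zero).

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true and predicted values."""
    m = len(y_true)
    errors = [y - f for y, f in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / m
    mse = sum(e * e for e in errors) / m
    rmse = sqrt(mse)
    mean_y = sum(y_true) / m
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
```

Predicting the average of the true values for every sample yields R² = 0, which matches the interpretation given above.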

It should be noted that the recursive feature elimination method is a wrapper-type feature selection method. Its principle is as follows: after a base model is trained on the data set, the feature with the smallest weight is removed from the feature set according to the obtained feature weights; the base model (random forest regression and lasso regression) is then retrained using the retained feature set; this process is repeated until no more features can be removed, and a feature ranking list is finally obtained. Fig. 5 is a flow chart of the recursive feature elimination method. As features are removed, the intermediate feature sets are recorded, so that a set of nested feature subsets can be obtained:

$$F_1 \supset F_2 \supset \cdots \supset F_N$$

where N represents the total number of features. The nested feature subsets are then evaluated, taking the precision of the base model as the evaluation standard, to obtain the optimal feature subset.
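The elimination loop described above can be sketched generically. In this illustration the base model is abstracted behind two hypothetical callbacks: `fit_weights` retrains the base model on the retained features and returns one weight per feature, and `score_subset` returns the base model's precision on a subset. Both are stand-ins for the random forest or lasso models named in the text, not part of any specific library.

```python
def recursive_feature_elimination(features, fit_weights, score_subset):
    """Repeatedly drop the lowest-weight feature and record each nested subset.

    fit_weights(list_of_features) -> dict feature -> weight   (hypothetical callback)
    score_subset(tuple_of_features) -> float                  (hypothetical callback)
    Returns (nested_subsets, best_subset).
    """
    remaining = list(features)
    subsets = [tuple(remaining)]              # start from the full feature set
    while len(remaining) > 1:
        weights = fit_weights(remaining)      # retrain on the retained features
        weakest = min(remaining, key=lambda f: weights[f])
        remaining.remove(weakest)             # eliminate the least important feature
        subsets.append(tuple(remaining))      # record the intermediate subset
    best = max(subsets, key=score_subset)     # pick the subset the base model scores best
    return subsets, best
```

Retraining inside the loop matters because feature weights can change once a correlated feature has been removed; ranking the features only once would not be recursive feature elimination.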

This method of verifying the enterprise growth score effectively prevents unreasonable growth scores from being produced by the screened indexes. When the result of the regression calculation is outside the preset range, the indexes are further screened by the recursive feature elimination method until the regression result of the screened indexes meets the verification requirement.

Preferably, referring to fig. 6, a modeling method for evaluating enterprise growth based on natural language provided by the second embodiment of the present invention specifically includes the following steps:

and S61, constructing an enterprise index system.

And S62, analyzing the index data through technologies such as 5sigma and binning, determining abnormal values in the index data, and perfecting the abnormal values.

And S63, performing weight calculation on the index data after the normalization processing.

And S64, screening the indexes according to the calculated weight result and the correlation degree between the indexes.

And step S65, carrying out weight calculation on the screened indexes again, and calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights.

And S66, carrying out normalization processing on the weighted sum value, wherein the normalized result is the growth score of the enterprise.

And S67, dividing the enterprise into growth stages, namely a growth stage, a maturity stage and a decline stage according to different grading combination intervals.

And S68, training and evaluating the growth score of the enterprise through a random forest algorithm, an Xgboost algorithm and an SVM algorithm, and optimally determining a score threshold value divided into growth stages according to the results of precision ratio, recall ratio, F1 value and AUC value.

The F1 value is a special case of the F-Measure; it is the harmonic mean of the precision ratio and the recall ratio and comprehensively reflects the accuracy of the classification result, and the closer the result is to 1, the higher the accuracy. The AUC (Area Under the ROC Curve) value is a measure of the quality of the classification model. The AUC value is the area below the ROC curve and is a comprehensive index for evaluating the accuracy of the classification model.

And step S69, determining the growth stage of the enterprise according to the growth score.
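Steps S65 and S66 above (weighted sum, then normalization into a growth score) can be sketched as follows. The text does not state which normalization is applied to the weighted sum, so min-max scaling across the evaluated enterprises is assumed here; the function names and the sample data layout are hypothetical.

```python
def weighted_sums(index_values, weights):
    """index_values: enterprise -> {index name -> normalized value}."""
    return {
        enterprise: sum(weights[name] * value for name, value in values.items())
        for enterprise, values in index_values.items()
    }

def min_max_scores(sums):
    """Scale the weighted sums to [0, 1] (assumed min-max normalization)."""
    low, high = min(sums.values()), max(sums.values())
    if high == low:                      # all enterprises tie: avoid division by zero
        return {enterprise: 0.0 for enterprise in sums}
    return {enterprise: (s - low) / (high - low) for enterprise, s in sums.items()}
```

The normalized value is what the later steps compare against the stage thresholds determined in step S68.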

In the embodiment of the present invention, the verification result needs to satisfy:

precision ≥ 0.85, recall ≥ 0.85, F1 ≥ 0.85, AUC > 0.9

Precision and recall can be explained with an example: in information retrieval, one is often concerned with "what proportion of the retrieved information is of interest to the user" and "how much of the information of interest to the user has been retrieved". Precision and recall are the performance metrics suited to these two questions. For a binary classification problem, the "confusion matrix" of the classification results can be listed as follows:

Real situation	Predicted result: positive example	Predicted result: negative example
Positive example	TP (true positive)	FN (false negative)
Negative example	FP (false positive)	TN (true negative)

Generally, when the precision ratio is high, the recall ratio is often low; when the recall ratio is high, the precision ratio is often low.

The AUC value (Area Under the Curve) is a measure of how good a classification model is. To explain this concept, another concept is introduced first: ROC, short for Receiver Operating Characteristic, whose main analysis tool is the ROC curve drawn on a two-dimensional plane. The abscissa of the plane is the False Positive Rate (FPR) and the ordinate is the True Positive Rate (TPR). For a classifier, an (FPR, TPR) point pair can be derived from its performance on the test samples, so the classifier can be mapped to a point on the ROC plane. By adjusting the threshold the classifier uses for classification, a curve passing through (0, 0) and (1, 1) is obtained, which is the ROC curve of the classifier. In general, this curve should lie above the line connecting (0, 0) and (1, 1), because that line represents a random classifier. If a classifier unfortunately lies below this line, an intuitive remedy is to reverse all its predictions, namely: if the classifier outputs the positive class, the final result is the negative class, and vice versa. Although the ROC curve expresses classifier performance intuitively, a single numeric value is more convenient for comparing classifiers, and the Area Under the ROC Curve (AUC) serves this purpose. As the name implies, the AUC value is the size of the area under the ROC curve. Typically, the AUC value lies between 0.5 and 1.0, and a larger AUC represents better performance. Fig. 7 shows the ROC curves of the multi-classification model in the enterprise growth assessment modeling method provided by the second embodiment of the invention.
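The construction of the ROC curve by sweeping the classification threshold, and the area under it, can be illustrated with a short sketch. The function names `roc_points` and `auc` are hypothetical; labels are assumed to be 1 for the positive class and 0 for the negative class, and both classes must be present in the sample.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                       # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points                               # last point is always (1.0, 1.0)

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive sample above every negative sample reaches an area of 1.0, while the diagonal of a random classifier corresponds to 0.5.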

The F1 value is a special case of the F-Measure (also called F-Score), which is commonly used to evaluate classification results in machine learning. The F-Measure is computed from the precision P and the recall R of the positive class, and the formula is as follows:

$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}$$

In this formula, β is a weighting coefficient that sets how much the recall counts relative to the precision. When β takes 1, the result is the most common F1 value, and the formula is as follows:

$$F_1 = \frac{2\,P\,R}{P + R}$$

The F1 value is the harmonic mean of the precision and the recall and comprehensively reflects the accuracy of the classification result; the closer it is to 1, the higher the accuracy.
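Precision, recall and the F-Measure follow directly from the confusion matrix counts. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)); the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of the positive class from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the F1 value."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high F1 value by maximizing precision or recall alone.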

It should be noted that, in the embodiment of the present invention, step S68 specifically includes:

screening out a scoring combination which meets preset requirements according to precision ratio, recall ratio, F1 value and AUC value of a random forest algorithm and an Xgboost algorithm;

the preset requirements are as follows: the precision ratio is not less than 0.85, the recall ratio is not less than 0.85, the F1 value is not less than 0.85, and the AUC value is not less than 0.9;

In the embodiment of the invention, after the scoring combinations whose classification results under both the random forest algorithm and the Xgboost algorithm meet the preset requirements have been selected, they are ranked from high to low by F1 value. The combination with the largest F1 value is selected first and is classified and verified by the SVM algorithm. If the verification result also meets the classification standard, the classification is finished; otherwise the next combination is tried, until a combination meets the requirement.
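The selection procedure of step S68 can be sketched as a filter, a sort and a verification loop. Several details here are assumptions made only for illustration: a scoring combination is represented as a tuple of hypothetical cut-off scores, the metrics of each model are supplied as plain dictionaries, `svm_metrics` is a hypothetical callback that trains and evaluates the SVM for a given combination, and candidates are ranked by the lower of their two F1 values because the text does not say which model's F1 value is used.

```python
REQUIREMENTS = {"precision": 0.85, "recall": 0.85, "f1": 0.85, "auc": 0.9}

def meets_requirements(metrics):
    """True if every metric is not less than its preset requirement."""
    return all(metrics[name] >= limit for name, limit in REQUIREMENTS.items())

def choose_combination(candidates, svm_metrics):
    """candidates: list of {"cutoffs": tuple, "rf": metrics, "xgb": metrics}.

    Returns the first combination that also passes SVM verification, or None.
    """
    eligible = [c for c in candidates
                if meets_requirements(c["rf"]) and meets_requirements(c["xgb"])]
    # Assumption: rank by the weaker of the two F1 values, highest first.
    eligible.sort(key=lambda c: min(c["rf"]["f1"], c["xgb"]["f1"]), reverse=True)
    for candidate in eligible:
        if meets_requirements(svm_metrics(candidate["cutoffs"])):
            return candidate["cutoffs"]
    return None
```

Returning `None` when no combination passes leaves the decision about what to do next (for example, generating new candidate intervals) to the caller.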

It is understood that steps S61 to S66 are identical to those of the first embodiment of the present invention. The difference is that the second embodiment further includes steps for rating the enterprise: in steps S67 to S69, the growth stages are divided into a growth stage, a maturity stage and a decline stage, and the growth scores of the enterprises are trained and evaluated so as to determine the growth stage of each enterprise, thereby accomplishing the purpose of rating the enterprise.

Referring to fig. 8, a modeling system for evaluating enterprise growth according to a third embodiment of the present invention includes:

The system construction module 81: used for constructing an enterprise index system, wherein the index system comprises a structured index system and index data extracted from text contents through natural language processing technology.

The data sorting module 82: used for analyzing the index data through technologies such as 5sigma and binning, determining abnormal values in the index data, and perfecting the abnormal values.

The first calculation module 83: used for carrying out weight calculation on the index data after the normalization processing.

The index screening module 84: used for screening the indexes according to the calculated weight result and the correlation degree between the indexes.

The second calculation module 85: used for carrying out weight calculation on the screened indexes again and calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights.

The scoring module 86: used for normalizing the value of the weighted sum, wherein the normalized result is the growth score of the enterprise.

Wherein the data sorting module 82 comprises:

An abnormal value query unit: used for judging whether corresponding index data exists for the abnormal value within the last 3 years.

A filling unit: used for filling the abnormal value with the average value of the last 3 years when corresponding index data exists.

A deletion unit: used for deleting the abnormal value when no corresponding index data exists.

A normalization unit: used for normalizing the index data through MAX-MIN.
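The query, filling and deletion units can be sketched together with a simple outlier test. The text names "5sigma" without defining it, so the sketch assumes it means flagging values more than five standard deviations from the mean; the function names are hypothetical, and a return value of `None` from `repair_value` stands for "delete the abnormal value".

```python
from statistics import mean, pstdev

def find_outliers(values, k=5.0):
    """Indices of values farther than k standard deviations from the mean (assumed rule)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:                       # all values identical: nothing to flag
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]

def repair_value(history):
    """history: the same index over the last 3 years, with None for missing years.

    Returns the average of the available years, or None when no data exists.
    """
    known = [v for v in history if v is not None]
    return sum(known) / len(known) if known else None
```

Note that a five-sigma rule can only flag a point when the sample is reasonably large, since a single extreme value also inflates the standard deviation it is compared against.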

Further, the system construction module 81 specifically includes:

A first information processing unit: used for processing the enterprise news information containing labels through a natural language processing technology to generate word vectors, wherein the labels comprise positive, negative and neutral.

A classification prediction unit: used for performing tendency classification prediction on the news information through a BERT model in combination with an LSTM algorithm.

A first index modeling unit: used for integrating the predicted results with the training data and counting the percentage of positive news of each enterprise in a preset time period as a first modeling index.

A second information processing unit: used for segmenting the text of the enterprise business scope into words through a natural language processing technology and converting the words into word vectors.

A second index modeling unit: used for counting the number of business scopes of the enterprise and the number of business scope changes, and taking both numbers as a second modeling index.

A third index modeling unit: used for taking the rate of increase or decrease after a change in the registered capital of the enterprise as a third modeling index.
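Once the news items have been classified, the first and third modeling indexes reduce to simple arithmetic. The sketch below assumes the classifier output is already available as a list of the labels "positive", "negative" and "neutral", and assumes the capital change rate is the relative change (after minus before, divided by before); both function names are hypothetical.

```python
def positive_news_ratio(labels):
    """Share of news items labeled "positive" within the chosen time period."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "positive") / len(labels)

def capital_change_rate(before, after):
    """Relative change of registered capital; negative values indicate a decrease."""
    return (after - before) / before
```

For example, an enterprise with two positive items out of four has a ratio of 0.5, and a registered capital that moves from 200 to 150 yields a rate of -0.25.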

Further, the index screening module 84 includes:

A primary screening unit: used for eliminating the indexes whose weight values are smaller than the preset value according to the calculated weight results, so as to obtain primary screening indexes.

A correlation calculation unit: used for calculating the correlation degree between every two primary screening indexes through the Pearson correlation coefficient.

A secondary screening unit: used for eliminating, when the correlation degree between two indexes is greater than a preset value, the index with the smaller weight value, so as to obtain secondary screening indexes.

Further, the system further comprises:

a rating establishing module: the system is used for dividing enterprises into growth stages, namely a growth stage, a maturity stage and a decline stage according to different grading combination intervals;

a training evaluation module: the system is used for training and evaluating the growing scores of enterprises through a random forest algorithm, an Xgboost algorithm and an SVM algorithm, and optimally determining the scoring threshold value divided into the growing stages according to the results of precision ratio, recall ratio, F1 value and AUC value;

a rating determination module: for determining the business growth stage based on the growth score.

Further, the training evaluation module specifically includes:

A condition screening unit: used for screening out the scoring combinations that meet the preset requirements according to the precision ratio, recall ratio, F1 value and AUC value of the random forest algorithm and the Xgboost algorithm.

Specifically, the preset requirements are as follows: the precision ratio is not less than 0.85, the recall ratio is not less than 0.85, the F1 value is not less than 0.85, and the AUC value is not less than 0.9.

A sorting verification unit: used for sorting the scoring combinations that meet the preset requirements in descending order of F1 value and verifying them in sequence through the SVM algorithm.

An end unit: used for taking the current scoring combination as the scoring threshold for dividing the growth stages when that scoring combination meets the verification requirement.

In combination with the above modeling method for evaluating enterprise growth based on natural language, the modeling system for evaluating enterprise growth provided by the invention extracts index data from text contents (enterprise news, business scope and registered capital information) through natural language processing technology, perfects the data by eliminating abnormal data and supplementing missing data, performs weight calculation on the index data, and screens out index data with high similarity and low weight, so that the retained index data provides reliable support for the evaluation of enterprise growth. By calculating the weighted sum of the numerical values of the screened indexes and the corresponding weights and normalizing the result, the growth evaluation of the enterprise is quantified. In this way, the influence of abnormal index data on the final calculation result is eliminated, and the finally calculated enterprise growth score has higher accuracy. Extracting index data through natural language processing technology is also favorable for evaluating the growth of small and medium-sized enterprises, which further makes the growth evaluation result of the enterprise more reliable.

It should be noted that the above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A modeling method for assessing enterprise growth based on natural language, the method comprising:

wherein, in step S12, the step of refining the abnormal value includes:

s124, normalizing the index data through MAX-MIN.

2. The modeling method for assessing enterprise growth based on natural language as claimed in claim 1, wherein said step of extracting index data extracted from text content by natural language processing technique comprises:

3. The modeling method for assessing enterprise growth based on natural language as claimed in claim 1, wherein said step of extracting index data extracted from text content by natural language processing technique further comprises:

and counting the number of the business operation ranges of the enterprise and the number of the business operation range changes, and taking the number of the business operation ranges and the number of the business operation range changes as a second modeling index.

4. The modeling method for assessing enterprise growth based on natural language as claimed in claim 1, wherein said step of extracting index data extracted from text content by natural language processing technique further comprises:

5. The modeling method for assessing enterprise growth based on natural language according to claim 1, wherein the step of screening the indicators according to the calculated weight results and the correlation between the indicators comprises:

6. The modeling method for assessing enterprise growth based on natural language according to claim 1, wherein after the step of calculating the weighted sum of the numerical values of the filtered indicators and the corresponding weights, the method further comprises:

if the enterprise growth score is within a preset range, executing step S16;

7. The modeling method for assessing enterprise growth based on natural language as claimed in claim 1, wherein said step of normalizing the weighted sum further comprises:

training and evaluating the growth score of an enterprise through a random forest algorithm, an Xgboost algorithm and an SVM algorithm, and optimally determining a score threshold value divided into a growth stage according to the results of precision ratio, recall ratio, F1 value and AUC value;

and determining the growth stage of the enterprise according to the growth score.

8. The modeling method for assessing enterprise growth based on natural language as claimed in claim 7, wherein said step of training and evaluating the growth score of the enterprise through a random forest algorithm, an Xgboost algorithm and an SVM algorithm, and optimally determining the score threshold values for dividing the growth stages according to the results of precision ratio, recall ratio, F1 value and AUC value comprises:

screening out a scoring combination with the precision ratio, the recall ratio, the F1 value and the AUC value meeting preset requirements according to a random forest algorithm and an Xgboost algorithm; the preset requirements are as follows: the precision ratio is not less than 0.85, the recall ratio is not less than 0.85, the F1 value is not less than 0.85, and the AUC value is not less than 0.9;

9. A modeling system for assessing enterprise growth, the system comprising:

a second calculation module: the weighting device is used for carrying out weight calculation on the screened indexes again and calculating the weighted sum of the numerical values of all the screened indexes and the corresponding weights;

a scoring module: the system is used for carrying out normalization processing on the weighted sum value, and the normalized result is the growth score of the enterprise;

wherein, the data arrangement module includes:

abnormal value query unit: judging whether the abnormal value has corresponding index data within 3 years;

a filling unit: the abnormal value is filled by the average value of the last 3 years when corresponding index data exists;

a normalization unit: the index data is normalized through MAX-MIN.

10. The system of claim 9, wherein the architecture module comprises:

a classification prediction unit: the method is used for performing tendency classification prediction on news information through a BERT model in combination with an LSTM algorithm;