CN111581193A - Data processing method, device, computer system and storage medium - Google Patents


Info

Publication number
CN111581193A
Authority
CN
China
Prior art keywords
data
database
data processing
factor
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010343425.8A
Other languages
Chinese (zh)
Inventor
罗力力
洪钰
詹天钰
白育龙
孙海容
罗水权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Asset Management Co Ltd
Original Assignee
Ping An Asset Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Asset Management Co Ltd filed Critical Ping An Asset Management Co Ltd
Priority to CN202010343425.8A
Publication of CN111581193A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention provides a big data processing method, device, computer system and storage medium. Data are stored in a database; unstructured data and missing data in the database are extracted; the unstructured data are collated into structured data and the missing data are filled in to obtain processed data; a factor library is constructed from the processed data and subjected to a detrending operation and a standardization operation; and the factors in the factor library are input into a training model, which is run to obtain a model training result. By collating the unstructured data collected in the database into structured data and filling in the missing data, the data processing method, device, computer system and storage medium prevent errors caused by missing data and thereby reduce non-systematic data-entry errors. The invention also relates to blockchain technology: the obtained model training result can be stored in a blockchain node.

Description

Data processing method, device, computer system and storage medium
Technical Field
The present invention relates to the field of big data processing technologies, and in particular, to a data processing method, device, computer system, and storage medium.
Background
In computer applications, manually processed data is often required as input, and this input relies on a large amount of human resources; the information and data required in the work are numerous and complicated, and the data sources are wide. If much of the data must be entered and judged manually, the efficiency and coverage of the system drop sharply, and when new information appears it cannot be reflected unless it is supplemented through manual intervention.
In particular, when various data are applied to credit rating, traditional rating approaches often require a great deal of information, and a large amount of human resources is usually spent simply preparing the data as input. Because of the diversity of information sources, data processing is especially important; the information and data needed for rating are numerous and complicated, the data sources are very wide, and data intake is correspondingly inconvenient.
Disclosure of Invention
It is an object of the present invention to provide a data processing method, device, computer system and storage medium to solve the above-mentioned problems in the prior art.
In order to achieve the above object, the present invention provides a data processing method, comprising the steps of:
storing the data in a database;
extracting unstructured data and missing data in a database, decomposing the unstructured data into structured data, and performing filling processing on the missing data to generate processed data;
and classifying the processed data, constructing a factor library, and performing a detrending operation and a standardization operation on the factor library, wherein the factors in the factor library are input into a training model, which is run to obtain a model training result.
Further, deconstructing the unstructured data in the database into structured data includes parsing the data information using text mining, optical character recognition and semantic analysis, and collating the unstructured data into structured data.
Further, techniques employed to store data in the database include utilizing crawler technology and big data technology.
Further, the detrending operation subtracts a least-squares-fitted straight line, plane or curved surface from the data, so that the mean value of the detrended data is zero.
Further, the normalization operation processes the maximum value and the minimum value of the factor to define the upper and lower limits of the factor.
Further, the method for filling the missing data includes filling based on statistical principles and machine learning methods, and the filled data values include default values, mean values, median values, mode values, data from the preceding or following records, interpolated data, nearest-neighbour data or predicted values.
Further, the method includes analyzing the factors in the factor library using the centroid method, image factoring, the maximum likelihood solution, the least squares method, alpha factoring or Rao's canonical factoring.
In order to achieve the above purpose, the invention provides a data processing device comprising a data acquisition module for storing data in a database; a data processing module for extracting the unstructured data and missing data in the database, deconstructing the unstructured data into structured data, and filling in the missing data to generate processed data; and a data processing module for classifying the processed data to construct a factor library and performing a detrending operation and a standardization operation on the factor library, the factors in the factor library being input into a training model, which is run to obtain a model training result.
In order to achieve the above object, the present invention further provides a computer system comprising a plurality of computer devices, each computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processors of the plurality of computer devices collectively implementing the steps of the aforementioned method when executing the computer program.
In order to achieve the above object, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor implements the steps of the aforementioned method.
By adopting the technical scheme, compared with the prior art, the invention has the following beneficial effects:
the invention provides a data processing method, device, computer system and storage medium. Unstructured data collected in the database are deconstructed into structured data and the missing data are filled in; the structured data facilitate subsequent collation, errors caused by missing data are prevented, non-systematic data-entry errors are reduced and accuracy is ensured. A factor library is constructed from the processed data so that the data are classified and integrated, the factors in the factor library are input into a training model, which is run to obtain a model training result, and the factor library improves the working efficiency and coverage of data processing.
Drawings
FIG. 1 is a flow chart of a data processing method of the present invention;
FIG. 2 is a block diagram of one embodiment of a data processing apparatus of the present invention;
FIG. 3 is a hardware architecture diagram of one embodiment of the computer device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The scheme can be used to process data in various industries. For example, it can be applied to the credit rating process, where the collected data are processed so that the model gives an objective evaluation of the condition of the object to be evaluated, as described in detail below.
Example one
Referring to fig. 1, a data processing method of the present invention is shown, including the steps of:
Step S10, a data acquisition process: the data sources are combed and massive market information is stored in a database. In combing the data sources, the sources can include web page information, enterprise account statements, internal and external reports, published specifications, issued announcements, government-published bulletins and other channels; any valuable information can serve as a source of collected data, and the PDFs, websites and web page locations containing the data are all brought into the database. Massive market information can be stored in the database using crawler technology and big data technology, and can be retrieved in time.
Crawler technology is mainly aimed at web pages: it can automatically browse information on the network, is widely used by internet search engines and similar websites to obtain or update site content and indexes, and can automatically collect all page content it can access. Information stored in the various websites in the data source is collected by crawler technology to form structured data, and the process is: obtain the website address, set the user-agent, request the url, obtain the response, extract the specified data from the source code, and then process, clean and store the data. Big data technology can use distributed storage and distributed processing to perform batch computing, stream computing, graph computing and query-analysis computing, realizing batch processing of large-scale data, real-time computation of streaming data, and processing, storage, query and analysis of graph-structured data.
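As an illustration of the crawling sequence just described (obtain the address, set the user-agent, request the url, obtain the response, extract and store the data), the following is a minimal Python sketch; it assumes the commonly used requests and BeautifulSoup libraries, and the URL and CSS selector are hypothetical placeholders rather than part of this disclosure.

import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    """Fetch a page with a declared user-agent and extract the target fields from its source code."""
    headers = {"User-Agent": "Mozilla/5.0 (data-collection bot)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selector: the actual selector depends on the page layout being crawled.
    return [node.get_text(strip=True) for node in soup.select("table td")]

if __name__ == "__main__":
    rows = crawl_page("https://example.com/financial-report")  # hypothetical URL
    print(rows[:10])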
Step S20, a data processing process: extract the unstructured data and missing data in the database, deconstruct the unstructured data into structured data, and fill in the missing data using statistical principles and machine learning methods to generate processed data. After the data are collected into the database, standard structured data can be applied directly; however, the collected data often include a large amount of unstructured data that needs further processing. Specifically, the data information can be deconstructed through text mining, OCR and semantic analysis, and the unstructured data in the database are trained and processed; that is, the data information parsed from the unstructured data is reassembled into structured data, so that the unstructured data are collated into structured data.
Structured data, also called quantitative data, is information that can be represented with numbers, symbols or another unified structure. In projects, such data is generally stored and managed in a relational database. Sources of structured data include enterprise ERP systems, financial systems, medical HIS databases, educational one-card systems, government administrative approval systems and other core databases. Structured data is easily searched by computer programs using a structured query language such as SQL, and its clear relationships make it convenient to use.
Unstructured data essentially comprises all data other than structured data. It usually does not conform to any predefined model and is generally stored in a non-relational database; it can be textual or non-textual, and man-made or machine-generated. Documents, pictures, videos and audio all belong to unstructured data. Unstructured data appears as data with variable fields, and the required structured data can be formed by extracting and fixing fields from it. Unstructured data may take the form of video, audio, pictures, images, documents and text, and may come from specific applications such as medical imaging systems, educational video-on-demand, video surveillance, land-resource GIS, design institutes, file servers (PDM/FTP), media asset management and so on.
OCR is performed on picture information in PDFs or web pages, and useful information is extracted from the disordered data to form structured data. The process is: input the image; preprocess it (including binarization, noise removal and tilt correction); perform layout analysis (segmenting the document image into blocks); cut the characters; recognize the characters; and post-process and proofread the result. NLP parsing can also be performed on announcement-type PDF information in the data source; the process is: important information in a certain number of issued documents is first extracted manually to form structured data that can be used as training samples.
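A minimal sketch of the OCR pipeline above (binarization, noise removal, recognition), assuming the OpenCV and pytesseract libraries and a suitable language pack; the input file name is a placeholder.

import cv2
import pytesseract

def ocr_image(path):
    """Binarize and denoise a scanned page, then run optical character recognition on it."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Preprocessing: Otsu binarization followed by light median-filter denoising.
    _, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    denoised = cv2.medianBlur(binary, 3)
    # Recognition; the language packs (e.g. chi_sim for Chinese) must be installed separately.
    return pytesseract.image_to_string(denoised, lang="chi_sim+eng")

print(ocr_image("report_page.png"))  # hypothetical input image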
Text mining can obtain valuable information and knowledge from text data and can classify and cluster texts, the former being a supervised mining algorithm and the latter an unsupervised one. The text is first obtained and the file data imported; the files are then preprocessed, with noisy files eliminated to improve mining precision, or only a subset of samples selected to improve mining efficiency when the number of files is excessive; linguistic processing of the text follows, including word segmentation and part-of-speech tagging; finally, feature extraction is performed on the text to obtain the required file information. Semantic analysis mainly concerns semantic information such as the meanings, topics, categories and similarity of words, sentences and paragraphs; in natural language processing, semantic analysis mainly comprises two types, one based on knowledge or semantic rules and one based on statistics.
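The text-mining steps above (word segmentation, feature extraction, clustering) might look like the following sketch; the jieba segmenter and scikit-learn are assumptions made for illustration, and the sample sentences are invented.

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "公司本年度净利润大幅增长",   # invented sample texts; a real corpus would come from the database
    "发行人资产负债率持续上升",
    "行业景气度回落导致收入下滑",
]

# Word segmentation followed by TF-IDF feature extraction.
segmented = [" ".join(jieba.cut(doc)) for doc in documents]
features = TfidfVectorizer().fit_transform(segmented)

# Unsupervised clustering of the documents; supervised classification would use labelled samples instead.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)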
The information in the database may also contain missing data, i.e. data whose values are defaults or could not be collected for various reasons; the missing data must be handled during data processing, and processed data can be obtained by completing it using statistical principles and machine learning methods. Statistical principles measure, collect, sort, summarize and analyse data regularities on the basis of data analysis and compute various analytical indicators; the measures of the data include the mean, median, mode, expectation, variance, standard deviation and standard score, which reveal the essence and the regularity of development and change of social and economic activity. The probability distributions involved include the geometric distribution, binomial distribution, normal distribution and Poisson distribution, and the correlations between variables include positive and negative linear correlation and quantitative relations such as best-fit-line prediction, linear regression and logistic regression.
A machine learning method mines the rules and essence behind the objects by training an algorithm on data, and can be described in terms of data, algorithm and model; the model in a machine learning method is a numerical abstraction of real-world rules and essence. A machine learning method selects, from many algorithms, one suitable for the business scenario according to the scenario and experience; the algorithms include decision trees, naive Bayes, SVM, KNN, linear regression, logistic regression, neural networks and so on. By learning mode, machine learning can be divided into supervised learning, unsupervised learning and semi-supervised learning, the main difference being whether the correct answer is given to the machine in the training set: a training set that gives the correct answers yields supervised learning, which is mainly used for classification prediction and regression prediction; a training set that does not specify correct answers yields unsupervised learning, which is mainly used for clustering.
For the missing values of the missing data, the following missing-value filling methods can be trained and tested, and one of them can be selected in different scenarios: fill in a default value; fill in the mean; fill in the median; fill in the mode; fill in the data of the preceding or following record; fill in interpolated data; fill in nearest-neighbour data (use kNN to find the k nearest records and use their mean as the fill value); or fill in a predicted value.
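For illustration, the filling strategies listed above can be expressed with pandas and scikit-learn as in the following sketch; the column names and values are hypothetical.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "gross_margin": [0.31, np.nan, 0.28, 0.35, np.nan],
    "debt_ratio":   [0.55, 0.62, np.nan, 0.48, 0.51],
})

mean_filled   = df.fillna(df.mean())            # mean filling
median_filled = df.fillna(df.median())          # median filling
mode_filled   = df.fillna(df.mode().iloc[0])    # mode filling
ffill_filled  = df.ffill().bfill()              # fill with the preceding/following record
interp_filled = df.interpolate()                # interpolation filling
knn_filled    = pd.DataFrame(                   # nearest-neighbour filling (mean of the k nearest records)
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
print(knn_filled)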
A set of samples can be collected and randomly divided into a training set and a test set; for example, the independent variables are a set of enterprise financial data and the dependent variable is the enterprise rating result. If a certain financial ratio (such as the gross profit margin) is missing for some enterprises, the preset missing-value filling methods can each be applied in the same way, and the filling method with the best performance on the test set is selected.
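A sketch of that selection procedure, under the assumption that the rating result is a numeric score and that the candidate filling methods are compared by their test-set error; all data here are synthetic.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["margin", "leverage", "growth"])
y = 2 * X["margin"] - X["leverage"] + rng.normal(scale=0.1, size=200)   # synthetic rating score
X.loc[rng.choice(200, 40, replace=False), "margin"] = np.nan            # inject missing gross margins

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each candidate filling method is computed from the training set and applied to both sets.
strategies = {
    "default": lambda tr, te: (tr.fillna(0.0), te.fillna(0.0)),
    "mean":    lambda tr, te: (tr.fillna(tr.mean()), te.fillna(tr.mean())),
    "median":  lambda tr, te: (tr.fillna(tr.median()), te.fillna(tr.median())),
}
for name, fill in strategies.items():
    tr, te = fill(X_train, X_test)
    error = mean_squared_error(y_test, LinearRegression().fit(tr, y_train).predict(te))
    print(f"{name}: test MSE = {error:.4f}")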
After the data information is collected and processed, each piece of data information carries the corresponding name or id obtained during collection and the time at which the information was released; these identifiers become the anchor points for automatically integrating the information, and the information between the two working nodes is collected into a data sample at the end of the time interval. All data are collected into samples; the identifier of each sample is a name or ID together with a processing result at a certain point in time, and the qualitative and quantitative information of that time interval is gathered into the sample data.
Data extraction can be used during data processing: existing fields in the database are integrated and processed to form the data required for analysis, which can include field splitting, field merging and field matching. Field splitting divides one field into several new fields; field merging synthesizes several fields into a new field, or combines field values with characters, numbers and the like to form a new field; field matching obtains the required data from associated databases that share a field. Generally, field matching requires at least one associated field between the original database and the associated database, and the data corresponding to a batch query are matched according to the associated fields.
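A pandas sketch of field splitting, field merging and field matching on two hypothetical tables that share an associated field:

import pandas as pd

companies = pd.DataFrame({"company_id": [1, 2], "full_name": ["Alpha Ltd Beijing", "Beta Ltd Shanghai"]})
ratings   = pd.DataFrame({"company_id": [1, 2], "rating": ["AA", "A+"]})

# Field splitting: break one field into several fields.
companies[["name", "city"]] = companies["full_name"].str.rsplit(" ", n=1, expand=True)
# Field merging: combine field values (and literal characters) into a new field.
companies["label"] = companies["name"] + " (" + companies["city"] + ")"
# Field matching: obtain the required data from an associated table via the shared field.
merged = companies.merge(ratings, on="company_id", how="left")
print(merged)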
Because data from different sources may have different structures, data conversion can be used during data processing; it mainly means converting the data into a standard, clear and easily analysed structure and can include structure conversion and row-column conversion. Structure conversion reshapes the data according to different business requirements and mainly refers to conversion between a one-dimensional data table and a two-dimensional data table. When preparing data analysis reports, the data often need to be viewed from different dimensions, for example summarized data viewed along the time dimension or along the regional dimension, so row and column data need to be converted.
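A sketch of structure conversion and row-column conversion with pandas, using an invented one-dimensional (long) revenue table:

import pandas as pd

long_table = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [10.0, 12.0, 7.0, 9.5],
})

# Structure conversion: one-dimensional (long) table to two-dimensional (wide) table.
wide = long_table.pivot(index="company", columns="quarter", values="revenue")
# Row-column conversion back to the long form, e.g. to summarize along a different dimension.
long_again = wide.reset_index().melt(id_vars="company", value_name="revenue")
print(wide)
print(long_again)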
Sometimes the required fields are not available in the database but can be obtained by calculation from existing fields; data calculation is then required in data processing, mainly comprising simple calculation and date-and-time calculation. Simple calculation performs operations such as addition, subtraction, multiplication and division on the data values and generates a new field; date-and-time calculation concerns the management and analysis of date and time data, which is an important type of data in enterprise management databases.
Step S30, a data processing process: classify the processed data, construct a factor library, and perform a detrending operation and a standardization operation on the factor library; the factors in the factor library are input into a training model, which is run to obtain a model training result. When the processed data are used, the information can be integrated automatically and a calculation result given by the model. The factors in the established factor library comprise key information or indicators, for example the key information or indexes extracted from rating reports and research reports in enterprise information, and can also be indicators accumulated from expert experience, including net profit margin, asset-liability ratio, enterprise ownership, revenue growth rate, asset scale, the administrative level of the government behind an urban investment company, and so on. Factors used historically and factors that may be useful are saved, and can be obtained from quotation data, financial data, industry data, expectation data and the like.
In this embodiment, taking economics as an example, the factor library can be divided into four categories: financial, market, expectation and other factors. After the categories are determined, they are subdivided: financial factors can include valuation factors, scale factors, growth factors, profitability factors, operating-capacity factors, leverage factors and the like; market factors can include stock price factors, momentum factors, technical factors, volatility factors, liquidity factors and the like; expectation factors include earnings expectation factors and the like; other factors include shareholder-related factors, Beta factors and so on. In other embodiments, a corresponding factor library can be generated by sorting the data samples.
In this embodiment, for an enterprise, the content of the existing factors can be divided into the enterprise's natural condition, asset and profit condition, transaction condition, financing and funding condition, social credit system record, company risk client list and the like. The natural condition includes registered capital and ownership type; the asset and profit condition includes last year's profit, the company's net assets, current ratio and liability ratio; the transaction condition includes profitability, turnover rate and investment preference; the financing and funding condition includes the financing situation and the funding situation. Common factors are extracted from the data for statistical processing, and variables of the same essence are grouped into one factor. The factors can be obtained by comparing variables across samples and eliminating sample differences, and factor analysis can be performed on the factors in the factor library, for example with the centroid method, image factoring, the maximum likelihood solution, the least squares method, alpha factoring or Rao's canonical factoring, so that the factors formed from the corresponding information correspond to the factor library. Factors that may be useful also include external information or newly generated data, such as industry emergencies, international events and major technological breakthroughs.
In order to improve the accuracy of data processing, the factor library is subjected to detrending and standardization operations. Detrending can subtract an optimal (least-squares) fitted straight line, plane or curved surface from the data, so that the mean of the detrended data is zero; detrending eliminates the influence of offsets introduced during data acquisition on later calculations, so that data processing concentrates on the fluctuation of the data around its trend. Detrending specifically comprises the following steps: (1) for a time series x(t) of length N, t = 1, 2, ..., N, compute the cumulative deviations and convert the series into a new sequence
y(t) = Σ_{i=1}^{t} [x(i) − x̄],
where x̄ is the mean of the time series:
x̄ = (1/N) Σ_{t=1}^{N} x(t);
(2) divide y(t) into m non-overlapping intervals of equal length n, where n is the interval length, i.e. the time scale, and m is the number of intervals (or windows), i.e. the integer part of N/n; (3) fit the local trend y_n(t) of each segment by least squares; (4) remove the local trend y_n(t) from each interval and compute the root mean square of the new sequence:
F(n) = sqrt( (1/N) Σ_{t=1}^{N} [y(t) − y_n(t)]² );
(5) repeat steps (2), (3) and (4) while varying the window length n, thereby obtaining the data with the linear trend removed.
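A NumPy sketch of steps (1)-(5), assuming a linear (first-order) least-squares local trend in each window; the input series is synthetic.

import numpy as np

def detrended_fluctuation(x, window):
    """Apply steps (1)-(4) for one window length n and return the root-mean-square fluctuation."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())                           # step (1): cumulative deviation series y(t)
    m = len(y) // window                                  # step (2): number of non-overlapping windows
    t = np.arange(window)
    residuals = []
    for k in range(m):
        segment = y[k * window:(k + 1) * window]
        coef = np.polyfit(t, segment, deg=1)              # step (3): least-squares local trend y_n(t)
        residuals.append(segment - np.polyval(coef, t))   # step (4): remove the local trend
    return float(np.sqrt(np.mean(np.concatenate(residuals) ** 2)))

series = np.random.default_rng(0).normal(size=1000).cumsum()
for n in (16, 64, 256):                                   # step (5): repeat for several window lengths n
    print(n, detrended_fluctuation(series, n))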
The standardization operation can process the maximum and minimum values of a factor to define its upper and lower limits, and can normalize the factor values (sigmoid processing so that they conform to a distribution with mean 0 and standard deviation 1). Take the net profit margin as an example: the net-profit-margin factor is calculated from the financial data as net profit / revenue; then, according to the distribution of the net profit margin over the sample companies, the 5% and 95% quantiles are taken as the minimum and maximum values, and values below or above these bounds are uniformly set to the corresponding extreme value. Where applicable, the formula
S(x) = 1 / (1 + e^(−x))
is used to process the net profit margin with the sigmoid function so that it conforms to the standard normal distribution.
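A sketch combining the quantile clipping, zero-mean/unit-variance scaling and sigmoid squashing described above; the sample of net profit margins is synthetic.

import numpy as np

def standardize_factor(values):
    """Clip a raw factor at its 5%/95% quantiles, scale to mean 0 and standard deviation 1, then apply a sigmoid."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.quantile(values, [0.05, 0.95])   # treat values beyond the quantiles as the extremes
    clipped = np.clip(values, lower, upper)
    z = (clipped - clipped.mean()) / clipped.std()     # mean 0, standard deviation 1
    return 1.0 / (1.0 + np.exp(-z))                    # S(x) squashes the factor into (0, 1)

net_margin = np.random.default_rng(1).normal(loc=0.08, scale=0.05, size=500)  # synthetic net profit margins
print(standardize_factor(net_margin)[:5])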
Furthermore, the accuracy of data processing can be improved through feature engineering: using an ensemble method and machine learning techniques such as xgboost, the most useful factors can be screened out according to their degree of influence on the rating result (for example, the top 50). As feature engineering, the data useful for the rating result are analysed and extracted, and the resulting factors are generated and stored in the factor library.
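The screening step could be sketched with the xgboost scikit-learn wrapper as follows; the factor matrix and rating score are synthetic, and keeping the top 5 stands in for the top 50 mentioned above.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
factors = pd.DataFrame(rng.normal(size=(500, 20)), columns=[f"factor_{i}" for i in range(20)])
rating = 1.5 * factors["factor_3"] - factors["factor_7"] + rng.normal(scale=0.2, size=500)

model = XGBRegressor(n_estimators=200, max_depth=3).fit(factors, rating)

# Rank the factors by their influence on the rating result and keep the most useful ones.
importance = pd.Series(model.feature_importances_, index=factors.columns)
print(importance.sort_values(ascending=False).head(5))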
Data processing can be optimized with a training model, which is made to perform better on the training set by setting the model's parameters. For a regression model the root mean square error can be used: to train a linear regression model, a value of theta must be found that minimizes the root mean square error (standard error). In the training model, the association between the factors and the processing results can be found through machine learning; the training model can automatically select factors from the factor library, calculate and give the processing results, and when new information is updated the model can be triggered to recalculate, thereby reflecting the latest dynamics and processing results and achieving full coverage over the whole period. The scheme can feed any training model once the data have been collected and processed; the application of a specific training model is based on this data processing scheme. Further, the obtained model training result can be stored in a blockchain network.
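As a worked illustration of the regression case, the theta that minimizes the root mean square error has the classical least-squares solution; the following sketch computes it on synthetic data.

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]       # design matrix with an intercept column
theta_true = np.array([0.5, 2.0, -1.0])
y = X @ theta_true + rng.normal(scale=0.1, size=100)

# Closed-form least-squares solution: the theta that minimizes the root mean square error.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(np.mean((X @ theta_hat - y) ** 2))
print(theta_hat, rmse)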
Based on the factor library and the feature engineering, training of the training model can be automated: when the data samples are updated, the training model re-adjusts its parameters according to the model evaluation criteria and rules, so that the model is iteratively updated automatically. N models (for example various machine learning methods, including random forests, xgboost, SVM, Bayesian classifiers and the like) can be deployed on the system; the models perform supervised learning based on the data X and Y of the training sample set, and the model with the highest c value is selected automatically according to each model's results on its test set. Under specific trigger conditions, the model can also automatically change its structure and input data and re-screen effective factors from the factor library. When the effective factors are updated, the factors with higher significance can be selected as the input data set using feature engineering; when the model structure is updated, the models on the list are tested in turn, the optimal model is selected according to the model evaluation criteria, and the automatic update is completed.
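A sketch of deploying several candidate models and keeping the one that scores best on the test set; here the evaluation standard is taken to be the AUC (one common reading of the "c value"), and the data are synthetic.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "naive_bayes":   GaussianNB(),
    "svm":           SVC(probability=True, random_state=0),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

best = max(scores, key=scores.get)
print(scores, "selected:", best)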
Example two
As shown in FIG. 2, a data processing device 10 of this embodiment includes a data acquisition module 11, a data processing module 12 and a data processing module 13. The data acquisition module 11 is configured to store data in a database; the data processing module 12 is configured to extract the unstructured data and missing data in the database, deconstruct the unstructured data into structured data, and fill in the missing data to generate processed data; the data processing module 13 is configured to classify the processed data, construct a factor library, and perform a detrending operation and a standardization operation on the factor library, the factors in the factor library being input into a training model, which is run to obtain a model training result.
In this embodiment, the data acquisition module 11 collects data generated from the actual data of social and economic activities and can obtain the corresponding data and store it in the database using crawler technology and big data technology. Crawler technology can capture specified website data, such as shopping reviews on shopping websites, for a specific website or App; big data technology extracts, converts and loads the data and finally mines its potential value, and the processing and analysis objects are formed by combing the data sources. Any valuable information can serve as a source of collected data. Data acquisition is an important part of the data analysis and processing life cycle, and all kinds of structured and unstructured mass data can be obtained by means of sensor data, social network data, mobile internet data and the like, although much of the collected data cannot be used directly for analysis. At present, the common internet data acquisition modes are APP-side acquisition and web-side acquisition: the most common mode of APP-side acquisition is through an integrated SDK, while web-side acquisition can use embedded tracking points, collecting client-side or browser cookie information and the like with JS and sending it to a group of back-end servers, and can also be performed through web service logging, JS embedded collection or packet sniffing.
The data stored in the database are processed by the data processing module 12, and the unstructured data in the database are deconstructed into structured data. Because its structure is irregular or incomplete and it has no predefined data model, unstructured data is not conveniently represented in the two-dimensional logical tables of a database; converting unstructured data into structured data enriches the data sources, puts more data to effective use, and improves the accuracy of data processing. At the same time, the missing data are filled in: data gaps often arise during data generation or data acquisition, and if the missing data were not handled they would cause errors in data processing. The processed data obtained through the filling process reduce noise and improve the compatibility of the feature data. The data processing module thus reduces non-systematic data-entry errors and improves the accuracy of data processing.
After the processed data are obtained, a factor library is constructed by the data processing module 13 so that the processed data can be used for analysis; the detrending operation and the standardization operation are performed on the factor library to make it clean data, and the factors in the factor library are input into a training model, which is run to obtain a model training result, so that a calculation result can be obtained objectively.
EXAMPLE III
As shown in FIG. 3, the computer system includes a plurality of computer devices 20. The components of the data processing device of the second embodiment may be distributed over different computer devices 20, and a computer device 20 may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of several servers) that executes a program. The computer device 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 communicatively connected to each other by a system bus. It is noted that FIG. 3 only shows the computer device 20 with components 21-22, but it should be understood that not all of the shown components need be implemented, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e. a readable storage medium) includes flash memory, a hard disk, a multimedia card, a card-type memory (e.g. SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as its hard disk or memory. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device 20. Of course, the memory 21 may also include both the internal and the external storage device of the computer device 20. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer system, such as the program code of the second embodiment and of a Python algorithm model. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 22 is typically used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or to process data. When the processors 22 of the plurality of computer devices 20 of the computer system of this embodiment jointly execute the computer program, the data processing method of the first embodiment is implemented.
Example four
This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g. SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an App application store and the like, on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment stores the data processing device 10 of the second embodiment and, when executed by a processor, implements the data processing method of the first embodiment. Further, the computer-usable storage medium may mainly include a stored program area and a stored data area, where the stored program area may store the operating system, an application program required for at least one function, and the like, and the stored data area may store data created from the use of the blockchain node, and the like.
The embodiments of the present scheme can be applied to the credit rating process. By using recognition technology and mining technology, the information extraction process is fully automated, most human resources can be saved, non-systematic data-entry errors can be reduced to the greatest extent, and accuracy is ensured. Rating efficiency and coverage can be improved: because current ratings rely on a large amount of manual work, the problems of slow updating and low coverage are greatly alleviated. Thanks to the optimized data processing, information can be updated on the same day, and enterprise ratings for the whole market are updated on the same day. The rating result is standardized: the model automatically adapts to market changes, the objectivity of the rating result is ensured, rating quality can be controlled, and the results are standardized.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general hardware platform, and certainly also by hardware, but in many cases the former is the better implementation.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A data processing method, comprising the steps of:
storing the data in a database;
extracting unstructured data and missing data in a database, decomposing the unstructured data into structured data, and performing filling processing on the missing data to generate processed data;
and classifying the processed data, constructing a factor library, and performing a detrending operation and a standardization operation on the factor library, wherein the factors in the factor library are input into a training model, which is run to obtain a model training result.
2. The data processing method of claim 1, wherein deconstructing unstructured data into structured data comprises deconstructing data information using text mining, optical character recognition, and semantic analysis to collate unstructured data into structured data.
3. The data processing method of claim 1, wherein the techniques employed to store the data in the database include utilizing crawler technology and big data technology.
4. The data processing method of claim 1, wherein the detrending operation subtracts a least-squares-fitted straight line, plane or curved surface from the data so that the mean of the detrended data is zero.
5. The data processing method of claim 1, wherein the normalizing operation processes the maximum and minimum values of the factor to define the upper and lower limits of the factor.
6. The data processing method of claim 1, wherein the method of filling the missing data comprises filling based on statistical principles and machine learning methods, and the filled data values comprise default values, mean values, median values, mode values, data from the preceding or following records, interpolated data, nearest-neighbour data or predicted values.
7. The data processing method of claim 1, further comprising analyzing the factors in the factor library using a method selected from the group consisting of the centroid method, image factoring, the maximum likelihood solution, the least squares method, alpha factoring and Rao's canonical factoring.
8. A data processing apparatus, characterized by comprising:
the data acquisition module is used for storing data in a database;
the data processing module is used for extracting unstructured data and missing data in the database, decomposing the unstructured data into structured data, and performing filling processing on the missing data to generate processed data;
and the data processing module is used for classifying the processed data, constructing a factor library, and performing a detrending operation and a standardization operation on the factor library, wherein the factors in the factor library are input into a training model, which is run to obtain a model training result.
9. A computer system comprising a plurality of computer devices, each computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processors of the plurality of computer devices collectively implement the steps of the method of any one of claims 1 to 7 when the computer program is executed.
10. A computer-readable storage medium comprising a stored data area storing data created from use of blockchain nodes and a stored program area storing a computer program, wherein the computer program stored by the storage medium implements the steps of the method of any one of claims 1 to 7 when executed by a processor.
CN202010343425.8A 2020-04-27 2020-04-27 Data processing method, device, computer system and storage medium Pending CN111581193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010343425.8A CN111581193A (en) 2020-04-27 2020-04-27 Data processing method, device, computer system and storage medium


Publications (1)

Publication Number Publication Date
CN111581193A true CN111581193A (en) 2020-08-25

Family

ID=72115017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010343425.8A Pending CN111581193A (en) 2020-04-27 2020-04-27 Data processing method, device, computer system and storage medium

Country Status (1)

Country Link
CN (1) CN111581193A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949646A (en) * 2020-09-03 2020-11-17 平安国际智慧城市科技股份有限公司 Big data-based equipment running condition analysis method, device, equipment and medium
CN112199374A (en) * 2020-09-29 2021-01-08 中国平安人寿保险股份有限公司 Data feature mining method aiming at data missing and related equipment thereof
CN112650806A (en) * 2020-12-30 2021-04-13 邦邦汽车销售服务(北京)有限公司 ERP system docking accessory data standardization method and device and storage medium
CN113420785A (en) * 2021-05-31 2021-09-21 北京联合大学 Method and device for classifying written corpus types, storage medium and electronic equipment
CN113434365A (en) * 2021-06-28 2021-09-24 平安银行股份有限公司 Data characteristic monitoring method and device, electronic equipment and storage medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170909A (en) * 2017-12-13 2018-06-15 中国平安财产保险股份有限公司 Model output method, equipment and the storage medium of a kind of intelligent modeling
CN109190927A (en) * 2018-08-13 2019-01-11 阿里巴巴集团控股有限公司 Credit-graded approach, system, equipment and computer readable storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949646A (en) * 2020-09-03 2020-11-17 平安国际智慧城市科技股份有限公司 Big data-based equipment running condition analysis method, device, equipment and medium
CN111949646B (en) * 2020-09-03 2023-12-01 平安国际智慧城市科技股份有限公司 Equipment running condition analysis method, device, equipment and medium based on big data
CN112199374A (en) * 2020-09-29 2021-01-08 中国平安人寿保险股份有限公司 Data feature mining method aiming at data missing and related equipment thereof
CN112199374B (en) * 2020-09-29 2023-12-05 中国平安人寿保险股份有限公司 Data feature mining method for data missing and related equipment thereof
CN112650806A (en) * 2020-12-30 2021-04-13 邦邦汽车销售服务(北京)有限公司 ERP system docking accessory data standardization method and device and storage medium
CN113420785A (en) * 2021-05-31 2021-09-21 北京联合大学 Method and device for classifying written corpus types, storage medium and electronic equipment
CN113420785B (en) * 2021-05-31 2023-12-19 北京联合大学 Method and device for classifying written language types, storage medium and electronic equipment
CN113434365A (en) * 2021-06-28 2021-09-24 平安银行股份有限公司 Data characteristic monitoring method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11663254B2 (en) System and engine for seeded clustering of news events
US10685044B2 (en) Identification and management system for log entries
US20230222366A1 (en) Systems and methods for semantic analysis based on knowledge graph
CN111581193A (en) Data processing method, device, computer system and storage medium
US7970766B1 (en) Entity type assignment
CN111445028A (en) AI-driven transaction management system
CN110458324B (en) Method and device for calculating risk probability and computer equipment
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CA2956627A1 (en) System and engine for seeded clustering of news events
CN111582932A (en) Inter-scene information pushing method and device, computer equipment and storage medium
CN111026870A (en) ICT system fault analysis method integrating text classification and image recognition
US9594757B2 (en) Document management system, document management method, and document management program
CN115794798B (en) Market supervision informatization standard management and dynamic maintenance system and method
CN110597796B (en) Big data real-time modeling method and system based on full life cycle
Jeyaraman et al. Practical Machine Learning with R: Define, build, and evaluate machine learning models for real-world applications
CN111985578A (en) Multi-source data fusion method and device, computer equipment and storage medium
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN111125204A (en) Analysis report obtaining method and device, electronic equipment and storage medium
Jabeen et al. Divided we stand out! forging cohorts for numeric outlier detection in large scale knowledge graphs (conod)
CN115495587A (en) Alarm analysis method and device based on knowledge graph
Roelands et al. Classifying businesses by economic activity using web-based text mining
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
CN110737749B (en) Entrepreneurship plan evaluation method, entrepreneurship plan evaluation device, computer equipment and storage medium
CN112818215A (en) Product data processing method, device, equipment and storage medium
CN113256383A (en) Recommendation method and device for insurance products, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination