CN116109201A

CN116109201A - Data processing method and related device

Info

Publication number: CN116109201A
Application number: CN202310185560.8A
Authority: CN
Inventors: 王国旭
Original assignee: Beijing Shangzhi Compliance Technology Co ltd
Current assignee: Beijing Shangzhi Compliance Technology Co ltd
Priority date: 2023-03-01
Filing date: 2023-03-01
Publication date: 2023-05-12

Abstract

The application discloses a data processing method and a related device. A data analysis request comprising a target user identifier is acquired, and n dependent variable indexes corresponding to the target user identifier are determined according to the correspondence between user identifiers and dependent variable indexes. For a target dependent variable index, i undetermined independent variable indexes, a target dependent variable data set corresponding to the target dependent variable index, and i undetermined independent variable data sets corresponding to the i undetermined independent variable indexes are acquired. According to the correlation between the target dependent variable data set and the i undetermined independent variable data sets, j undetermined independent variable indexes are determined from the i undetermined independent variable indexes; that is, from the i indexes that may be related to the target dependent variable index, the j indexes that are more strongly correlated with it are selected, so that irrelevant indexes are screened out and an adjustment strategy for the target dependent variable index can be determined in a targeted manner according to the j undetermined independent variable indexes.

Description

Data processing method and related device

Technical Field

The invention relates to the technical field of big data processing, in particular to a data processing method and a related device.

Background

Quality management runs through the entire life cycle of a drug, and pharmaceutical enterprises are required to continuously review and analyze quality problems arising during drug production in order to ensure that drug quality is reliable.

In the related art, when a pharmaceutical enterprise needs to review quality, the quality problems of the preceding half year are summarized and classified, medical experts attribute the problems to causes, and the enterprise's production in the following half year is adjusted according to those causes.

However, this approach depends heavily on the expertise of the medical experts and therefore generalizes poorly.

Disclosure of Invention

In view of the above problems, the application provides a data processing method and a related device that require no manual involvement and broaden the range of application while maintaining the accuracy of analysis.

Based on this, the embodiment of the application discloses the following technical scheme:

in one aspect, an embodiment of the present application provides a data processing method, where the method includes:

acquiring a data analysis request, wherein the data analysis request comprises a target user identifier;

acquiring, according to the correspondence between user identifiers and dependent variable indexes, n dependent variable indexes corresponding to the target user identifier, wherein n is an integer greater than 1;

acquiring i undetermined independent variable indexes corresponding to a target dependent variable index, wherein the target dependent variable index is one of the n dependent variable indexes, and i is an integer greater than 1;

acquiring a target dependent variable data set corresponding to the target dependent variable index, and acquiring i undetermined independent variable data sets respectively corresponding to the i undetermined independent variable indexes;

determining j undetermined independent variable indexes from the i undetermined independent variable indexes according to the correlation between the target dependent variable data set and the i undetermined independent variable data sets, wherein the correlation between each of the j undetermined independent variable data sets corresponding to the j undetermined independent variable indexes and the target dependent variable data set is greater than a correlation threshold, and j is a positive integer less than or equal to i;

and determining an adjustment strategy for the target dependent variable index according to the j undetermined independent variable indexes.

Optionally, the determining an adjustment strategy for the target dependent variable index according to the j undetermined independent variable indexes includes:

acquiring historical independent variable data corresponding to the j undetermined independent variable indexes and historical dependent variable data corresponding to the target dependent variable index;

predicting the change trend of the target dependent variable index according to the historical independent variable data and the historical dependent variable data;

and determining an adjustment strategy for the target dependent variable index according to the change trend.

Optionally, the determining an adjustment strategy for the target dependent variable index according to the j undetermined independent variable indexes includes:

determining k independent variable index combinations according to the j undetermined independent variable indexes, wherein each independent variable index combination comprises at least one of the j undetermined independent variable indexes, and k is a positive integer greater than 1;

establishing an association relation between each of the k independent variable index combinations and the target dependent variable index, to obtain the goodness of fit and the significance level respectively corresponding to the k association relations;

determining an optimal independent variable index combination from the k independent variable index combinations according to the k goodness-of-fit values and significance levels, wherein the goodness of fit of the optimal independent variable index combination meets a preset goodness condition;

and determining an adjustment strategy for the target dependent variable index according to the undetermined independent variable indexes included in the optimal independent variable index combination.
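
The combination-based selection described above can be illustrated with a minimal sketch. It assumes that the association relation is an ordinary least squares fit with an intercept and that the goodness of fit is the coefficient of determination R^2; the application specifies neither. The significance level is not computed here. The function names, the index names, and the 0.8 threshold are hypothetical.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: indexes may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(columns, y):
    """Fit y on the given columns plus an intercept and return R^2."""
    rows = [[1.0] + [col[t] for col in columns] for t in range(len(y))]
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y)
    if ss_tot == 0:
        raise ValueError("dependent variable data is constant")
    return 1.0 - ss_res / ss_tot


def best_combination(data, y, min_r2=0.8):
    """Return (index names, R^2) of the best combination, or None.

    Only combinations whose R^2 reaches min_r2 (the assumed preset
    goodness condition) are eligible. Smaller combinations are tried
    first and are kept unless a larger one is clearly better.
    """
    best = None
    for size in range(1, len(data) + 1):
        for names in combinations(sorted(data), size):
            r2 = r_squared([data[name] for name in names], y)
            if r2 >= min_r2 and (best is None or r2 > best[1] + 1e-9):
                best = (names, r2)
    return best
```

With j screened indexes there are at most 2^j - 1 non-empty combinations, so exhaustive enumeration of this kind is practical only for small j.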

Optionally, the obtaining the target dependent variable data set corresponding to the target dependent variable index includes:

acquiring a full dependent variable data set corresponding to the target dependent variable index;

and if it is determined that the full dependent variable data set does not conform to a normal distribution, taking another of the n dependent variable indexes as the target dependent variable index.

Optionally, the acquiring i undetermined independent variable indexes corresponding to the target dependent variable index includes:

acquiring x full independent variable indexes corresponding to the target dependent variable index;

acquiring x full independent variable data sets according to the x full independent variable indexes;

and if it is determined that the x full independent variable data sets include abnormal independent variable data sets that do not conform to a normal distribution, removing the full independent variable indexes corresponding to the abnormal independent variable data sets from the x full independent variable indexes to obtain the i undetermined independent variable indexes.
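
As an illustration of the normality screening above, the following sketch uses the Jarque-Bera statistic, which compares sample skewness and kurtosis with those of a normal distribution. The application does not name a particular normality test, so this choice, the 5% significance level, and the function names are assumptions. The statistic is only approximately chi-square distributed and is unreliable for small samples.

```python
# 95th percentile of the chi-square distribution with 2 degrees of freedom.
CHI2_2DOF_95 = 5.991


def jarque_bera(values):
    """Return the Jarque-Bera statistic n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # Constant data has no spread; treat it as non-normal.
        return float("inf")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


def screen_by_normality(full_sets, critical=CHI2_2DOF_95):
    """Keep the index names whose data sets are not rejected as non-normal."""
    return [
        name
        for name, values in full_sets.items()
        if jarque_bera(values) <= critical
    ]
```

Here `full_sets` maps each of the x full independent variable indexes to its data set, and the returned list plays the role of the i undetermined independent variable indexes.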

Optionally, after the acquiring the i undetermined independent variable data sets, the method further includes:

determining i time distributions respectively corresponding to the i undetermined independent variable data sets;

determining, according to the continuity of the i time distributions, p pieces of third abnormal data with abnormal continuity from the i undetermined independent variable data sets, wherein p is an integer smaller than i and greater than or equal to 0;

determining p undetermined independent variable indexes respectively corresponding to the p pieces of third abnormal data;

removing the undetermined independent variable data sets respectively corresponding to the p undetermined independent variable indexes from the i undetermined independent variable data sets to obtain i-p undetermined independent variable data sets;

the determining j undetermined independent variable indexes from the i undetermined independent variable indexes according to the correlation between the target dependent variable data set and the i undetermined independent variable data sets includes:

determining j undetermined independent variable indexes from the i-p undetermined independent variable indexes according to the correlation between the target dependent variable data set and the i-p undetermined independent variable data sets.
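
The continuity screening above might be sketched as follows. The sketch assumes monthly data and treats a data set as having abnormal continuity when two consecutive observations are more than `max_gap` months apart; the application does not define the continuity criterion, so the criterion, the function names, and the index names are hypothetical.

```python
def month_index(year, month):
    """Map a calendar month to a running integer so that gaps are easy to measure."""
    return year * 12 + (month - 1)


def has_continuity_gap(months, max_gap=1):
    """Return True if two consecutive observed months are more than max_gap apart."""
    ordered = sorted(months)
    return any(later - earlier > max_gap for earlier, later in zip(ordered, ordered[1:]))


def screen_by_continuity(pending_sets, max_gap=1):
    """Split data sets into those kept and the index names removed.

    pending_sets maps an index name to a dict of {month_index: value}.
    """
    kept = {}
    removed = []
    for name, series in pending_sets.items():
        if has_continuity_gap(series.keys(), max_gap):
            removed.append(name)
        else:
            kept[name] = series
    return kept, removed
```

The `removed` list corresponds to the p undetermined independent variable indexes, and `kept` to the remaining i-p data sets that go on to the correlation step.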

Optionally, the method further comprises:

acquiring a user identifier creation request;

acquiring the n dependent variable indexes determined from a dependent variable index set, wherein the dependent variable index set comprises m dependent variable indexes, and m is an integer greater than n;

and establishing the correspondence between the n dependent variable indexes and the target user identifier, and returning a creation result corresponding to the user identifier creation request, wherein the creation result comprises the target user identifier.

In another aspect, the present application provides a data processing apparatus, the apparatus comprising an acquisition unit, a determining unit and an adjusting unit;

The acquisition unit is used for acquiring a data analysis request, wherein the data analysis request comprises a target user identifier;

the acquisition unit is further used for acquiring n dependent variable indexes corresponding to the target user identifier according to the corresponding relation between the user identifier and the dependent variable indexes, wherein n is an integer greater than 1;

the acquisition unit is further used for acquiring i undetermined independent variable indexes corresponding to a target dependent variable index, wherein the target dependent variable index is one dependent variable index in the n dependent variable indexes, and i is an integer greater than 1;

the acquisition unit is further used for acquiring a target dependent variable data set corresponding to the target dependent variable index and acquiring i independent variable data sets corresponding to the i independent variable indexes respectively;

the determining unit is configured to determine j independent variable indexes from the i independent variable indexes according to the correlation between the target dependent variable data set and the i independent variable data sets, where the correlation between j independent variable data sets corresponding to the j independent variable indexes and the target dependent variable data set is greater than a correlation threshold, and j is a positive integer less than or equal to i;

The adjusting unit is used for adjusting the target dependent variable index according to the j independent variable indexes to be determined.

Optionally, the adjusting unit is specifically configured to:

Optionally, the acquiring unit is specifically configured to:

if it is determined that the x full independent variable data sets include abnormal independent variable data sets that do not conform to a normal distribution, removing the full independent variable indexes corresponding to the abnormal independent variable data sets from the x full independent variable indexes to obtain the i undetermined independent variable indexes.

Optionally, the apparatus further comprises a screening unit for:

the determining unit is specifically configured to:

Optionally, the acquiring unit is further configured to:

acquiring a user identifier creation request;

the apparatus further comprises a response unit for:

In another aspect, the present application provides a computer device comprising a processor and a memory:

the memory is used for storing program codes and transmitting the program codes to the processor;

The processor is configured to perform the method of the above aspect according to instructions in the program code.

In another aspect, the present application provides a computer readable storage medium for storing a computer program for performing the method of the above aspect.

In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method described in the above aspect.

The technical scheme has the advantages that:

A data analysis request comprising the target user identifier is acquired, and n dependent variable indexes corresponding to the target user identifier are determined according to the pre-established correspondence between user identifiers and dependent variable indexes. Taking a target dependent variable index among the n dependent variable indexes as an example, i undetermined independent variable indexes corresponding to the target dependent variable index are acquired, together with the target dependent variable data set corresponding to the target dependent variable index and the i undetermined independent variable data sets corresponding to the i undetermined independent variable indexes. According to the correlation between the target dependent variable data set and the i undetermined independent variable data sets, j undetermined independent variable indexes are determined from the i undetermined independent variable indexes; that is, from the i indexes that may be related to the target dependent variable index, the j indexes that are more strongly correlated with it are selected, so that irrelevant indexes are screened out and an adjustment strategy for the target dependent variable index can be determined in a targeted manner according to the j undetermined independent variable indexes. Thus, by calculating the correlation between independent variable indexes and dependent variable indexes, the undetermined independent variable indexes that are more strongly correlated with the target dependent variable index are selected from the multiple undetermined independent variable indexes, which achieves automatic attribution for each dependent variable index and, in turn, targeted adjustment. The method requires no manual involvement and broadens the range of application while maintaining the accuracy of analysis.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a data processing method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 3 is a block diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The data processing method provided by the application can be applied to data processing equipment with data processing capability, such as terminal equipment and a server. The terminal device may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent household appliance, etc., but is not limited thereto; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.

A data processing method provided in an embodiment of the present application is described below with reference to fig. 1. Referring to fig. 1, the flowchart of a data processing method according to an embodiment of the present application may include S101-S106.

S101: Acquire a data analysis request.

The data analysis requests corresponding to different pharmaceutical enterprises may be different. For example, an A pharmaceutical enterprise produces an A drug, a B pharmaceutical enterprise produces a B drug, the production process flows between the A drug and the B drug differ, and the quality management of the A drug by the A pharmaceutical enterprise may differ from the quality management of the B drug by the B pharmaceutical enterprise. Therefore, different user identifications can be established for different pharmaceutical enterprises, so that the data processing equipment can distinguish the sender of the data analysis request through the user identifications, and further the data analysis requests respectively corresponding to the different pharmaceutical enterprises can be defined.

The data analysis request comprises a target user identifier, wherein the target user identifier is one user identifier in a plurality of user identifiers, and the user identifier is used for distinguishing a sender of the data analysis request.

The embodiment of the application does not specifically limit the user identifier, and a person skilled in the art can set it according to actual needs. For example, the user identifier may be an enterprise account that the pharmaceutical enterprise creates on the data processing device; the pharmaceutical enterprise can log in to a data processing platform provided by the data processing device through the enterprise account and send a data analysis request to the data processing device.

As one possible implementation, each pharmaceutical enterprise may have a primary enterprise account, and then multiple secondary user accounts may be created under the primary enterprise account, so as to distribute to staff at different nodes in the pharmaceutical production process flow, so as to ensure data security.

S102: Acquire n dependent variable indexes corresponding to the target user identifier according to the correspondence between user identifiers and dependent variable indexes.

n is an integer greater than 1. A dependent variable index is an index on which a pharmaceutical enterprise needs to perform quality management, such as complaints, primary qualification rate, and deviation repetition rate. A complaint is a user product complaint, that is, any defect reported in writing, electronically, or orally that relates to the composition, quality, durability, reliability, safety, effectiveness, or appearance of a marketed product. The primary qualification rate is the percentage of batches that require neither rework nor reprocessing. The deviation repetition rate is the number of repeated deviations divided by the total number of deviations, and a repetition is always counted in the year in which the deviation first occurred. For example, if a deviation first occurs in 2017 and occurs a second time in 2018, the number of deviations is increased by 1 in both 2017 and 2018, but the repeated deviation is counted only toward 2017. Furthermore, a deviation is counted as a repeat only if it recurs within 12 months of the first occurrence; otherwise it is treated as a new deviation.

In order to help pharmaceutical enterprises to quickly perform quality management, the embodiment of the application pre-establishes a corresponding relationship between user identifiers and dependent variable indexes. Through the corresponding relation between the user identification and the dependent variable indexes, n dependent variable indexes corresponding to the target user identification can be obtained.

As a possible implementation manner, the correspondence between the user identifier and the dependent variable index may be established in the process of creating the user identifier. See in particular S201-S203:

S201: Acquire a user identifier creation request.

In practical application, pharmaceutical enterprises can log in a data processing platform provided by the data processing equipment, and corresponding user identifications are created on the data processing platform. For example, by clicking on the "create account" function, a user identification creation request is sent to the data processing device.

S202: Acquire n dependent variable indexes determined from a dependent variable index set.

After the pharmaceutical enterprise sends a user identification creation request to the data processing equipment through the terminal equipment, the terminal equipment displays a dependent variable index set to the pharmaceutical enterprise, wherein the dependent variable index set comprises m dependent variable indexes, the pharmaceutical enterprise can select n dependent variable indexes to be analyzed from the dependent variable index set in a clicking mode and the like, and the n dependent variable indexes are sent to the data processing equipment.

n is an integer greater than or equal to 1, and m is an integer greater than or equal to n.

S203: Establish the correspondence between the n dependent variable indexes and the target user identifier, and return a creation result corresponding to the user identifier creation request.

The user identifier created by the data processing device for the pharmaceutical enterprise is a target user identifier, and a creation result comprising the target user identifier is returned to the terminal device. Meanwhile, as the pharmaceutical enterprise selects n dependent variable indexes, the data processing equipment can establish the corresponding relation between the target user identification and the n dependent variable indexes, so that after the data analysis request is acquired, the n dependent variable indexes corresponding to the pharmaceutical enterprise, namely the data analysis requirement of the pharmaceutical enterprise, are determined according to the target user identification carried in the data analysis request.

Therefore, because the correspondence between user identifiers and dependent variable indexes is established in advance, a pharmaceutical enterprise that logs in with its user identifier does not need to submit its data analysis requirements again; the data processing device can determine the data analysis requirements of the pharmaceutical enterprise from the user identifier, which improves both the speed of data analysis and the user experience of the pharmaceutical enterprise.
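
Steps S201 to S203 and the later lookup in S102 can be sketched as a small in-memory registry. This is only an illustration: the class name, the method names, the use of a random hexadecimal string as the user identifier, and the in-memory dictionary are assumptions, and the three entries of the index set are the examples named in this description rather than a complete set.

```python
import uuid

# Example dependent variable index set (the m indexes offered for selection).
DEPENDENT_INDEX_SET = (
    "complaints",
    "primary_qualification_rate",
    "deviation_repetition_rate",
)


class IndexRegistry:
    """Keeps the correspondence between user identifiers and dependent variable indexes."""

    def __init__(self, index_set=DEPENDENT_INDEX_SET):
        self._index_set = tuple(index_set)
        self._by_user = {}

    def create_user(self, selected):
        """Handle a creation request: validate the selection and return the new identifier."""
        if not selected:
            raise ValueError("at least one dependent variable index must be selected")
        unknown = [name for name in selected if name not in self._index_set]
        if unknown:
            raise ValueError(f"unknown dependent variable indexes: {unknown}")
        user_id = uuid.uuid4().hex
        self._by_user[user_id] = tuple(selected)
        return {"user_id": user_id}

    def indexes_for(self, user_id):
        """Look up the n dependent variable indexes for a data analysis request."""
        return self._by_user[user_id]
```

A data analysis request then only needs to carry the identifier returned by `create_user`; the indexes to analyze are recovered with `indexes_for`.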

It should be noted that, the pharmaceutical enterprise can subsequently change the data analysis requirement, so as to change the corresponding relationship between the user identifier and the dependent variable index, thereby meeting the requirement of the pharmaceutical enterprise on the change of the production process flow.

S103: Acquire i undetermined independent variable indexes corresponding to the target dependent variable index.

Since different dependent variable indices may have an association relationship with different independent variable indices, analysis is required for each of the n dependent variable indices, and for convenience of explanation, a target dependent variable index will be described below as an example. Wherein the target dependent variable index is one of n dependent variable indexes, and i is an integer greater than 1.

An undetermined independent variable index is an independent variable index that may be associated with the target dependent variable index, and an independent variable index is an index that may affect a dependent variable index, such as changes and deviations. A change is any change to a chemical drug approved for marketing in aspects such as production, quality control, and conditions of use, involving sources, methods, control conditions, and the like. Such changes may affect the safety, effectiveness, and quality control of the drug. A deviation is any situation that departs from an approved procedure (instruction) or standard. "Standard" here refers to the various technical standards that a pharmaceutical enterprise establishes to achieve drug quality, including but not limited to analytical testing standards for materials. A technical standard may take various document forms: it may be part of a procedure document, a stand-alone technical standard document, or embodied in a controlled template or another suitable form. "Procedure (instruction)" here refers to the procedure documents for "production" activities in the broad sense; deviations from procedures of the "non-production" class (such as warehouse procedures and laboratory procedures) may equally have an adverse effect on product quality.

As a possible implementation, all the independent variable indexes required by the pharmaceutical enterprise can be used as the undetermined independent variable indexes; alternatively, the independent variable indexes that are unrelated to the target dependent variable index can first be removed from all the independent variable indexes, and the remaining ones used as the undetermined independent variable indexes.

S104: Acquire a target dependent variable data set corresponding to the target dependent variable index, and acquire i undetermined independent variable data sets respectively corresponding to the i undetermined independent variable indexes.

After the target dependent variable index and the undetermined independent variable indexes are determined, at least one piece of data corresponding to the target dependent variable index is acquired, namely the target dependent variable data set. At least one piece of data corresponding to each undetermined independent variable index is also acquired, namely an undetermined independent variable data set. In this way, i undetermined independent variable data sets are obtained for the i undetermined independent variable indexes.

It should be noted that, the target dependent variable data set and the i pending independent variable data sets may be data uploaded by staff of each node in a production process flow of the pharmaceutical enterprise. As one possible implementation manner, the staff of each node has a user account under the enterprise account of the pharmaceutical enterprise, and after logging in the data processing platform through the user account, the data processing platform displays the independent variable index and the dependent variable index of the data to be uploaded to the user. The data processing platform stores the data uploaded by the user in the data processing device for subsequent analysis and retrieval.

S105: Determine j undetermined independent variable indexes from the i undetermined independent variable indexes according to the correlation between the target dependent variable data set and the i undetermined independent variable data sets.

The method for calculating the correlation between the target dependent variable data set and the data in the i undetermined independent variable data sets is not specifically limited, and can be chosen by a person skilled in the art according to actual needs. For example, the Pearson correlation coefficient method may be used. In statistics, the Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient (PPMCC or PCC for short), measures the linear correlation between two variables X and Y, and its value lies between -1 and 1.
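
For reference, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of the calculation follows; the function name is illustrative, and the application does not prescribe any particular implementation.

```python
import math


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("both series need the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.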

After the correlation between the target dependent variable data set and each of the i undetermined independent variable data sets is calculated, i correlation coefficients are obtained, and the j correlation coefficients greater than a correlation threshold are determined from the i correlation coefficients. The j undetermined independent variable indexes corresponding to these j correlation coefficients are correlated with the target dependent variable index; in other words, the correlation between each of the j corresponding undetermined independent variable data sets and the target dependent variable data set is greater than the correlation threshold. j is a positive integer less than or equal to i.

From the i undetermined independent variable indexes possibly having association relation with the target dependent variable index, j undetermined independent variable indexes with higher correlation with the target dependent variable index are determined through the correlation.
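
The threshold screening in S105 can be sketched as below. By default the sketch compares the absolute value of each coefficient with the threshold, so that strongly negative correlations (such as more deviations going with a lower qualification rate) are also retained; this is an assumption, since the text only says "greater than a correlation threshold". The function name, the index names, the coefficient values, and the 0.5 default are hypothetical.

```python
def select_by_threshold(coefficients, threshold=0.5, use_absolute=True):
    """Return the index names whose correlation strength exceeds the threshold.

    coefficients maps each undetermined independent variable index to its
    correlation coefficient with the target dependent variable data set.
    """
    selected = []
    for name, r in coefficients.items():
        strength = abs(r) if use_absolute else r
        if strength > threshold:
            selected.append(name)
    return selected
```

For example, with coefficients of -0.82, 0.61 and 0.12 and a threshold of 0.5, the first two indexes are kept when absolute values are used, and only the second when the signed value is compared literally.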

S106: Determine an adjustment strategy for the target dependent variable index according to the j undetermined independent variable indexes.

The j undetermined independent variable indexes screened through the correlation have an association relation with the target dependent variable index, so that the target dependent variable index can be adjusted according to the j undetermined independent variable indexes.

As one possible implementation, the adjustment strategy may be presented to the pharmaceutical enterprise in the form of a report. The report may include index quantification results, dependent variable index analysis results, and the like. The index quantification result presents the independent or dependent variable indexes themselves, for example the month in the historical data with the largest total number of changes, or the distribution of each change category. The dependent variable index analysis result presents the relationship between each dependent variable index and the independent variable indexes most strongly correlated with it, for example: the correlation (such as the correlation coefficient r and the significance level p) between the primary qualification rate and independent variable indexes such as the total number of changes and the total number of deviations; a finding that each additional production process deviation at a factory lowers the average primary qualification rate by 2.1%, with a recommendation that the pharmaceutical enterprise reduce production process deviations in order to raise the average primary qualification rate; and predicted values of the primary qualification rate for the next 5 months.

According to the above technical solution, a data analysis request comprising the target user identifier is acquired, and n dependent variable indexes corresponding to the target user identifier are determined according to the pre-established correspondence between user identifiers and dependent variable indexes. Taking a target dependent variable index among the n dependent variable indexes as an example, i undetermined independent variable indexes corresponding to the target dependent variable index are acquired, together with the target dependent variable data set corresponding to the target dependent variable index and the i undetermined independent variable data sets corresponding to the i undetermined independent variable indexes. According to the correlation between the target dependent variable data set and the i undetermined independent variable data sets, j undetermined independent variable indexes are determined from the i undetermined independent variable indexes; that is, from the i indexes that may be related to the target dependent variable index, the j indexes that are more strongly correlated with it are selected, so that irrelevant indexes are screened out and an adjustment strategy for the target dependent variable index can be determined in a targeted manner according to the j undetermined independent variable indexes. Thus, by calculating the correlation between independent variable indexes and dependent variable indexes, the undetermined independent variable indexes that are more strongly correlated with the target dependent variable index are selected from the multiple undetermined independent variable indexes, which achieves automatic attribution for each dependent variable index and, in turn, targeted adjustment. The method requires no manual involvement and broadens the range of application while maintaining the accuracy of analysis.

The embodiments of the present application do not specifically limit the manner of determining the adjustment strategy for the target dependent variable index. Two manners are described below as examples.

Mode one: data prediction.

S301: Acquire historical independent variable data corresponding to the j pending independent variable indexes and historical dependent variable data corresponding to the target dependent variable index.

S302: Predict the change trend of the target dependent variable index according to the historical independent variable data and the historical dependent variable data.

For example, a machine learning model is trained with the historical independent variable data as its input and the historical dependent variable data as its output, so that the model learns the association relationship between the j pending independent variable indexes and the target dependent variable index and can then predict the change trend of the target dependent variable index.

It should be noted that the change trend can represent the association relationship between the j pending independent variable indexes and the target dependent variable index not only over a past period of time but also over a future period of time.

S303: Determine an adjustment strategy for the target dependent variable index according to the change trend.

Thus, predicting the change trend of the target dependent variable index from the association relationship between the j pending independent variable indexes and the target dependent variable index can help the pharmaceutical enterprise formulate a corresponding adjustment strategy. For example, if each additional production process deviation at a factory reduces the average first-pass yield by 2.1%, the pharmaceutical enterprise is recommended to reduce production process deviations so as to improve the average first-pass yield.
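As a rough illustration of mode one, the sketch below fits a simple least-squares line to historical data and uses it for prediction. Ordinary linear regression with a single independent variable index is only a stand-in for the unspecified machine learning model, and the data are synthetic values constructed to reproduce the 2.1% figure from the example; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted value of the dependent variable index for a given x."""
    return intercept + slope * x


# Synthetic history: deviations per month and first-pass yield in percent.
deviation_count = [2, 4, 5, 7, 9]
first_pass_yield = [95.8, 91.6, 89.5, 85.3, 81.1]

slope, intercept = fit_line(deviation_count, first_pass_yield)
print(f"yield changes by {slope:.1f} points per additional deviation")
print(f"predicted yield at 10 deviations: {predict(slope, intercept, 10):.1f}")
```

A negative slope of this kind is what would support the recommendation to reduce production process deviations.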

Mode two: data attribution.

S401: Determine k independent variable index combinations according to the j pending independent variable indexes.

Each independent variable index combination includes at least one of the j pending independent variable indexes, and k is a positive integer greater than 1. For example, if the j pending independent variable indexes are A, B and C, there may be 7 independent variable index combinations: A; B; C; A and B; A and C; B and C; and A, B and C.
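The enumeration in the example can be reproduced with a short sketch. It assumes that every non-empty subset of the j pending independent variable indexes counts as a combination, which gives 2^j - 1 combinations; the function name is hypothetical.

```python
from itertools import combinations


def index_combinations(indexes):
    """Return every non-empty combination of the given indexes."""
    combos = []
    for size in range(1, len(indexes) + 1):
        combos.extend(combinations(indexes, size))
    return combos


combos = index_combinations(["A", "B", "C"])
for combo in combos:
    print(" and ".join(combo))
print(len(combos))  # 7
```

Because the number of combinations grows exponentially with j, exhaustive enumeration is practical only when j is small.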

S402: Establish an association relationship between each of the k independent variable index combinations and the target dependent variable index, and obtain the goodness of fit and the significance corresponding to each of the k association relationships.

Although it is known that the j pending independent variable indexes may affect the target dependent variable index, it is not known how strongly each of them affects it, or whether some of the j pending independent variable indexes affect it only indirectly. The goodness of fit corresponding to each of the k association relationships therefore makes clear how strongly each independent variable index combination affects the target dependent variable index.

S403: Determine an optimal independent variable index combination from the k independent variable index combinations according to the k goodness-of-fit values and the k significance values.

The goodness of fit of the optimal independent variable index combination satisfies a preset goodness condition, and its significance satisfies a preset significance condition. The embodiments of the present application do not specifically limit the preset goodness condition, which a person skilled in the art may set according to actual needs. For example, the k independent variable index combinations are sorted in descending order of goodness of fit, and the top-ranked combination is taken as the optimal independent variable index combination.

The goodness of fit refers to how well the regression line fits the observed values. The statistic that measures goodness of fit is the coefficient of determination R². The maximum value of R² is 1. The closer R² is to 1, the better the regression line fits the observed values; conversely, the smaller R² is, the worse the regression line fits the observed values. Obtaining the goodness of fit of each association relationship therefore makes it possible to determine how well the pending independent variable indexes fit the dependent variable index in each association relationship, and thus to determine the optimal independent variable index combination from the multiple independent variable index combinations.
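For reference, the coefficient of determination is commonly defined as shown below. The symbols are not taken from the source and are introduced here only for illustration: y_i denotes the observed values of the target dependent variable index, \hat{y}_i the values fitted by the regression, \bar{y} the mean of the observed values, and n the number of observations.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

When the fitted values match the observations exactly, the numerator is zero and R² equals 1, which is consistent with the maximum value stated above.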

The embodiments of the present application do not specifically limit the preset significance condition, which may be, for example, a significance level greater than or equal to 0.01. The preset significance condition is used to remove independent variable indexes that only indirectly affect the target dependent variable index. For example, assume that the pending independent variable indexes include x1 and x2. When an association relationship is established between the target dependent variable index y and the pending independent variable indexes x1 and x2, and the coefficient corresponding to each pending independent variable index in the association relationship is determined, it may be found that neither coefficient is significant (that is, it cannot be shown that x1 or x2 is related to y). The reason is that x1 and x2 are very strongly correlated with each other. Thus, independent variable indexes that indirectly affect the target dependent variable index can be removed from the j pending independent variable indexes according to significance.

S404: Determine an adjustment strategy for the target dependent variable index according to the pending independent variable indexes included in the optimal independent variable index combination.

After the optimal independent variable index combination is determined, the degree to which each pending independent variable index in the combination affects the target dependent variable index is known, that is, the target dependent variable index is described quantitatively. The adjustment strategy for the target dependent variable index can then be determined according to this quantitative description.

As a possible implementation, the j pending independent variable indexes may be further screened by stepwise regression, so as to establish a functional relationship for the target dependent variable index, that is, a quantitative description of the target dependent variable index. Stepwise regression introduces the pending independent variable indexes into the equation one at a time (the equation being the one that corresponds to the association relationship between the target dependent variable index and the pending independent variable indexes introduced so far). After each pending independent variable index is introduced, every pending independent variable index already in the equation is tested to see whether it has become "insignificant"; if so, it is removed from the equation, which ensures that the equation contains only "significant" pending independent variable indexes before the next one is introduced. This is repeated until no pending independent variable index needs to be introduced into the equation and none needs to be removed from it. The optimal regression equation is thereby obtained, and the pending independent variable indexes it contains are taken as the optimal independent variable index combination.
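The stepwise procedure described above can be sketched in plain Python as follows. Several choices are assumptions that do not come from the source: significance is judged with a partial F statistic compared against fixed, hypothetical thresholds (`f_enter`, `f_remove`) instead of a p-value, the regression is ordinary least squares with an intercept, and the data are synthetic. In the data, `rework_count` is strongly correlated with `deviation_count` and therefore affects the yield only indirectly.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def rss(data, y, names):
    """Residual sum of squares of an OLS fit of y on the named columns plus an intercept."""
    rows = [[1.0] + [data[k][i] for k in names] for i in range(len(y))]
    p = len(names) + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def partial_f(data, y, base, extra):
    """Partial F statistic of `extra` given that the indexes in `base` are in the equation."""
    full = base + [extra]
    rss_full = rss(data, y, full)
    dof = len(y) - len(full) - 1
    return (rss(data, y, base) - rss_full) / (rss_full / dof)


def stepwise_select(data, y, f_enter=4.0, f_remove=3.9):
    """Introduce the best remaining index, then drop any that became insignificant."""
    selected = []
    for _ in range(2 * len(data)):  # hard cap on the number of passes
        changed = False
        remaining = [k for k in data if k not in selected]
        if remaining:
            scores = {k: partial_f(data, y, selected, k) for k in remaining}
            best = max(scores, key=scores.get)
            if scores[best] > f_enter:
                selected.append(best)
                changed = True
        for k in list(selected):
            others = [s for s in selected if s != k]
            if partial_f(data, y, others, k) < f_remove:
                selected.remove(k)
                changed = True
        if not changed:
            break
    return selected


# Synthetic data: yield depends on deviation_count only.
data = {
    "deviation_count": [1, 2, 3, 4, 5, 6, 7, 8],
    "rework_count": [1.7, 0.7, 2.7, 4.9, 5.9, 5.7, 5.7, 8.7],
    "ambient_humidity": [33, 45, 47, 43, 37, 33, 35, 47],
}
first_pass_yield = [97.7, 94.1, 90.7, 87.5, 84.5, 81.7, 79.1, 76.7]
optimal = stepwise_select(data, first_pass_yield)
print(optimal)
```

On its own, `rework_count` would pass the entry threshold, but once `deviation_count` is in the equation it explains nothing further and is left out, which mirrors the removal of indirectly acting indexes.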

It should be noted that the data attribution process further analyzes the pending independent variable indexes that are correlated with the target dependent variable index. Although these pending independent variable indexes are correlated with the target dependent variable index, some of them affect it only by affecting other pending independent variable indexes; that is, among the pending independent variable indexes correlated with the target dependent variable index, some affect it only indirectly. To determine more clearly which pending independent variable indexes directly affect the target dependent variable index, and to what degree, data attribution may be further performed in the manner of S401-S404 to find the pending independent variable indexes that directly affect the target dependent variable index and then formulate a targeted adjustment strategy, thereby guiding the current production process.

In the data prediction process, the pending independent variable indexes that are correlated with the target dependent variable index are used directly, rather than the pending independent variable indexes determined by data attribution to directly affect it. This is because the purpose of the data prediction process is to improve the accuracy of predicting the target dependent variable index, and thereby to guide the future production process.

As a possible implementation, the embodiments of the present application further provide a specific implementation of S104, namely a specific implementation of acquiring the target dependent variable data set corresponding to the target dependent variable index; see A1-A2:

A1: Acquire a full dependent variable data set corresponding to the target dependent variable index.

The full dependent variable data set is the data set corresponding to the target dependent variable index and includes at least one piece of data corresponding to that index. It contains all of the data corresponding to the target dependent variable index and may therefore include some abnormal data, such as incorrectly entered data, falsified data, and the like.

A2: If it is determined that the full dependent variable data set does not conform to a normal distribution, take another dependent variable index among the n dependent variable indexes as the target dependent variable index.

According to the central limit theorem, the data should conform to a normal distribution, so first abnormal data in the full dependent variable data set that does not conform to a normal distribution may be abnormal data caused by entry errors, falsification, and the like. Filtering the data according to whether it conforms to a normal distribution therefore yields a target dependent variable data set that is closer to the real data and can be used for subsequent analysis.

If the data included in the full dependent variable data set contains data that does not conform to a normal distribution, the reliability of the full dependent variable data set is low, and the target dependent variable index is no longer studied on the basis of that data set unless a reliable full dependent variable data set is obtained. Another dependent variable index among the n dependent variable indexes (that is, one other than the dependent variable index currently serving as the target dependent variable index) is then taken as the target dependent variable index, and the analysis continues.

As a possible implementation, the embodiments of the present application further provide a specific implementation of S104, namely a specific implementation of acquiring the i pending independent variable indexes corresponding to the target dependent variable index; see B1-B3:

B1: Acquire x full independent variable indexes corresponding to the target dependent variable index.

x is an integer greater than i.

B2: Acquire x full independent variable data sets according to the x full independent variable indexes.

A full independent variable data set contains all of the data corresponding to an independent variable index and may include some abnormal data, such as incorrectly entered data, falsified data, and the like.

Similarly, second abnormal data in each full independent variable data set can be determined by checking whether the data in the i full independent variable data sets conforms to a normal distribution.

B3: If it is determined that the x full independent variable data sets include an abnormal independent variable data set that does not conform to a normal distribution, remove the full independent variable index corresponding to the abnormal independent variable data set from the x full independent variable indexes to obtain the i pending independent variable indexes.

If the full independent variable data sets include an abnormal independent variable data set that does not conform to a normal distribution, that data set is not trustworthy and no further analysis is based on it. The full independent variable index corresponding to the abnormal independent variable data set is therefore removed from the x full independent variable indexes to obtain the i pending independent variable indexes.
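The screening in B1-B3 can be sketched as follows. The source does not name a normality test, so the Jarque-Bera test is assumed here purely for illustration, with a hypothetical significance level of 0.05. The p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom and is only approximate for small samples. The index names and data are made up.

```python
import math


def jarque_bera_p_value(values):
    """Approximate p-value of the Jarque-Bera normality test."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data: treated here as not normally distributed
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # The survival function of a chi-square variable with 2 degrees of
    # freedom is exp(-x / 2).
    return math.exp(-jb / 2)


def screen_by_normality(datasets, alpha=0.05):
    """Keep the indexes whose data sets are not rejected as non-normal."""
    kept = []
    for name, values in datasets.items():
        if jarque_bera_p_value(values) >= alpha:
            kept.append(name)
    return kept


# x = 2 full independent variable indexes with made-up monthly data.
datasets = {
    "change_count": [94, 96, 98, 99, 100, 100, 100, 100, 101, 102, 104, 106],
    "deviation_count": [10, 10, 9, 10, 11, 10, 10, 9, 11, 10, 10, 25],
}
pending_indexes = screen_by_normality(datasets)
print(pending_indexes)  # deviation_count is rejected because of the value 25
```

The same check could be applied to the full dependent variable data set in A2 to decide whether to switch to another target dependent variable index.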

S1044: Acquire i full independent variable data sets corresponding to the i pending independent variable indexes respectively.

S1045: Determine, in each of the i full independent variable data sets, second abnormal data that does not conform to a normal distribution.

S1046: Remove the second abnormal data from each of the i full independent variable data sets to obtain i pending independent variable data sets.

Filtering the data according to whether it conforms to a normal distribution thus yields pending independent variable data sets that are closer to the real data and can be used for subsequent analysis.
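The source does not state how individual abnormal data are identified within a data set. One common convention, assumed here only for illustration, is the three-sigma rule: a value is treated as abnormal when it lies more than three sample standard deviations from the mean. The function name, the limit and the data are hypothetical.

```python
import statistics


def remove_three_sigma_outliers(values, z_limit=3.0):
    """Split values into (cleaned, removed) using the three-sigma rule."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return list(values), []  # constant data: nothing to remove
    cleaned = [v for v in values if abs(v - mean) / sd <= z_limit]
    removed = [v for v in values if abs(v - mean) / sd > z_limit]
    return cleaned, removed


# Made-up full independent variable data set with one suspicious entry.
full_data = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1,
             9.9, 10.0, 10.3, 9.7, 10.0, 10.0, 25.0]
pending_data, abnormal_data = remove_three_sigma_outliers(full_data)
print(abnormal_data)  # [25.0]
```

With very small samples a single extreme value inflates the standard deviation enough to stay within three sigma, so a robust measure such as the median absolute deviation may be preferable in practice.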

The data of a pharmaceutical enterprise may contain not only erroneous data but also missing data. To prevent missing data from affecting the adjustment strategy, the embodiments of the present application further provide a manner of removing missing data; see S1047-S1051:

S1047: Determine i time distributions corresponding to the i pending independent variable data sets respectively.

For example, of 30 pending independent variable data sets, 28 were uploaded once a year for 10 consecutive years, and the other 2 were uploaded only 5 times during those 10 years. The time distribution of the 28 pending independent variable data sets is "uploaded in each of 10 consecutive years", and the time distribution of the other 2 is "uploaded 5 times during 10 years".

S1048: According to the continuity of the i time distributions, determine p pieces of third abnormal data with abnormal continuity from the i pending independent variable data sets.

p is an integer greater than or equal to 0 and less than i.

Continuing the foregoing example, if the time distributions of the 28 pending independent variable data sets are continuous and those of the other 2 are not, the 2 pending independent variable data sets are determined to be third abnormal data with abnormal continuity, that is, pending independent variable data sets with missing data.

S1049: Determine p pending independent variable indexes corresponding to the p pieces of third abnormal data respectively.

S1050: Remove the pending independent variable data sets corresponding to the p pending independent variable indexes from the i pending independent variable data sets to obtain i-p pending independent variable data sets.

Pending independent variable data sets whose time distributions are continuous can be compared and analyzed against one another. To prevent pending independent variable data sets with missing data from affecting the subsequent analysis, the corresponding data dimensions, namely the p pending independent variable indexes corresponding to the p pieces of third abnormal data, are removed directly, which ensures the accuracy of the correlation judgment between the remaining i-p pending independent variable indexes and the target dependent variable index.

S1051: Determine the j pending independent variable indexes from the i-p pending independent variable indexes according to the correlation between the target dependent variable data set and the i-p pending independent variable data sets.

Thus, removing the abnormal data and the pending independent variable data sets with missing data ensures the accuracy of the correlation judgment between the remaining i-p pending independent variable indexes and the target dependent variable index, which improves the accuracy of determining the j pending independent variable indexes from the i-p pending independent variable indexes and, in turn, the accuracy of the adjustment strategy. This approach is better suited to pharmaceutical enterprise data of lower authenticity and completeness.
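The continuity check of S1047-S1050 can be sketched as follows. The sketch assumes yearly uploads and treats a time distribution as continuous when every year of a given window has an upload; the window, the index names and the upload years are hypothetical.

```python
def has_continuous_coverage(years, start, end):
    """True if there is an upload for every year from start to end inclusive."""
    return set(range(start, end + 1)) <= set(years)


def drop_discontinuous(upload_years, start, end):
    """Split index names into (kept, dropped) by continuity of their uploads."""
    kept, dropped = [], []
    for name, years in upload_years.items():
        if has_continuous_coverage(years, start, end):
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped


# i = 3 pending independent variable data sets over a 10-year window.
upload_years = {
    "change_count": list(range(2013, 2023)),
    "deviation_count": list(range(2013, 2023)),
    "complaint_count": [2013, 2015, 2017, 2019, 2021],  # only 5 uploads
}
kept, dropped = drop_discontinuous(upload_years, 2013, 2022)
print(kept)     # the i-p indexes that remain
print(dropped)  # the p indexes with missing data
```

Only the indexes in `kept` would then enter the correlation screening of S1051.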

In addition to the data processing method described above, the embodiments of the present application further provide a data processing apparatus which, as shown in fig. 2, includes: an acquisition unit 201, a determination unit 202, and an adjustment unit 203;

the acquisition unit 201 is configured to acquire a data analysis request, where the data analysis request includes a target user identifier;

the acquisition unit 201 is further configured to acquire n dependent variable indexes corresponding to the target user identifier according to a correspondence between user identifiers and dependent variable indexes, where n is an integer greater than 1;

the acquisition unit 201 is further configured to acquire i pending independent variable indexes corresponding to a target dependent variable index, where the target dependent variable index is one of the n dependent variable indexes, and i is an integer greater than 1;

the acquisition unit 201 is further configured to acquire a target dependent variable data set corresponding to the target dependent variable index, and to acquire i pending independent variable data sets corresponding to the i pending independent variable indexes respectively;

the determination unit 202 is configured to determine j pending independent variable indexes from the i pending independent variable indexes according to the correlation between the target dependent variable data set and the i pending independent variable data sets, where the correlation between the target dependent variable data set and each of the j pending independent variable data sets corresponding to the j pending independent variable indexes is greater than a correlation threshold, and j is a positive integer less than or equal to i;

the adjustment unit 203 is configured to adjust the target dependent variable index according to the j pending independent variable indexes.

According to the above technical solution, a data analysis request including a target user identifier is acquired, and n dependent variable indexes corresponding to the target user identifier are determined according to a pre-established correspondence between user identifiers and dependent variable indexes. Taking a target dependent variable index among the n dependent variable indexes as an example, i pending independent variable indexes corresponding to the target dependent variable index, a target dependent variable data set corresponding to the target dependent variable index, and i pending independent variable data sets corresponding to the i pending independent variable indexes are acquired. According to the correlation between the target dependent variable data set and the i pending independent variable data sets, j pending independent variable indexes are determined from the i pending independent variable indexes. That is, from the i pending independent variable indexes that may be related to the target dependent variable index, the j indexes that are more strongly correlated with it are selected, so that irrelevant indexes are screened out and the target dependent variable index can be adjusted in a targeted manner according to the j pending independent variable indexes. Thus, by calculating the correlation between independent variable indexes and dependent variable indexes, the pending independent variable indexes that are strongly correlated with the target dependent variable index are selected from multiple pending independent variable indexes, which achieves automatic attribution for each dependent variable index and, in turn, targeted adjustment. The method requires no manual participation and broadens the range of application while ensuring the accuracy of the analysis.

As a possible implementation manner, the adjustment unit 203 is specifically configured to:

As a possible implementation manner, the acquisition unit 201 is specifically configured to:

As a possible implementation manner, the apparatus further includes a screening unit, configured to:

the determination unit 202 is specifically configured to:

As a possible implementation manner, the acquisition unit 201 is further configured to:

acquiring a user identifier creation request;

the apparatus further comprises a response unit for:

The embodiments of the present application further provide a computer device. Referring to fig. 3, which shows a structural diagram of the computer device provided by the embodiments of the present application, the device includes a processor 310 and a memory 320:

the memory 320 is configured to store program code and transmit the program code to the processor 310;

the processor 310 is configured to execute any of the data processing methods provided in the foregoing embodiments according to instructions in the program code.

An embodiment of the present application provides a computer readable storage medium for storing a computer program for executing any one of the data processing methods provided in the above embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data processing methods provided in the various alternative implementations of the above aspects.

It should be noted that the embodiments in this specification are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the system or apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or plural.

It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of data processing, the method comprising:

2. The method of claim 1, wherein determining an adjustment strategy for the target dependent variable indicator based on the j pending independent variable indicators comprises:

3. The method of claim 1, wherein determining an adjustment strategy for the target dependent variable indicator based on the j pending independent variable indicators comprises:

4. The method of claim 1, wherein the obtaining the set of target dependent variable data corresponding to the target dependent variable indicator comprises:

5. The method of claim 1, wherein the obtaining the i pending independent variable indices corresponding to the target dependent variable index comprises:

6. The method of claim 5, wherein after the obtaining the i sets of pending self-variable data, the method further comprises:

7. The method according to claim 1, wherein the method further comprises:

acquiring a user identifier creation request;

8. A data processing apparatus, the apparatus comprising: the device comprises an acquisition unit, a determination unit and an adjustment unit;

9. A computer device, the device comprising a processor and a memory:

The processor is configured to perform the method of any of claims 1-7 according to instructions in the program code.

10. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a computer program for executing the method of any one of claims 1-7.