CN112635031A - Data volume anomaly detection method and device, storage medium and equipment - Google Patents


Info

Publication number
CN112635031A
Authority
CN
China
Prior art keywords
data
batch
data volume
basic table
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011478233.4A
Other languages
Chinese (zh)
Other versions
CN112635031B (en)
Inventor
许朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyiyun Technology Co ltd
Original Assignee
Beijing Yiyiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyiyun Technology Co ltd filed Critical Beijing Yiyiyun Technology Co ltd
Priority to CN202011478233.4A priority Critical patent/CN112635031B/en
Publication of CN112635031A publication Critical patent/CN112635031A/en
Application granted granted Critical
Publication of CN112635031B publication Critical patent/CN112635031B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention discloses a data volume anomaly detection method, comprising: collecting N batches of sample data for one service type, wherein the duration of each batch is T and the sample data of each batch comprises the data volume of the basic table and the data volumes of all non-basic tables, N being a positive integer and T being greater than zero; for any one non-basic table, calculating the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table; determining the maximum coefficient and the minimum coefficient of the non-basic table according to its coefficients for the N batches; calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient; and detecting whether the data volume of the non-basic table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.

Description

Data volume anomaly detection method and device, storage medium and equipment
Technical Field
The present invention relates to data processing technologies, and in particular, to a method, an apparatus, a storage medium, and a device for detecting an anomaly of a data volume.
Background
In the field of medical data management, a client deployed at the hospital (referred to as the hospital side for short) collects data and uploads it to the system in which the data is managed. The volume of data uploaded by the hospital side therefore needs to be monitored and evaluated for reasonableness, to ensure that the medical data is transmitted stably, without omission or duplication.
Traditional medical data quality control mainly takes three forms: manual experience-based quality control, prediction indexes provided by hospital-side manufacturers, and fully artificial-intelligence-based quality control. However, manual experience-based quality control suffers from high cost, vague standards that cannot be quantified, and low precision; prediction indexes provided by hospital-side manufacturers suffer from low precision; and the fully artificial-intelligence-based approach requires high technical investment and imposes high technical requirements.
Therefore, a data quality control method which is simple and easy to maintain, low in cost and high in quality control precision is needed.
Disclosure of Invention
The invention provides a method and a device for detecting data quantity abnormity, which at least solve the technical problems in the prior art.
One aspect of the present invention provides a method for detecting an anomaly of a data volume, the method being applied to a data system, the data system including at least one service type, each service type having a basic table and at least one non-basic table, the method including:
collecting N batches of sample data for one service type, wherein the duration of each batch is T, and the sample data of each batch comprises the data volume of the basic table and the data volumes of all non-basic tables; N is a positive integer, and T is greater than zero;
for any one non-basic table, calculating the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table;
determining the maximum coefficient and the minimum coefficient of the non-basic table according to its coefficients for the N batches;
calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient;
and detecting whether the data volume of the non-basic table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.
In the basic table and the non-basic table, each table comprises at least one record, and the data volume is the number of records contained in the table;
the basic table is used for recording basic data of a user, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user.
Calculating the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table comprises:
for any batch, the coefficient of the non-basic table for that batch is the ratio of the data volume of the non-basic table in that batch to the data volume of the basic table in that batch.
Calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch comprises:
collecting the data volume of the basic table in the (N+1)th batch;
subtracting the data volume of the basic table in the Nth batch from the data volume of the basic table in the (N+1)th batch to obtain the user increment;
the predicted maximum data volume of the non-basic table in the (N+1)th batch is: data volume of the non-basic table in the Nth batch + user increment * maximum coefficient;
the predicted minimum data volume of the non-basic table in the (N+1)th batch is: data volume of the non-basic table in the Nth batch + user increment * minimum coefficient.
Wherein, the detecting whether the data volume of the non-basic table in the (N + 1) th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume comprises:
if the data volume of the non-basic table in the (N + 1) th batch is larger than or equal to the predicted minimum data volume and smaller than or equal to the predicted maximum data volume, determining that the data volume of the non-basic table in the (N + 1) th batch is normal, otherwise, determining that the data volume of the non-basic table in the (N + 1) th batch is abnormal.
The N batches of sample data collected do not include data that has been detected as abnormal.
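The method described so far can be written compactly as follows. This is only a restatement for readability; the symbols are introduced here for illustration and do not appear in the original text: B_k is the data volume of the basic table in batch k, D_k is the data volume of one non-basic table in batch k, c_k is the coefficient of that table for batch k, and Δu is the user increment.

```latex
\begin{align*}
c_k &= \frac{D_k}{B_k}, \qquad k = 1, \dots, N \\
c_{\max} &= \max_{1 \le k \le N} c_k, \qquad c_{\min} = \min_{1 \le k \le N} c_k \\
\Delta u &= B_{N+1} - B_N \\
\hat{D}^{\max}_{N+1} &= D_N + \Delta u \, c_{\max}, \qquad
\hat{D}^{\min}_{N+1} = D_N + \Delta u \, c_{\min} \\
D_{N+1} \text{ is normal} &\iff \hat{D}^{\min}_{N+1} \le D_{N+1} \le \hat{D}^{\max}_{N+1}
\end{align*}
```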
If the detection result is wrong, the method further comprises the step of adjusting the value of N aiming at the batch to be detected, wherein the step of adjusting the value of N comprises the following steps:
collecting sample data through a time window, wherein the starting length of the time window is M batches, the starting position of the time window is the previous batch of the batches to be detected, and the ending position of the time window is the previous M batches of the batches to be detected; when sample data is collected each time, the starting position of the time window is unchanged, and the ending position of the time window is moved forward by P batches compared with the last collection; the times of collecting the sample data by adopting the time window is preset times;
and calculating the error percentage corresponding to the sample data acquired each time through the time window, and taking the batch number of the sample data corresponding to the error percentage with the minimum absolute value as the value of N.
Calculating the error percentage corresponding to the sample data collected each time through the time window comprises:
for the sample data collected in any one collection, calculating the coefficient of the non-basic table to be detected for each batch in that sample data, and calculating the average coefficient of the non-basic table;
calculating the predicted average data volume of the non-basic table to be detected in the batch to be detected as: data volume of the non-basic table to be detected in the previous batch + user increment * average coefficient;
calculating the error percentage of the non-basic table to be detected for this sample data as: (predicted average data volume / data volume of the non-basic table in the batch to be detected) - 1.
Another aspect of the present invention provides an apparatus for detecting an anomaly of a data volume, the apparatus being applied to a data system, the data system including at least one service type, each service type having a basic table and at least one non-basic table, the apparatus comprising:
the acquisition module is used for acquiring N batches of sample data aiming at one service type, the time length of each batch is T, and the sample data of each batch comprises the data volume of one basic table and the data volume of all non-basic tables;
the calculation module is used for counting the coefficient of each batch corresponding to the non-basic table according to the data quantity of the non-basic table and the data quantity of the basic table aiming at any one non-basic table; according to the coefficients of N batches corresponding to the non-basic table, counting the maximum coefficient and the minimum coefficient corresponding to the non-basic table;
the prediction module is used for calculating the predicted maximum data size and the predicted minimum data size corresponding to the (N + 1) th batch of the non-basic table according to the maximum coefficient and the minimum coefficient;
and the detection module is used for detecting whether the data volume of the non-basic table in the (N + 1) th batch is abnormal or not according to the predicted maximum data volume and the predicted minimum data volume.
Yet another aspect of the present invention provides an apparatus, comprising:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement any of the above-described data volume anomaly detection methods.
A further aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the method for detecting an abnormality in a data volume as described in any one of the above.
In the data volume anomaly detection method described above, detection of the data volume of the current batch depends on the data volumes of the preceding N batches: indexes, namely the predicted maximum data volume and the predicted minimum data volume, are calculated from the data volumes of those N batches and are then used to detect whether the data volume of the current batch is abnormal.
Drawings
FIG. 1 is a flow chart of an anomaly detection method for data volume according to an embodiment of the present invention;
FIG. 2 is a flow chart of an anomaly detection method for data volume according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of an anomaly detection apparatus for detecting data quantity according to an embodiment of the present invention;
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the field of medical data management, the existing data quality inspection methods are roughly as follows:
Manual experience-based quality control: relying on experience gained from long-term tracking, quality control personnel perform targeted quality control on each data table of each medical institution, based on the size and trend of data volume increments, the characteristics of the medical institution (for example, the data volume growth rates of the tables of a general hospital are relatively similar, whereas in a specialized hospital the increments of individual tables differ markedly from those of other tables), and other special data phenomena.
Prediction indexes provided by the hospital-side manufacturer: the hospital-side manufacturer provides a prediction index for the data volume of each table; quality control personnel compare the actual data volume with the prediction index, and any difference exceeding a threshold is marked as "abnormal".
Fully artificial-intelligence-based quality control: deep learning technologies such as neural networks are used to predict the data volume accurately, so that fully automatic quality control can be realized.
However, each of the three methods above has problems. For example:
The manual experience-based quality control approach has the following problems:
1. High quality control cost. Quality control personnel are limited; performing quality control on all key tables of multiple medical institutions takes more time or more personnel. Experience shows that incremental quality control for 20 medical institutions takes 2-3 days, and if data is updated frequently, the demand on quality control personnel is even higher. Moreover, manual experience-based quality inspection depends heavily on the previous batch of data; if the previous batch has problems, quality control loses its "pivot".
2. Vague, unquantified standards. How much data growth separates "normal" from "abnormal" is judged entirely by the quality control personnel themselves, so the approach depends heavily on their experience, and the standard changes whenever personnel are replaced, which leads to inconsistent quality control standards.
3. Low quality control precision. Experience-based quality control is sensitive to the "time interval" of data updates. If a batch is not complete (for example, data is nominally updated every 7 days, but in practice updates are often advanced or delayed because of holidays, technical problems, and the like), the "experience" of the quality control personnel is challenged. Likewise, if a batch includes factors that strongly affect the data volume, such as holidays or disease outbreaks, quality control precision inevitably suffers.
The approach in which the hospital-side manufacturer provides prediction indexes has the following problems:
1. The manufacturer often will not explain, or does not disclose, the principle behind the "index" it provides, so an erroneous index cannot be corrected.
2. The manufacturer faces the same problems when determining the index: whether the index is determined manually or by artificial intelligence, the same cost and precision problems arise. As a result, the "index" provided by a typical manufacturer is often inaccurate and cannot be optimized.
The fully artificial-intelligence-based approach has the following problems:
1. High technical requirements. Medical data is not updated in "equal-interval batches", so a simple "time prediction model" is not sufficient; many influencing factors such as holidays and diseases must be considered, and these factors are difficult to acquire and use.
2. High investment cost. The subject of quality control is each table of each medical institution, which means that each table of each medical institution needs its own supporting model that must later be maintained and updated, so development and maintenance costs are very high.
The embodiment of the invention provides a data volume anomaly detection method that addresses at least the problems encountered in medical data quality inspection described above. It aims to free up labor, improve quality control precision, raise the degree of automation, and reduce technical requirements and technical investment cost, while yielding a solution that can be optimized and ported. The method is of course not limited to quality inspection of medical data; it is also suitable for quality inspection of traffic data and other data with characteristics similar to medical data.
Fig. 1 shows an anomaly detection method for data volume according to an embodiment of the present invention, which is applied to a data system including at least one service type, where each service type has a basic table and at least one non-basic table. The table under each service type is detected independently, and the detection method of the table under each service type is the same, and the method comprises the following steps of:
Step 101, collecting N batches of sample data, wherein the duration of each batch is T, and the sample data of each batch comprises the data volume of the basic table and the data volume of at least one non-basic table; N is a positive integer, and T is greater than zero.
In the basic table and the non-basic table, each table includes at least one record, and in this embodiment, the data amount of each table refers to the number of records included in the table. The basic table is used for recording basic data of a user, taking medical data as an example, if the service type is outpatient service, the basic table can be a registration table, each record in the registration table corresponds to registration information of one patient, namely basic data (including outpatient service number, name, age, identity card, department and the like), and each record corresponds to a unique user identifier, such as an outpatient service number; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user, for example, after the registration of a patient, order information, medicine injection information and the like can be generated, and the information is recorded by the corresponding non-basic table, and each record in the non-basic table is also associated with a unique user identification, namely an outpatient number.
It should be noted that the same user may have different user identifiers under different service types. For example, when the service type is hospitalization, the patient is assigned a hospitalization number, which is different from the outpatient number. The outpatient number identifies data generated by the user as outpatient data, and the hospitalization number identifies data generated by the user as hospitalization data.
In addition, some tables in the data system may record data of several service types at the same time. For example, both outpatient services and inpatient services generate order information, so the order table may contain outpatient order records and inpatient order records together. Therefore, when sample data of one service type is collected in this step, the records in such a table that belong to that service type must be selected to form the table's sample data under that service type. For example, when outpatient sample data is collected, the outpatient order records in the order table are selected to form an outpatient order table for use in subsequent detection.
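The selection of records by service type can be sketched as below. This is a minimal illustration only: the record layout, the `service_type` field name, and the sample identifiers are assumptions made for the example and are not specified in the original text.

```python
def count_records(records, service_type):
    """Count the records of a shared table that belong to one service type.

    `records` is assumed to be a list of dicts with a hypothetical
    "service_type" field; the result is the data volume of the table
    under that service type.
    """
    return sum(1 for record in records if record.get("service_type") == service_type)


# Hypothetical order table mixing outpatient and inpatient records.
order_table = [
    {"user_id": "OP-001", "service_type": "outpatient"},
    {"user_id": "IP-001", "service_type": "inpatient"},
    {"user_id": "OP-002", "service_type": "outpatient"},
]

outpatient_order_volume = count_records(order_table, "outpatient")
print(outpatient_order_volume)
```

Only the count for the selected service type is carried into the later steps, so the inpatient records in the same physical table do not affect outpatient detection.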
In this embodiment, N batches of data may be collected at a time, where each batch covers a duration T. For example, if T is one month and N is 6, then 6 months of data are collected. The sample data comprises the data volume of the basic table and the data volume of at least one non-basic table; if the detection target is one particular non-basic table, it suffices to collect only the data volume of the basic table and the data volume of that non-basic table. For example, the monthly data volume of the registration table and the monthly data volume of the drug injection table are collected for months 1-6.
Step 102, for any one non-basic table, calculating the coefficient of the non-basic table for each batch according to the data volume of the non-basic table and the data volume of the basic table.
For any batch, the coefficient of the non-basic table for that batch is the ratio of the data volume of the non-basic table in that batch to the data volume of the basic table in that batch.
For example, for data volumes collected for months 1-6, the coefficient of the drug injection table for the first batch (month 1) is: data volume of the drug injection table in month 1 / data volume of the registration table in month 1; the coefficient for the second batch (month 2) is: data volume of the drug injection table in month 2 / data volume of the registration table in month 2; and so on, giving the coefficients of the drug injection table for all 6 batches.
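The per-batch coefficient calculation can be sketched as follows. The function name and the list-based inputs are assumptions made for illustration; the check for a non-positive basic table volume is also an added safeguard, not something the original text describes.

```python
def batch_coefficients(base_volumes, non_base_volumes):
    """Return one coefficient per batch: non-basic volume / basic volume.

    Both arguments are lists of record counts ordered by batch.
    """
    if len(base_volumes) != len(non_base_volumes):
        raise ValueError("both tables need one data volume per batch")
    coefficients = []
    for base, non_base in zip(base_volumes, non_base_volumes):
        if base <= 0:
            # Added safeguard: the ratio is undefined for an empty basic table.
            raise ValueError("basic table data volume must be positive")
        coefficients.append(non_base / base)
    return coefficients


# Hypothetical volumes for two batches of a registration table and a drug injection table.
print(batch_coefficients([100, 200], [250, 400]))
```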
Step 103, determining the maximum coefficient and the minimum coefficient of the non-basic table according to its coefficients for the N batches.
For a non-basic table, calculating its coefficient for each batch yields N coefficients, from which the maximum coefficient and the minimum coefficient are determined.
Step 104, calculating the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N+1)th batch according to the maximum coefficient and the minimum coefficient.
In this step, the user increment of the (N+1)th batch relative to the Nth batch is calculated first: the data volume of the basic table in the (N+1)th batch is collected, and the data volume of the basic table in the Nth batch is subtracted from it. Because the basic table records the basic data of users and each record corresponds to a unique user identifier, the number of records in the basic table also represents the number of users, so this difference is the user increment.
After the user increment is obtained, the predicted maximum data volume and the predicted minimum data volume of the (N+1)th batch are calculated:
predicted maximum data volume of the non-basic table in the (N+1)th batch = data volume of the non-basic table in the Nth batch + user increment * maximum coefficient;
predicted minimum data volume of the non-basic table in the (N+1)th batch = data volume of the non-basic table in the Nth batch + user increment * minimum coefficient.
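The two formulas above can be sketched in a few lines. The function and parameter names are hypothetical. The sketch follows the formulas literally and assumes a non-negative user increment; the original text does not say how a negative increment (which would swap the two bounds) should be handled.

```python
def predict_bounds(coefficients, base_prev, base_next, non_base_prev):
    """Return (predicted_min, predicted_max) for the (N+1)th batch.

    coefficients  -- the N per-batch coefficients of the non-basic table
    base_prev     -- basic table data volume in the Nth batch
    base_next     -- basic table data volume in the (N+1)th batch
    non_base_prev -- non-basic table data volume in the Nth batch
    """
    user_increment = base_next - base_prev  # assumed >= 0 in this sketch
    predicted_min = non_base_prev + user_increment * min(coefficients)
    predicted_max = non_base_prev + user_increment * max(coefficients)
    return predicted_min, predicted_max


# Hypothetical values: 100 new users, coefficients between 1.5 and 2.5.
print(predict_bounds([2.0, 2.5, 1.5], 1000, 1100, 2000))
```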
Step 105, detecting whether the data volume of the non-basic table in the (N+1)th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.
In this embodiment, preferably, if the data amount (actually occurring) of the non-base table in the (N + 1) th lot is greater than or equal to the predicted minimum data amount and less than or equal to the predicted maximum data amount, the data amount of the non-base table in the (N + 1) th lot is determined to be normal, otherwise, the data amount of the non-base table in the (N + 1) th lot is determined to be abnormal.
Therefore, in the detection mode, the detection of the data volume of the current batch depends on the data volumes of the previous N batches, the detection process does not need to depend on human experience, does not need indexes provided by manufacturers, and does not have a complex artificial intelligence algorithm, so that the purposes of releasing the labor cost, improving the quality control precision, improving the automation degree, reducing the technical requirements, reducing the technical input cost and the like are achieved.
To reduce the influence of abnormal data on subsequent detection, data that has been detected as abnormal is excluded from the N batches of sample data collected for the next detection.
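One way to realize this exclusion is to keep a set of batches already flagged as abnormal and to take the most recent N batches that are not in it. This is a sketch under that assumption; the function name and the integer batch identifiers are hypothetical, and the original text does not prescribe a particular data structure.

```python
def select_sample_batches(history, abnormal, n):
    """Return the most recent n batches that were not detected as abnormal.

    history  -- batch identifiers ordered from oldest to newest
    abnormal -- set of batch identifiers previously detected as abnormal
    """
    normal_batches = [batch for batch in history if batch not in abnormal]
    return normal_batches[-n:]


# Months 6-11 are available; month 11 was flagged, so months 6-10 are used.
print(select_sample_batches([6, 7, 8, 9, 10, 11], {11}, 5))
```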
The detection method of the present invention may be performed periodically or aperiodically.
The above data detection process of the present invention is illustrated by a specific embodiment.
Take detecting the medical order table of a hospital outpatient service as an example. Assume N is 5 and T is 1 month, and that the data volume of the medical order table for the current month, month 11, is to be checked for abnormality. Then:
1. Collecting sample data
The data volumes of the registration table for the 5 batches of months 6-10 are collected and denoted Bg6, Bg7, Bg8, Bg9 and Bg10; the data volumes of the medical order table (outpatient type) for the 5 batches of months 6-10 are collected and denoted By6, By7, By8, By9 and By10.
2. Calculating the coefficients
The coefficients X6-X10 of the medical order table for the 5 batches are calculated:
X6=By6/Bg6;
X7=By7/Bg7;
X8=By8/Bg8;
X9=By9/Bg9;
X10=By10/Bg10;
assume that the maximum coefficient is X7 and the minimum coefficient is X10.
3. Calculating the predicted maximum data volume and the predicted minimum data volume of the medical order table for month 11
User increment: w = Bg11 - Bg10, where Bg11 is the data volume of the registration table generated in month 11;
Predicted maximum data volume: By11_max = By10 + w * X7;
Predicted minimum data volume: By11_min = By10 + w * X10.
4. Determining whether the data volume By11 of the medical order table for month 11 is abnormal
If By11_min ≤ By11 ≤ By11_max, By11 is determined to be normal; otherwise, By11 is determined to be abnormal.
5. Automatic optimization
When the next detection period arrives, for example when the medical order table for month 12 is to be checked, and By11 has already been detected as abnormal, the sample data collected this time does not include the data of month 11: the registration table data volumes and the medical order table (outpatient type) data volumes of the 5 batches of months 6-10 are collected again, and the above steps are repeated to detect whether the data volume of the medical order table for month 12 is abnormal.
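Steps 1 to 4 of the example can be traced with concrete numbers. All record counts below are invented purely for illustration; they were chosen so that X7 is the maximum coefficient and X10 the minimum, matching the assumption made in the example. The names Bg, By, X and w follow the notation used above.

```python
# Hypothetical monthly record counts (not from the original text).
# Bg: registration table; By: medical order table (outpatient type).
Bg = {6: 1000, 7: 1100, 8: 1200, 9: 1300, 10: 1400, 11: 1500}
By = {6: 2100, 7: 2420, 8: 2520, 9: 2600, 10: 2660, 11: 2860}

sample_months = [6, 7, 8, 9, 10]  # N = 5 batches, T = 1 month

# Step 2: coefficient of the medical order table for each sample month.
X = {month: By[month] / Bg[month] for month in sample_months}
x_max = max(X.values())
x_min = min(X.values())

# Step 3: user increment and predicted bounds for month 11.
w = Bg[11] - Bg[10]
by11_max = By[10] + w * x_max
by11_min = By[10] + w * x_min

# Step 4: month 11 is normal only if its volume lies within the bounds.
is_normal = by11_min <= By[11] <= by11_max
print(f"bounds: [{by11_min:.1f}, {by11_max:.1f}], actual: {By[11]}, normal: {is_normal}")
```

With these made-up counts the user increment is 100 and the predicted range is roughly 2850 to 2880, so a month 11 volume of 2860 falls inside it.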
In addition, if the detection result is incorrect, for example, the actual data amount is normal but the detection is abnormal, or the actual data amount is abnormal but the detection is normal, the above detection mode needs to be optimized, for example, the value of N is adjusted, as shown in fig. 2, which includes:
step 201, collecting sample data through a time window, wherein the starting length of the time window is M batches, the duration of each batch is T, the starting position of the time window is the previous batch of the batches to be detected, and the ending position of the time window is the previous M batches of the batches to be detected; when sample data is collected each time, the starting position of the time window is unchanged, and the ending position of the time window is moved forward by P batches compared with the last collection; and the times of acquiring the sample data by adopting the time window is preset times.
Step 202, calculating the error percentage corresponding to the sample data acquired each time through the time window, and taking the batch number of the sample data corresponding to the error percentage with the minimum absolute value as the value of N.
Calculating the error percentage for the sample data collected each time includes:
calculating, for each batch in the sample data collected this time, the coefficient of the non-basic table to be detected, and computing the average coefficient of the non-basic table;
calculating the predicted average data volume of the non-basic table to be detected in the batch to be detected as: the data volume of the non-basic table in the previous batch + the user increment × the average coefficient;
calculating the error percentage of the non-basic table to be detected for this sample data as: the ratio of the predicted average data volume to the data volume of the non-basic table in the batch to be detected, minus 1.
Here, besides predicting the data volume with the average coefficient, the data volume may also be predicted with a linear regression prediction value, a compound growth rate estimate, or the like.
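The text names these alternatives without specifying how they are computed. The sketch below is therefore only one plausible reading: it assumes both predictors are applied to the series of historical data volumes of the non-basic table, ordered oldest first, and forecast the next batch. The function names are hypothetical.

```python
from typing import Sequence


def linear_regression_forecast(volumes: Sequence[float]) -> float:
    """Fit an ordinary least-squares line over batch index 0..n-1 and extrapolate to index n."""
    n = len(volumes)
    if n < 2:
        raise ValueError("at least two batches are required")
    mean_x = (n - 1) / 2
    mean_y = sum(volumes) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(volumes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


def compound_growth_forecast(volumes: Sequence[float]) -> float:
    """Apply the compound per-batch growth rate between the first and last batch once more."""
    n = len(volumes)
    if n < 2 or volumes[0] <= 0 or volumes[-1] <= 0:
        raise ValueError("need at least two batches with positive volumes")
    rate = (volumes[-1] / volumes[0]) ** (1 / (n - 1)) - 1
    return volumes[-1] * (1 + rate)
```

Either forecast could stand in for the predicted average data volume when the error percentage is computed; which one fits better depends on whether the table grows roughly linearly or roughly geometrically.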
When the value of N is adjusted, sample data may be collected by means of a time window whose starting length is M batches. If M is 3, sample data of 3 consecutive batches is collected the 1st time. The starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position is the M-th batch preceding the batch to be detected. If the batch to be detected is the 12th batch, the starting position is the 11th batch and the ending position is the 9th batch, so the 1st collection covers the 3 consecutive batches 11, 10 and 9. Each time sample data is collected, the starting position stays unchanged (the 11th batch) and the ending position moves P batches further back than in the previous collection. If P is 1, the ending position for the 2nd collection moves back by 1 batch from the 9th batch to the 8th batch, so the 2nd collection covers the 4 consecutive batches 11, 10, 9 and 8, the 3rd collection covers the 5 consecutive batches 11, 10, 9, 8 and 7, and so on until sample data has been collected the preset number of times.
For the sample data of the 3 batches collected the 1st time (batches 11, 10 and 9), the coefficient of the non-basic table is calculated for each of the 3 batches, and the average coefficient of the non-basic table (the mean of the 3 coefficients) is computed. The predicted average data volume of the non-basic table for the current batch (the 12th batch) is then calculated as: the data volume of the non-basic table in the previous batch (the 11th batch) + the user increment × the average coefficient. Finally, the error percentage of the non-basic table for the current batch (the 12th batch) is calculated as: (predicted average data volume / data volume of the non-basic table in the current batch) - 1.
By analogy, the error percentages for the sample data of the 4 preceding batches (batches 11, 10, 9 and 8) and of the 5 preceding batches (batches 11, 10, 9, 8 and 7) are calculated. The absolute values of the three error percentages are then compared, and the number of batches in the time window whose error percentage has the smallest absolute value is taken as the value of N. Assuming the error percentage for the 4 preceding batches has the smallest absolute value, N is set to 4 and used for subsequent data detection.
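The window search described above can be sketched as follows. This is an illustrative sketch under stated assumptions: histories are lists ordered oldest first and ending at the batch just before the batch to be detected; the names `error_percentage` and `choose_n`, the default limits of 3 and 12 batches, and all numeric values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def error_percentage(non_base: Sequence[float], base: Sequence[float],
                     base_target: float, non_base_target: float) -> float:
    """(predicted average data volume / actual data volume) - 1 for one window."""
    coefs = [nb / b for nb, b in zip(non_base, base)]
    avg_coef = sum(coefs) / len(coefs)
    increment = base_target - base[-1]            # user increment
    predicted = non_base[-1] + increment * avg_coef
    return predicted / non_base_target - 1


def choose_n(non_base_hist: Sequence[float], base_hist: Sequence[float],
             base_target: float, non_base_target: float,
             m: int = 3, p: int = 1, times: int = 3,
             lower: int = 3, upper: int = 12) -> Tuple[Optional[int], Optional[float]]:
    """Try windows of length m, m+p, ... and keep the one with the smallest |error|."""
    best_n, best_err = None, None
    for k in range(times):
        length = m + k * p
        if not (lower <= length <= upper) or length > len(base_hist):
            continue  # skip windows outside the allowed range or the available history
        err = error_percentage(non_base_hist[-length:], base_hist[-length:],
                               base_target, non_base_target)
        if best_err is None or abs(err) < abs(best_err):
            best_n, best_err = length, err
    return best_n, best_err


# Made-up volumes for batches 7..11; batch 12 is the batch to be detected.
base_hist = [1000, 1200, 1400, 1600, 1800]
non_base_hist = [2000, 3000, 2800, 3600, 4500]
n, err = choose_n(non_base_hist, base_hist, base_target=2000, non_base_target=4960)
print(n)  # 4
```

With these made-up figures the 4-batch window gives the smallest absolute error, so N becomes 4, mirroring the assumption in the text. Ties are resolved in favor of the shorter window, which is a choice of this sketch rather than something the source specifies.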
It should be noted that:
1. The optimization should not be performed frequently, because frequent optimization increases the risk of "overfitting"; it can be performed once per fixed period (for example, every half year or every year), and when it is performed, the current batch is the batch to be detected;
2. The time window length needs a lower limit and an upper limit, for example 3 and 12 respectively, meaning that at least 3 and at most 12 consecutive batches are collected. This avoids unstable detection caused by "too few samples" and insufficiently sensitive detection caused by "too many samples".
In order to implement the foregoing method for detecting an anomaly of a data volume, the present invention further provides an apparatus for detecting an anomaly of a data volume, where the apparatus is applied to a data system, the data system includes at least one service type, and each service type has a basic table and at least one non-basic table, as shown in fig. 3, the apparatus includes:
the acquisition module 10 is configured to acquire, for one service type, N batches of sample data, where a duration of each batch is T, and the sample data of each batch includes a data size of one basic table and data sizes of all non-basic tables;
a calculating module 20, configured to count, for any one non-basic table, a coefficient of the non-basic table corresponding to each batch according to a data amount of the non-basic table and a data amount of the basic table; according to the coefficients of N batches corresponding to the non-basic table, counting the maximum coefficient and the minimum coefficient corresponding to the non-basic table;
the prediction module 30 is configured to calculate a predicted maximum data size and a predicted minimum data size corresponding to the N +1 th batch of the non-base table according to the maximum coefficient and the minimum coefficient;
and the detecting module 40 is configured to detect whether the data volume of the non-basic table in the (N + 1) th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume.
In the basic table and the non-basic table, each table comprises at least one record, and the data volume is the number of records contained in the table;
the basic table is used for recording basic data of a user, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user.
When computing the coefficient of the non-basic table for each batch from the data volume of the non-basic table and the data volume of the basic table, the calculating module 20 is further configured to determine, for any batch, the coefficient of the non-basic table for that batch as: the ratio of the data volume of the non-basic table in the batch to the data volume of the basic table in the batch.
When the predicted maximum data volume and the predicted minimum data volume of the non-basic table for the (N + 1) th batch are calculated, the prediction module 30 is further configured to: collect the data volume of the basic table in the (N + 1) th batch; subtract the data volume of the basic table in the Nth batch from the data volume of the basic table in the (N + 1) th batch to obtain the user increment; calculate the predicted maximum data volume of the non-basic table in the (N + 1) th batch as: the data volume of the non-basic table in the Nth batch + the user increment × the maximum coefficient; and calculate the predicted minimum data volume of the non-basic table in the (N + 1) th batch as: the data volume of the non-basic table in the Nth batch + the user increment × the minimum coefficient.
The detecting module 40 is further configured to determine that the data volume of the non-base table in the (N + 1) th batch is normal when the data volume of the non-base table in the (N + 1) th batch is greater than or equal to the predicted minimum data volume and less than or equal to the predicted maximum data volume, and otherwise determine that the data volume of the non-base table in the (N + 1) th batch is abnormal.
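The work of the calculating, prediction and detecting modules can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the input lists hold the N sample batches oldest first, and the numbers are made up.

```python
from typing import List, Sequence, Tuple


def batch_coefficients(non_base: Sequence[float], base: Sequence[float]) -> List[float]:
    """Coefficient per batch: non-basic table volume divided by basic table volume."""
    if len(non_base) != len(base) or not base:
        raise ValueError("both series must be non-empty and of equal length")
    return [nb / b for nb, b in zip(non_base, base)]


def predicted_bounds(non_base: Sequence[float], base: Sequence[float],
                     base_next: float) -> Tuple[float, float]:
    """Predicted (minimum, maximum) volume of the non-basic table in batch N+1."""
    coefs = batch_coefficients(non_base, base)
    increment = base_next - base[-1]              # user increment
    a = non_base[-1] + increment * min(coefs)
    b = non_base[-1] + increment * max(coefs)
    # Ordering guard added by this sketch: a negative increment would swap the bounds.
    return min(a, b), max(a, b)


def is_normal(actual: float, bounds: Tuple[float, float]) -> bool:
    """Normal when predicted minimum <= actual <= predicted maximum."""
    low, high = bounds
    return low <= actual <= high


# Made-up volumes for N = 5 sample batches and the basic table volume of batch N+1.
base = [1000, 1200, 1400, 1600, 1800]
non_base = [2000, 3000, 2800, 3600, 4500]
bounds = predicted_bounds(non_base, base, base_next=2000)
print(bounds)                   # (4900.0, 5000.0)
print(is_normal(4950, bounds))  # True
print(is_normal(5200, bounds))  # False
```

Here the coefficients range from 2.0 to 2.5 and the user increment is 200, so the predicted interval for batch N+1 is 4500 + 200 × 2.0 to 4500 + 200 × 2.5.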
If the detection result is wrong, the device further comprises an optimization module 50, configured to adjust the value of N for the to-be-detected batch;
the acquisition module 10 is further configured to acquire sample data through a time window, where the starting length of the time window is M batches, the starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position of the time window is the M-th batch preceding the batch to be detected; each time sample data is collected, the starting position of the time window is unchanged, and the ending position moves P batches further back than in the previous collection; sample data is collected through the time window a preset number of times;
the optimization module 50 is further configured to calculate an error percentage corresponding to sample data acquired each time through the time window, and use the batch number of the sample data corresponding to the error percentage with the smallest absolute value as the value of N.
Wherein, when calculating the error percentage corresponding to the sample data collected each time:
the calculating module 20 is further configured to calculate a coefficient of the non-basic table to be detected corresponding to each batch in the sample data acquired this time, and count an average coefficient corresponding to the non-basic table;
the prediction module 30 is further configured to calculate the predicted average data volume of the non-basic table to be detected in the batch to be detected as: the data volume of the non-basic table in the previous batch + the user increment × the average coefficient;
the optimization module 50 is further configured to calculate the error percentage of the non-basic table to be detected for this sample data as: the ratio of the predicted average data volume to the data volume of the non-basic table in the batch to be detected, minus 1.
In addition, an embodiment of the present invention further provides an apparatus, including:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the above-described method for detecting an abnormality in the amount of data.
Another embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the above-described method for detecting an abnormality in an amount of data.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (11)

1. A method for detecting anomalies in data volumes, the method being applied to a data system comprising at least one service type, each service type having a basic table and at least one non-basic table, the method comprising:
acquiring N batches of sample data aiming at one service type, wherein the time length of each batch is T, and the sample data of each batch comprises the data volume of a basic table and the data volume of all non-basic tables; n is a positive integer, and T is greater than zero;
counting the coefficients of each batch of the non-basic table according to the data quantity of the non-basic table and the data quantity of the basic table aiming at any one non-basic table;
according to the coefficients of N batches corresponding to the non-basic table, counting the maximum coefficient and the minimum coefficient corresponding to the non-basic table;
calculating the predicted maximum data volume and the predicted minimum data volume corresponding to the (N + 1) th batch of the non-basic table according to the maximum coefficient and the minimum coefficient;
and detecting whether the data volume of the non-basic table in the (N + 1) th batch is abnormal or not according to the predicted maximum data volume and the predicted minimum data volume.
2. The method of claim 1,
in the basic table and the non-basic table, each table comprises at least one record, and the data volume is the number of records contained in the table;
the basic table is used for recording basic data of a user, and each record corresponds to a unique user identifier; after the basic data of the user is generated, the non-basic table is used for recording the associated data generated by the user.
3. The method of claim 2, wherein the counting the coefficients of the non-base table corresponding to each batch according to the data amount of the non-base table and the data amount of the base table comprises:
for any batch, the coefficients of the non-base table corresponding to that batch are: the ratio of the amount of data in the batch of the non-base table to the amount of data in the batch of the base table.
4. The method of claim 2, wherein calculating the predicted maximum data size and the predicted minimum data size for the non-base table in the (N + 1) th batch comprises:
collecting the data volume of the basic table of the (N + 1) th batch;
subtracting the data volume of the basic table of the Nth batch from the data volume of the basic table of the (N + 1) th batch to obtain a user increment;
the predicted maximum data volume of the non-basic table in the (N + 1) th batch is: the data volume of the non-basic table in the Nth batch + the user increment × the maximum coefficient;
the predicted minimum data volume of the non-basic table in the (N + 1) th batch is: the data volume of the non-basic table in the Nth batch + the user increment × the minimum coefficient.
5. The method of claim 4, wherein said detecting whether the data volume of the non-basic table in the (N + 1) th batch is abnormal according to the predicted maximum data volume and the predicted minimum data volume comprises:
if the data volume of the non-basic table in the (N + 1) th batch is larger than or equal to the predicted minimum data volume and smaller than or equal to the predicted maximum data volume, determining that the data volume of the non-basic table in the (N + 1) th batch is normal, otherwise, determining that the data volume of the non-basic table in the (N + 1) th batch is abnormal.
6. The method of claim 5, wherein said collected N batches of sample data do not contain data that has been detected as anomalous.
7. The method of claim 4, wherein if the detection result is incorrect, the method further comprises adjusting the value of N for the batch to be detected, including:
collecting sample data through a time window, wherein the starting length of the time window is M batches, the starting position of the time window is the batch immediately preceding the batch to be detected, and the ending position of the time window is the M-th batch preceding the batch to be detected; each time sample data is collected, the starting position of the time window is unchanged, and the ending position moves P batches further back than in the previous collection; sample data is collected through the time window a preset number of times;
and calculating the error percentage corresponding to the sample data acquired each time through the time window, and taking the batch number of the sample data corresponding to the error percentage with the minimum absolute value as the value of N.
8. The method of claim 7, wherein calculating the error percentage corresponding to the sample data acquired each time through the time window comprises:
calculating the coefficient of the non-basic table to be detected corresponding to each batch in the sample data acquired this time for any sample data acquired once, and counting the average coefficient corresponding to the non-basic table;
calculating the predicted average data volume of the non-basic table to be detected in the batch to be detected as: the data volume of the non-basic table in the previous batch + the user increment × the average coefficient;
calculating the error percentage of the non-basic table to be detected for this sample data as: the ratio of the predicted average data volume to the data volume of the non-basic table in the batch to be detected, minus 1.
9. An apparatus for detecting anomalies in data volumes, the apparatus being applied to a data system comprising at least one service type, each service type having a basic table and at least one non-basic table, the apparatus comprising:
the acquisition module is used for acquiring N batches of sample data aiming at one service type, the time length of each batch is T, and the sample data of each batch comprises the data volume of one basic table and the data volume of all non-basic tables;
the calculation module is used for counting the coefficient of each batch corresponding to the non-basic table according to the data quantity of the non-basic table and the data quantity of the basic table aiming at any one non-basic table; according to the coefficients of N batches corresponding to the non-basic table, counting the maximum coefficient and the minimum coefficient corresponding to the non-basic table;
the prediction module is used for calculating the predicted maximum data size and the predicted minimum data size corresponding to the (N + 1) th batch of the non-basic table according to the maximum coefficient and the minimum coefficient;
and the detection module is used for detecting whether the data volume of the non-basic table in the (N + 1) th batch is abnormal or not according to the predicted maximum data volume and the predicted minimum data volume.
10. An apparatus, characterized in that the apparatus comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202011478233.4A 2020-12-15 2020-12-15 Data volume anomaly detection method, device, storage medium and equipment Active CN112635031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011478233.4A CN112635031B (en) 2020-12-15 2020-12-15 Data volume anomaly detection method, device, storage medium and equipment


Publications (2)

Publication Number Publication Date
CN112635031A true CN112635031A (en) 2021-04-09
CN112635031B CN112635031B (en) 2023-08-29

Family

ID=75313495


Country Status (1)

Country Link
CN (1) CN112635031B (en)






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant