WO2022252079A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2022252079A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
threshold
sample
data
focus
Prior art date
Application number
PCT/CN2021/097480
Other languages
English (en)
French (fr)
Inventor
王瑜
王川
王海金
贺王强
柴栋
吴建民
雷一鸣
王洪
Original Assignee
京东方科技集团股份有限公司
北京中祥英科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 and 北京中祥英科技有限公司
Priority to PCT/CN2021/097480 (published as WO2022252079A1)
Priority to CN202180001379.6A (published as CN115943372A)
Publication of WO2022252079A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition

Definitions

  • the present disclosure relates to the field of data processing, and in particular to a data processing method and device.
  • a data processing method includes: first, obtaining sample data in response to a user's input operation on a graphical interface, the sample data including feature data and detection data of the samples; then, displaying a sample distribution map on the graphical interface based on the sample data; then, obtaining a focus threshold used to divide the samples into positive and negative samples, displaying a focus threshold mark in the sample distribution map of the graphical interface, and displaying the positive and negative samples differently based on the focus threshold, wherein the focus threshold is determined based on the detection data of the samples; and finally, determining the cause of the sample abnormality based on the positive and negative samples.
  • the focus threshold includes a first focus threshold, and there are one or more first focus thresholds. Obtaining the focus threshold used to divide the positive and negative samples, displaying the focus threshold mark in the sample distribution map of the graphical interface, and displaying the positive and negative samples differently based on the focus threshold includes: receiving the user's setting operation on the first focus threshold, displaying a first focus threshold mark in the sample distribution map of the graphical interface, and displaying the positive and negative samples differently based on the first focus threshold.
  • the first focus threshold includes a first value, and displaying the positive and negative samples differently based on the first focus threshold includes: distinguishing the display of positive and negative samples based on the magnitude relationship between the detection data of each sample and the first value.
  • the first focus threshold includes a second value and a third value, the second value being smaller than the third value, and displaying the positive and negative samples differently based on the first focus threshold includes: distinguishing the display of positive and negative samples based on whether the detection data of each sample is greater than the second value and less than the third value.
  • the focus threshold further includes a second focus threshold, and the number of samples is N. Obtaining the focus threshold used to divide the positive and negative samples, displaying the focus threshold mark in the sample distribution map of the graphical interface, and displaying the positive and negative samples differently based on the focus threshold includes: arranging the detection data of the N samples in ascending order and using the median or mean of the detection data of the N samples as a reference focus value; determining the second focus threshold based on the reference focus value and the detection data of the N samples; and displaying a second focus threshold mark in the sample distribution map of the graphical interface and displaying the positive and negative samples differently based on the second focus threshold.
  • the above method further includes: filtering the sample data based on the user's filtering operation on the filtering threshold, and displaying a distribution map of the filtered samples on a graphical interface.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the ratio of the number of abnormal sub-samples in each sample to the total number of sub-samples included in the sample; and the arrival rate indicates the ratio of the number of sub-samples actually detected in each sample to the total number of sub-samples included in the sample.
  • the filtering operation includes a setting operation and a selection operation.
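  • The filtering described above can be sketched roughly as follows. This is only an illustration: the `Sample` fields, the parameter names, and the direction of each comparison (a minimum arrival rate, a maximum abnormality rate, an exact match on equipment) are assumptions made for the example and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Sample:
    """One sample (for example one glass sheet); all field names are illustrative."""
    sample_id: str
    equipment: str
    abnormality_rate: float  # abnormal sub-samples / total sub-samples
    arrival_rate: float      # detected sub-samples / total sub-samples


def filter_samples(
    samples: Iterable[Sample],
    min_arrival_rate: Optional[float] = None,
    max_abnormality_rate: Optional[float] = None,
    equipment: Optional[str] = None,
) -> List[Sample]:
    """Keep the samples that pass every filtering threshold that was set."""
    kept = []
    for s in samples:
        # Numeric thresholds correspond to a "setting operation" (assumed direction).
        if min_arrival_rate is not None and s.arrival_rate < min_arrival_rate:
            continue
        if max_abnormality_rate is not None and s.abnormality_rate > max_abnormality_rate:
            continue
        # Categorical thresholds correspond to a "selection operation".
        if equipment is not None and s.equipment != equipment:
            continue
        kept.append(s)
    return kept
```

Thresholds left as `None` are simply not applied, which mirrors the "at least one of" wording of the filtering threshold.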
  • the feature data of the sample includes at least one of product model, detection site, abnormal type, arrival rate, production equipment, environmental parameters, detection time, or generation time.
  • the detection data of the above-mentioned samples include at least one of abnormality rate or measurement parameters.
  • a data processing method includes: first, acquiring sample data, the sample data including feature data and detection data of the samples; then, determining a focus threshold based on the detection data of the samples; then, dividing the samples into positive and negative samples based on the focus threshold; and finally, determining the cause of the sample abnormality based on the positive and negative samples.
  • the focus threshold includes a second focus threshold, and the number of samples is N. Determining the focus threshold based on the detection data of the samples includes: arranging the detection data of the N samples in ascending order and using the median or mean of the detection data of the N samples as a reference focus value; and determining the second focus threshold based on the reference focus value and the detection data of the N samples.
  • the above method further includes: filtering the sample data based on a filtering threshold.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the ratio of the number of abnormal sub-samples in each sample to the total number of sub-samples included in the sample; and the arrival rate indicates the ratio of the number of sub-samples actually detected in each sample to the total number of sub-samples included in the sample.
  • the feature data of the sample includes at least one of product model, detection site, abnormal type, arrival rate, production equipment, environmental parameters, detection time, or generation time.
  • the detection data of the above-mentioned samples include at least one of abnormality rate or measurement parameters.
  • a data processing device includes: an acquisition module, configured to acquire sample data in response to a user's input operation on a graphical interface, the sample data including feature data and detection data of the samples; a display module, configured to display a sample distribution map on the graphical interface based on the sample data acquired by the acquisition module; the acquisition module being further configured to obtain a focus threshold used to divide the samples into positive and negative samples; the display module being further configured to display a focus threshold mark in the sample distribution map and to display the positive and negative samples differently based on the focus threshold, wherein the focus threshold is determined based on the detection data of the samples; and a determination module, configured to determine the cause of the sample abnormality based on the positive and negative samples.
  • the focus threshold includes a first focus threshold, and there are one or more first focus thresholds; the acquisition module is further configured to receive a user's setting operation on the first focus threshold; the display module is further configured to display a first focus threshold mark in the sample distribution map of the graphical interface and to display the positive and negative samples differently based on the first focus threshold.
  • the first focus threshold includes a first value, and the display module is specifically configured to distinguish the display of positive and negative samples based on the magnitude relationship between the detection data of each sample and the first value.
  • the first focus threshold includes a second value and a third value, the second value being smaller than the third value, and the display module is specifically configured to distinguish the display of positive and negative samples based on whether the detection data of each sample is greater than the second value and less than the third value.
  • the focus threshold includes a second focus threshold, the number of samples is N, and the acquisition module is specifically configured to: arrange the detection data of the N samples in ascending order and use the median or mean of the detection data of the N samples as a reference focus value; and determine the second focus threshold based on the reference focus value and the detection data of the N samples. The second focus threshold mark is displayed in the sample distribution map of the graphical interface, and the positive and negative samples are displayed differently based on the second focus threshold.
  • the acquisition module is further configured to perform the following steps: step a, averaging those detection data of the N samples that are less than or equal to the reference focus value to obtain a first mean Mean_l, and averaging those detection data of the N samples that are greater than the reference focus value to obtain a second mean Mean_u; step b, subtracting the first mean Mean_l from each of the sorted detection data of the N samples, one by one, and taking the absolute value to obtain a first mean difference DiffLowerMean = [l_1, l_2, l_3, ..., l_N].
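  • Steps a and b above can be sketched as follows. This covers only the two steps that appear in this text; how the second focus threshold is finally derived from these quantities is not shown here and is not guessed at. The function name `steps_a_and_b` and the `use_median` switch are hypothetical, and "compared one by one" is read as taking a difference.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple


def steps_a_and_b(
    detection_data: Sequence[float], use_median: bool = True
) -> Tuple[float, float, List[float]]:
    """Return (Mean_l, Mean_u, DiffLowerMean) for the N detection values."""
    # Arrange the detection data of the N samples in ascending order.
    ordered = sorted(detection_data)
    # The reference focus value is the median or the mean of the N values.
    reference = median(ordered) if use_median else mean(ordered)

    # Step a: average the values at or below, and the values above, the reference.
    lower = [v for v in ordered if v <= reference]
    upper = [v for v in ordered if v > reference]
    if not upper:
        # Happens when all values are equal; Mean_u is then undefined.
        raise ValueError("no detection value lies above the reference focus value")
    mean_l = mean(lower)
    mean_u = mean(upper)

    # Step b: absolute difference between each sorted value and Mean_l.
    diff_lower_mean = [abs(v - mean_l) for v in ordered]
    return mean_l, mean_u, diff_lower_mean
```

For the values 1, 2, 3 and 4 the median is 2.5, so Mean_l is 1.5, Mean_u is 3.5, and DiffLowerMean is [0.5, 0.5, 1.5, 2.5].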
  • the data processing device further includes a screening module; the screening module is configured to filter the sample data based on the user's filtering operation on the filtering threshold; the display module is further configured to display a distribution map of the filtered samples on the graphical interface.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the ratio of the number of abnormal sub-samples in each sample to the total number of sub-samples included in the sample; and the arrival rate indicates the ratio of the number of sub-samples actually detected in each sample to the total number of sub-samples included in the sample.
  • the filtering operation includes a setting operation and a selection operation.
  • the feature data of the sample includes at least one of product model, detection site, abnormal type, arrival rate, production equipment, environmental parameters, detection time, or generation time.
  • the detection data of the above-mentioned samples include at least one of abnormality rate or measurement parameters.
  • a data processing device includes: an acquisition module, configured to acquire sample data, the sample data including feature data and detection data of the samples; a determination module, configured to determine a focus threshold based on the detection data of the samples; and a division module, configured to divide the samples into positive and negative samples based on the focus threshold; the determination module is further configured to determine the cause of the sample abnormality based on the positive and negative samples.
  • the focus threshold includes a second focus threshold, the number of samples is N, and the determination module is specifically configured to: arrange the detection data of the N samples in ascending order and use the median or mean of the detection data of the N samples as a reference focus value; and determine the second focus threshold based on the reference focus value and the detection data of the N samples.
  • the determination module is further configured to perform the following steps: step a, averaging those detection data of the N samples that are less than or equal to the reference focus value to obtain a first mean Mean_l, and averaging those detection data of the N samples that are greater than the reference focus value to obtain a second mean Mean_u; step b, subtracting the first mean Mean_l from each of the sorted detection data of the N samples, one by one, and taking the absolute value to obtain a first mean difference DiffLowerMean = [l_1, l_2, l_3, ..., l_N].
  • the above data processing device further includes a screening module, configured to: screen the sample data based on a filtering threshold.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the ratio of the number of abnormal sub-samples in each sample to the total number of sub-samples included in the sample; and the arrival rate indicates the ratio of the number of sub-samples actually detected in each sample to the total number of sub-samples included in the sample.
  • the feature data of the sample includes at least one of product model, detection site, abnormal type, arrival rate, production equipment, environmental parameters, detection time, or generation time.
  • the detection data of the above-mentioned samples include at least one of abnormality rate or measurement parameters.
  • in yet another aspect, a data processing device is provided. The device includes a memory and a processor coupled to each other; the memory is used to store computer program code that includes computer instructions; when the processor executes the computer instructions, the device is caused to execute one or more steps of the data processing method described in any of the above embodiments.
  • a non-transitory computer-readable storage medium stores computer program instructions that, when run on a processor, cause the processor to execute one or more steps of the data processing method described in any of the above embodiments.
  • a computer program product includes computer program instructions that, when executed on a computer, cause the computer to execute one or more steps of the data processing method described in any of the above embodiments.
  • FIG. 1 is a structural diagram of a data processing device according to some embodiments.
  • Fig. 2 is a flowchart of a data processing method according to some embodiments.
  • Fig. 3 is a display effect diagram of a data processing method according to some embodiments.
  • Fig. 4 is another display effect diagram of a data processing method according to some embodiments.
  • FIG. 5 is another flowchart of a data processing method according to some embodiments.
  • Fig. 6 is another display effect diagram of a data processing method according to some embodiments.
  • FIG. 7 is another flowchart of a data processing method according to some embodiments.
  • Fig. 8 is another display effect diagram of a data processing method according to some embodiments.
  • Fig. 9 is another display effect diagram of a data processing method according to some embodiments.
  • Fig. 10 is another display effect diagram of the data processing method according to some embodiments.
  • Fig. 11 is another flowchart of a data processing method according to some embodiments.
  • Fig. 12 is another flowchart of a data processing method according to some embodiments.
  • Fig. 13 is another flowchart of a data processing method according to some embodiments.
  • Fig. 14 is another structural diagram of a data processing device according to some embodiments.
  • Fig. 15 is another structural diagram of a data processing device according to some embodiments.
  • Fig. 16 is another structural diagram of a data processing device according to some embodiments.
  • Fig. 17 is another structural diagram of a data processing device according to some embodiments.
  • the terms “first” and “second” are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined as “first” or “second” may explicitly or implicitly include one or more of such features. In the description of the embodiments of the present disclosure, unless otherwise specified, “plurality” means two or more.
  • the expressions “coupled” and “connected” and their derivatives may be used.
  • the term “connected” may be used in describing some embodiments to indicate that two or more elements are in direct physical or electrical contact with each other.
  • the term “coupled” may be used when describing some embodiments to indicate that two or more elements are in direct physical or electrical contact.
  • the terms “coupled” or “communicatively coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
  • the embodiments disclosed herein are not necessarily limited by the context herein.
  • “at least one of A, B and C” has the same meaning as “at least one of A, B or C”, and both include the following combinations of A, B and C: A only, B only, C only, a combination of A and B, a combination of A and C, a combination of B and C, and a combination of A, B and C.
  • “A and/or B” includes the following three combinations: A only, B only, and a combination of A and B.
  • the term “if” is optionally interpreted to mean “when” or “at” or “in response to determining” or “in response to detecting,” depending on the context.
  • the phrases “if it is determined that ...” or “if [the stated condition or event] is detected” are optionally construed to mean “upon determining ...”, “in response to determining ...”, “upon detection of [the stated condition or event]”, or “in response to detection of [the stated condition or event]”, depending on the context.
  • an embodiment of the present disclosure provides a data processing method.
  • the method visually displays the distribution of the sample data through a graphical interface, filters the sample data through threshold settings, and divides the samples into positive and negative samples in a reasonable way, which makes the data analysis more accurate.
  • the data processing method provided by the embodiments of the present disclosure can be applied to a general data analysis platform (machine learning platform), and can also be applied to a data analysis platform (production data analysis system) for specific scenarios.
  • the execution body of the data processing method provided by the embodiment of the present disclosure is a data processing device.
  • the data processing apparatus may be a terminal device or a server.
  • the specific form of the data processing apparatus is not particularly limited in the embodiments of the present disclosure, and it is only an exemplary description here.
  • the data processing device 100 includes at least one processor 101, a memory 102, a transceiver 103 and a communication bus 104.
  • the processor 101 is the control center of the data processing device, and may be one processor, or may be a general term for multiple processing elements.
  • the processor 101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
  • the processor 101 can execute various functions of the data processing device by running or executing software programs stored in the memory 102 and calling data stored in the memory 102 .
  • the processor 101 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 1 .
  • the data processing apparatus may include multiple processors, for example, the processor 101 and the processor 105 shown in FIG. 1 .
  • each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • a processor herein may refer to one or more detection devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • the memory 102 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.); a magnetic disk storage medium or other magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 102 may exist independently, and is connected to the processor 101 through the communication bus 104 .
  • the memory 102 can also be integrated with the processor 101 .
  • the memory 102 is used to store a software program for executing the solution of the present disclosure, and the execution is controlled by the processor 101 .
  • the transceiver 103 is used for communicating with other communication devices.
  • the transceiver 103 can also be used to communicate with a communication network, such as Ethernet, radio access network (radio access network, RAN), wireless local area network (wireless local area networks, WLAN) and so on.
  • the transceiver 103 may include a receiving unit to implement a receiving function, and a sending unit to implement a sending function.
  • the communication bus 104 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 1 , but it does not mean that there is only one bus or one type of bus.
  • the structure of the data processing device shown in FIG. 1 does not constitute a limitation on the data processing device, which may include more or fewer components than shown in the figure, combine some components, or use a different arrangement of components.
  • as shown in FIG. 2, a data processing method provided by an embodiment of the present disclosure includes the following steps:
  • the sample data includes feature data and detection data of the sample.
  • the detection data of each sample may be the degree of abnormality of a certain event.
  • the detection data of each sample can be the abnormal rate of the product.
  • the abnormality rate is used to indicate the ratio of the number of abnormal sub-samples in each sample to the total number of sub-samples included in the sample.
  • the detection data of each sample may also be measurement parameters of the sample, for example, parameters such as voltage, current, and power of the sample.
  • each piece of glass can be cut into multiple panels after various processes, and each panel then enters the inspection station for defect inspection.
  • the detection data of the sample can be the abnormality rate (Ratio) of the sample, that is, the ratio of the number of defective panels cut from each glass sheet to the total number of panels cut from that sheet.
  • the feature data of the sample may include, but is not limited to, feature parameters such as product model, detection site, abnormal type, generation time, production equipment, environmental parameters, detection time, and arrival rate.
  • each sample may include multiple sub-samples, and the arrival rate of the sample is used to indicate the ratio of the number of sub-samples actually detected in each sample to the total number of sub-samples included in the sample.
  • each piece of glass can be cut into multiple panels after various processes, and each panel then enters the inspection station for defect inspection.
  • the arrival rate of each sample refers to the ratio of the number of panels from each glass sheet that arrive at the inspection site to the total number of panels cut from that sheet.
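  • As a small worked illustration of the two ratios defined above for the glass and panel example, the following sketch computes the abnormality rate and the arrival rate of one glass sheet. The function and parameter names are hypothetical; both ratios use the total number of panels cut from the sheet as the denominator, as stated in the text.

```python
def sub_sample_ratio(count: int, total: int) -> float:
    """Ratio of a sub-sample count to the total number of sub-samples."""
    if total <= 0:
        raise ValueError("a sample must contain at least one sub-sample")
    if not 0 <= count <= total:
        raise ValueError("count must lie between 0 and total")
    return count / total


def abnormality_rate(defective_panels: int, total_panels: int) -> float:
    """Defective panels of one glass sheet / panels cut from that sheet."""
    return sub_sample_ratio(defective_panels, total_panels)


def arrival_rate(arrived_panels: int, total_panels: int) -> float:
    """Panels that reached the inspection site / panels cut from that sheet."""
    return sub_sample_ratio(arrived_panels, total_panels)
```

For a sheet cut into 12 panels, of which 9 reach the inspection site and 3 are found defective, the arrival rate is 9/12 = 0.75 and the abnormality rate is 3/12 = 0.25.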
  • the abnormal type of the sample includes but is not limited to oil stains, corrosion, air bubbles, and the like.
  • the samples of the same abnormal type can be analyzed.
  • the generation time of the sample may be the production time or delivery time of the sample.
  • the environmental parameters of the sample include technological parameters of sample processing, temperature and pressure of the environment where the sample is processed, and other parameters.
  • the data processing device acquiring the sample data in response to the user's input operation on the graphical interface may include: the data processing device receives a setting operation in which the user enters feature data such as product model, detection site, production time, production equipment, and environmental parameters on the graphical interface, and acquires the sample data in response to that setting operation.
  • acquiring the sample data in response to the user's input operation on the graphical interface may also include: the data processing device receives an operation in which the user uploads a file (such as a csv file), and acquires the sample data in response to that operation.
  • the above methods for obtaining sample data include manual import by users, batch import and real-time data import.
  • the manual import includes the operation that the data processing device receives a file uploaded by a user (such as a csv file), and in response to this operation, the data processing device acquires sample data. That is, users can use the sample data collected by themselves as a sample set for abnormal diagnosis analysis.
  • batch import can import data in batches, either once or periodically, by calling an API or an HDFS address.
  • real-time data import can import data from data sources into the data processing device in real time through Kafka and ETL tools.
  • the embodiment of the present disclosure does not limit the specific manner in which the data processing apparatus acquires the sample data, and this is only an exemplary description.
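  • The manual import path (a user-uploaded csv file) might look roughly like the sketch below. The column names `abnormality_rate` and `arrival_rate` and the function name `load_samples` are assumptions for illustration; the disclosure does not define a file layout. Batch import and real-time import are not covered here.

```python
import csv
import io
from typing import Dict, List, TextIO

# Columns assumed to hold numeric detection or feature data (hypothetical names).
NUMERIC_COLUMNS = ("abnormality_rate", "arrival_rate")


def load_samples(handle: TextIO) -> List[Dict[str, object]]:
    """Read one sample per csv row, converting the numeric columns to float."""
    samples: List[Dict[str, object]] = []
    for row in csv.DictReader(handle):
        record: Dict[str, object] = dict(row)
        for column in NUMERIC_COLUMNS:
            value = row.get(column)
            if value:  # leave empty or missing cells untouched
                record[column] = float(value)
        samples.append(record)
    return samples


# Usage with an in-memory file standing in for the uploaded csv.
uploaded = io.StringIO(
    "glass_id,equipment,abnormality_rate,arrival_rate\n"
    "g1,EQ1,0.25,1.0\n"
    "g2,EQ2,,0.5\n"
)
loaded = load_samples(uploaded)
```

Empty cells are kept as empty strings rather than replaced with a default, so that missing detection data is not silently turned into a number.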
  • the abnormality rate Ratio and the measurement parameter Qtest may be used as judgment indicators for measuring sample abnormality, and the production equipment and environmental parameters of the sample may be used as the cause of the sample abnormality.
  • taking the case where the detection data of the sample includes the abnormality rate as an example, Table 1 shows the sample set when the defect type is Defect_code1.
  • the distribution graph of the samples shown in Table 1 can be displayed on the graphical interface, the horizontal axis of the distribution graph can be the generation time, and the vertical axis can be the abnormal rate.
  • the focus threshold is determined based on the detection data of the sample.
  • the focus threshold can divide samples into positive samples and negative samples.
  • positive samples can be called normal samples or non-abnormal samples, and negative samples can be called bad samples or abnormal samples.
  • the above focus threshold may be determined by the data processing device based on the detection data of the sample, or may be determined by the user according to the detection data of the sample.
  • the user can enter the determined focus threshold in the data processing device; the data processing device receives the user's setting operation on the focus threshold, displays the focus threshold mark in the sample distribution map of the graphical interface, and displays the positive and negative samples differently based on the focus threshold.
  • the data processing device or the user may determine the focus threshold based on the distribution of sample detection data.
  • the focus threshold may include a first focus threshold and a second focus threshold.
  • the acquisition of the focus threshold by the data processing device may include: the data processing device receives a user's setting operation on the first focus threshold.
  • the data processing device determines the second focus threshold according to the detection data of the sample. Two implementation manners for the data processing device to obtain the focus threshold are specifically described below.
  • the above step S203 includes: receiving the user's setting operation on the first focus threshold, displaying the first focus threshold mark in the sample distribution map of the graphical interface, and displaying the positive and negative samples differently based on the first focus threshold.
  • the detection data of the sample may be a measurement parameter.
  • the measurement parameter may be normal if it is greater than the threshold, and abnormal if it is less than the threshold. It is also possible that less than the threshold is normal and greater than the threshold is abnormal. It may also be normal within a range, and abnormal outside the range. It is also possible that it is abnormal within a range and normal outside the range. Users can set thresholds according to different parameters.
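  • The four cases listed above (normal above a threshold, normal below a threshold, normal inside a range, normal outside a range) can be captured with a small rule type, sketched below. The names `NormalRule` and `is_normal` are hypothetical, and treating a value exactly equal to a threshold as abnormal in the single-threshold cases is an assumption, since the text does not say how boundary values are handled.

```python
from enum import Enum
from typing import Optional


class NormalRule(Enum):
    ABOVE = "normal when greater than the threshold"
    BELOW = "normal when less than the threshold"
    INSIDE = "normal inside the range"
    OUTSIDE = "normal outside the range"


def is_normal(
    value: float, rule: NormalRule, low: float, high: Optional[float] = None
) -> bool:
    """Apply one of the four user-selectable rules to a measurement parameter."""
    if rule is NormalRule.ABOVE:
        return value > low
    if rule is NormalRule.BELOW:
        return value < low
    # The two range rules need both ends of the range.
    if high is None or not low < high:
        raise ValueError("range rules need low < high")
    inside = low < value < high
    return inside if rule is NormalRule.INSIDE else not inside
```

Keeping the rule separate from the threshold values lets the same numbers be reused for parameters whose normal direction differs, for example a voltage that must stay inside a range and a defect count that must stay below a limit.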
  • the above-mentioned first focus threshold may be a numerical value set by the user, or may be a range set by the user.
  • the first focus threshold includes the first value, and displaying the positive and negative samples differently based on the first focus threshold includes: distinguishing the display of positive and negative samples based on the magnitude relationship between the detection data of each sample and the first value.
  • the data processing device may classify samples with detection data greater than the first value as negative samples, and classify samples with detection data smaller than the first value as positive samples based on the magnitude relationship between the detection data of the samples and the first value.
  • samples with an abnormality rate greater than the first value can be classified as negative samples, that is, the samples above the first value shown in (a) in Figure 3 are negative samples, indicated by black dots.
  • the samples whose abnormality rate is less than the first numerical value are classified as positive samples, that is, the samples below the first numerical value shown in (a) in FIG. 3 are positive samples, represented by gray dots.
  • the data processing device may also classify samples with detected data greater than the first numerical value as positive samples, and classify samples with detected data smaller than the first numerical value as negative samples based on the magnitude relationship between the detected data of the samples and the first numerical value.
  • samples with an abnormality rate greater than the first value can be classified as positive samples, that is, the samples above the first value shown in (b) in Figure 3 are positive samples, indicated by gray dots.
  • the samples whose abnormality rate is less than the first numerical value are classified as negative samples, that is, the samples below the first numerical value shown in (b) in FIG. 3 are negative samples, which are represented by black dots.
  • the embodiment of the present disclosure does not limit whether the data processing device classifies samples with detection data greater than the first value as positive samples or as negative samples.
  • whether samples greater than the first value are classified as positive samples or negative samples can be determined according to the specific parameter type of the detection data.
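  • the single-value division described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `split_by_value`, the flag `greater_is_negative`, and the treatment of a sample exactly equal to the first value are all assumptions made for the example.

```python
from typing import List


def split_by_value(detection: List[float], first_value: float,
                   greater_is_negative: bool = True) -> List[str]:
    """Label each sample by comparing its detection data with first_value.

    greater_is_negative=True  mirrors panel (a): above the value -> negative.
    greater_is_negative=False mirrors panel (b): above the value -> positive.
    """
    labels = []
    for x in detection:
        # Assumption: a sample equal to first_value counts as "not greater".
        above = x > first_value
        is_negative = above if greater_is_negative else not above
        labels.append("negative" if is_negative else "positive")
    return labels


abnormality_rates = [0.02, 0.10, 0.35, 0.50]  # hypothetical data
print(split_by_value(abnormality_rates, 0.3))
print(split_by_value(abnormality_rates, 0.3, greater_is_negative=False))
```

  • keeping the direction as a parameter reflects the statement that the choice depends on the parameter type of the detection data.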
  • the first focus threshold may include a second value and a third value. In that case, distinguishing the data display effect of positive and negative samples based on the first focus threshold includes: distinguishing the data display effect of positive and negative samples based on whether the detection data of the sample is greater than the second value and smaller than the third value.
  • the second value and the third value may form a range
  • the data processing device may, based on whether the detection data of the sample falls within the range, classify samples with detection data greater than the second value and less than the third value as positive samples, and classify samples whose detection data is smaller than the second value or larger than the third value as negative samples.
  • samples whose abnormality rate is greater than the second value and smaller than the third value can be classified as positive samples, that is, the samples shown in (a) in Figure 4 whose abnormality rate lies above the second value and below the third value are positive samples, represented by gray dots.
  • the samples whose abnormality rate is less than the second value or greater than the third value are classified as negative samples, that is, the samples shown in (a) in Figure 4 whose abnormality rate lies below the second value or above the third value are negative samples, represented by black dots.
  • the second value and the third value may form a range
  • the data processing device may, based on whether the detection data of the sample falls within the range, classify samples whose detection data is less than the second value or greater than the third value as positive samples, and classify samples whose detection data is greater than the second value and less than the third value as negative samples.
  • samples whose abnormality rate is greater than the second value and smaller than the third value can be classified as negative samples, that is, the samples shown in (b) in Figure 4 whose abnormality rate lies above the second value and below the third value are negative samples, represented by black dots.
  • the samples whose abnormality rate is less than the second value or greater than the third value are classified as positive samples, that is, the samples shown in (b) in Figure 4 whose abnormality rate lies below the second value or above the third value are positive samples, represented by gray dots.
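  • the range-based division can be sketched in the same spirit. The names `split_by_range` and `inside_is_positive` are hypothetical, and the open interval (second value < x < third value) follows the wording "greater than the second value and smaller than the third value"; how boundary values are handled is otherwise an assumption.

```python
from typing import List


def split_by_range(detection: List[float], second_value: float,
                   third_value: float,
                   inside_is_positive: bool = True) -> List[str]:
    """Label samples by whether their detection data lies inside the range."""
    if second_value >= third_value:
        raise ValueError("second_value must be smaller than third_value")
    labels = []
    for x in detection:
        inside = second_value < x < third_value
        # inside_is_positive=True mirrors panel (a); False mirrors panel (b).
        is_positive = inside if inside_is_positive else not inside
        labels.append("positive" if is_positive else "negative")
    return labels


abnormality_rates = [0.1, 0.3, 0.6]  # hypothetical data
print(split_by_range(abnormality_rates, 0.2, 0.5))
print(split_by_range(abnormality_rates, 0.2, 0.5, inside_is_positive=False))
```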
  • acquiring the focus threshold for dividing positive and negative samples in the above step S203 may include: the data processing device acquires the focus threshold based on the distribution of sample detection data.
  • the data processing device may use the central tendency characteristics of the sample detection data, such as median and mean, as the reference focus threshold, and use the reference focus threshold as the second focus threshold.
  • the data processing device may use central tendency features of sample detection data such as median and mean as a reference focus threshold, and further determine the second focus threshold based on the distribution of detection samples divided by the reference focus threshold.
  • step S203 may include steps S2031-S2033.
  • the N samples may be samples screened in the following step S205, or samples not screened in step S205, which is not limited in the present disclosure.
  • the reference focus index may be a value corresponding to the reference focus threshold.
  • the reference focus index can be ⌈N/2⌉, where ⌈N/2⌉ indicates that N/2 is rounded up. For example, taking N as 401 as an example, the reference focus value is the median of the 401 detection data, and the reference focus index is 201.
  • the reference focus index may be the index of the detection data closest to the mean value. For example, among the detection data of N samples arranged in sequence, the detection data of the 600th sample is closest to the mean value, then the reference focus index may be determined as 600.
  • the embodiment of the present disclosure does not limit the specific method for determining the reference focus value.
  • the following embodiments are described by taking as an example the case where the reference focus index is ⌈N/2⌉ and the reference focus value is the ⌈N/2⌉-th detection data.
  • the data processing device may take the middle position of the total number of samples as the reference focus index FocusIndex. For example, if the total number of samples N is an even number, N/2 is taken as the reference focus index FocusIndex. If the total number of samples is odd, the middle position (N+1)/2, that is, N/2 rounded up, is taken as the reference focus index FocusIndex.
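  • as a small illustration of the rule above (N/2 rounded up, counted from 1), the reference focus index and value could be obtained as follows. The function name `reference_focus` is hypothetical, and the sketch assumes the detection data are first sorted from small to large.

```python
from typing import List, Tuple


def reference_focus(detection: List[float]) -> Tuple[int, float]:
    """Return the 1-based reference focus index and the value at that position."""
    sorted_data = sorted(detection)      # arrange from small to large
    n = len(sorted_data)
    if n == 0:
        raise ValueError("at least one sample is required")
    focus_index = (n + 1) // 2           # integer form of N/2 rounded up
    focus_value = sorted_data[focus_index - 1]  # convert to a 0-based position
    return focus_index, focus_value


# 401 hypothetical detection values 1..401: the index is 201, as in the example.
print(reference_focus(list(range(1, 402))))
# 1000 hypothetical values: N is even, so the index is N/2 = 500.
print(reference_focus(list(range(1, 1001))))
```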
  • the AutoFocus algorithm may be used to determine the second focus threshold Focus.
  • determining the second focus threshold may include the following steps:
  • Step a: average the detection data that are less than or equal to the reference focus value among the detection data of the N samples to obtain the first mean Mean_l, and average the detection data that are greater than the reference focus value to obtain the second mean Mean_u.
  • the data processing device may divide the detection data of the N samples that are less than or equal to the reference focus value into a LowerGroup, and the detection data greater than the reference focus value into an UpperGroup.
  • the detection data in the LowerGroup are averaged to obtain the first mean value Mean_l.
  • the detection data in the UpperGroup are averaged to obtain the second mean value Mean_u.
  • according to the reference focus value x_500, the data processing device averages x_1 to x_500 in SortedData to obtain Mean_l, and averages x_500 to x_1000 in SortedData to obtain Mean_u.
  • Step c repeat step a and step b until the value of the reference focus index remains unchanged before and after the update, and determine the second focus threshold based on the detection data corresponding to the reference focus index among the detection data of N samples arranged in sequence.
  • according to the reference focus value x_700, the data processing device averages x_1 to x_700 in SortedData to obtain Mean_l, and averages x_700 to x_1000 in SortedData to obtain Mean_u.
  • the detection data corresponding to the reference focus index may be determined as the second focus threshold.
  • the detection data corresponding to the reference focus index and the previous detection data may also be averaged to obtain the second focus threshold.
  • the present disclosure does not limit the specific method for determining the second focus threshold based on the detection data corresponding to the reference focus index.
  • the data processing device may determine the 750th abnormality rate x_750 in the array SortedData as the second focus threshold.
  • the second focus threshold can also be determined by averaging the 749th abnormality rate x_749 and the 750th abnormality rate x_750 in the array SortedData.
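  • the iteration of steps a to c can be sketched as below. Step b, which updates the reference focus index, is not reproduced in this passage, so the update rule used here is an assumption: the index moves to the last sorted value not exceeding the midpoint of Mean_l and Mean_u, as in classic iterative (ISODATA-style) threshold selection. The function name `auto_focus` and the `max_iter` guard are likewise hypothetical.

```python
from bisect import bisect_right
from statistics import mean
from typing import List, Tuple


def auto_focus(detection: List[float], max_iter: int = 100) -> Tuple[int, float]:
    """Return the final 1-based reference focus index and its detection value."""
    data = sorted(detection)
    n = len(data)
    if n < 2:
        raise ValueError("at least two samples are required")
    focus_index = (n + 1) // 2                   # start from N/2 rounded up
    for _ in range(max_iter):
        focus_value = data[focus_index - 1]
        split = bisect_right(data, focus_value)  # count of values <= focus_value
        if split == n:                           # upper group empty: stop
            break
        mean_l = mean(data[:split])              # step a: lower-group mean
        mean_u = mean(data[split:])              # step a: upper-group mean
        # ASSUMED step b: re-anchor the index at the midpoint of the two means.
        midpoint = (mean_l + mean_u) / 2
        new_index = min(max(bisect_right(data, midpoint), 1), n - 1)
        if new_index == focus_index:             # step c: index unchanged
            break
        focus_index = new_index
    return focus_index, data[focus_index - 1]


rates = [0.01, 0.02, 0.03, 0.04, 0.05, 0.80, 0.90]  # hypothetical data
print(auto_focus(rates))
```

  • the returned value corresponds to taking the detection data at the reference focus index as the second focus threshold; averaging it with the preceding value, as also described, would be a one-line change.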
  • the mark of the second focus threshold is displayed in the graphical interface, and the second focus threshold distinguishes positive and negative samples: the black dots above the second focus threshold are negative samples, and the gray dots below the second focus threshold are positive samples.
  • the data processing device determines the second focus threshold according to the detection data of the sample, and divides the positive and negative samples based on the second focus threshold, so that the data analysis based on the positive and negative samples is more accurate.
  • whether the focus threshold is determined by the data processing device based on the detection data of the sample, or obtained by receiving the user's setting operation on the first focus threshold, the positive and negative samples can be reasonably divided based on the focus threshold, so that the accuracy of data analysis is higher.
  • after the data processing device divides the positive and negative samples based on the focus threshold, it performs sample feature analysis or machine learning model training based on the abnormal samples among the positive and negative samples, so as to analyze the sample data or train the model more accurately.
  • determining the cause of the sample anomaly includes performing sample feature analysis based on the positive and negative samples: statistical analysis methods such as weight of evidence (WOE), Pearson correlation analysis, and decision tree algorithms are used to analyze the feature data of the samples against their abnormality detection results, so as to obtain the degree of influence of the feature data on the detection results.
  • determining the cause of sample anomalies also includes using the divided positive and negative samples as input data to train machine learning models such as logistic regression, random forest, LGBM, Xgboost, and CatBoost, so as to obtain a sample anomaly prediction model and an importance ranking of the sample feature data.
  • the present disclosure does not limit the specific method for determining the cause of sample abnormality based on the positive and negative samples, which is only an exemplary description here.
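  • as one hedged illustration of the statistical route, the weight of evidence (WOE) of a categorical feature with respect to the positive/negative division can be computed as below. The formula is the standard one, WOE = ln(share of positives in the category / share of negatives in the category), with the information value (IV) summing (positive share − negative share) × WOE. The function name `woe_iv`, the additive smoothing of 0.5, and the sample data are assumptions for the example and do not come from the disclosure.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def woe_iv(feature_values: List[str], labels: List[str],
           smoothing: float = 0.5) -> Tuple[Dict[str, float], float]:
    """Return per-category WOE and the overall information value (IV)."""
    pos = Counter(v for v, y in zip(feature_values, labels) if y == "positive")
    neg = Counter(v for v, y in zip(feature_values, labels) if y == "negative")
    categories = sorted(set(feature_values))
    # Assumed smoothing keeps the logarithm finite when a category has no
    # positive or no negative samples.
    pos_total = sum(pos.values()) + smoothing * len(categories)
    neg_total = sum(neg.values()) + smoothing * len(categories)
    woe: Dict[str, float] = {}
    iv = 0.0
    for c in categories:
        p = (pos[c] + smoothing) / pos_total   # share of positives in category c
        q = (neg[c] + smoothing) / neg_total   # share of negatives in category c
        woe[c] = math.log(p / q)
        iv += (p - q) * woe[c]
    return woe, iv


equipment = ["A", "A", "A", "B", "B", "B"]     # hypothetical production equipment
labels = ["positive", "positive", "negative", "negative", "negative", "positive"]
woe, iv = woe_iv(equipment, labels)
print(woe, iv)
```

  • a larger IV suggests that the feature separates positive from negative samples more strongly, which is one way to read "degree of influence" here.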
  • the data processing device determines the focus threshold based on the sample detection data, and displays, in the sample distribution diagram of the graphical interface, the data display effect that distinguishes positive and negative samples based on the focus threshold. That is, the embodiments of the present disclosure can reasonably divide the positive and negative samples, so that the sample data can be analyzed, or a model trained, more accurately according to the divided positive and negative samples, and the determined cause of the sample abnormality or the resulting model is more accurate.
  • FIG. 7 shows another data processing method provided by the present disclosure.
  • the method may also include step S205.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold.
  • step S205 may be performed before step S203, or may be performed after step S203, which is not limited in the present disclosure.
  • FIG. 7 illustrates an example in which step S205 is performed before step S203. It can be understood that when step S205 is performed before step S203, the data processing device may filter the sample data based on the filter threshold, determine the focus threshold based on the detection data of the filtered samples, and divide the positive and negative samples based on the focus threshold, Based on the positive and negative samples, determine the cause of the abnormal sample.
  • the data processing device may filter the sample data based on the filter threshold, and re-determine the focus threshold based on the detection data of the filtered samples, and divide the positive and negative samples based on the re-determined focus threshold, Based on the positive and negative samples, the cause of the sample abnormality is determined.
  • the above filtering operation may include a setting operation and a selection operation.
  • the selection operation may include a frame selection operation.
  • each of the foregoing filtering thresholds may include one numerical value, or may include multiple numerical values, which is not limited in the present disclosure.
  • taking the case where the filtering threshold includes an abnormal rate threshold as an example, the abnormal rate indicates the ratio of the number of abnormal sub-samples in each sample to the total number of sub-samples included in the sample. Since the amount of sample data acquired by the data processing device is relatively large, the user can set the abnormal rate threshold, and the data processing device can filter the sample data based on the abnormal rate threshold set by the user, filtering out the samples whose abnormal rate is lower than the abnormal rate threshold. It can be understood that the reliability of sample analysis can be improved by deleting samples that have a low abnormal rate and no reference value.
  • taking the case where the filtering threshold includes an arrival rate threshold as an example, the arrival rate indicates the ratio of the number of sub-samples actually detected in each sample to the total number of sub-samples included in the sample. Since some sub-samples in each sample may not reach the detection site for detection, the actual number of sub-samples detected may be less than the total number of sub-samples included in the sample. Therefore, a sample's low abnormal rate may simply result from some of its sub-samples not having been detected.
  • each piece of glass can be cut into multiple panels after various processes, and each panel then enters the inspection station for defect inspection.
  • the arrival rate of each glass is the ratio of the number of panels arriving at the detection site in each glass to the total number of cut panels, and the abnormality rate of each glass is the ratio of the number of detected abnormal panels to the total number of cut panels.
  • the user can set the arrival rate threshold based on experience (for example, the arrival rate threshold is 0.9), and the data processing device screens the sample data based on the arrival rate threshold set by the user, filtering out samples whose arrival rate is lower than the arrival rate threshold of 0.9.
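  • the two rates and the arrival-rate screening can be illustrated with a short sketch. The `Glass` record, its field names, and the sample figures are hypothetical; the ratios follow the definitions above (both are taken over the total number of cut panels), and samples below the 0.9 threshold are filtered out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glass:
    glass_id: str
    total_panels: int      # panels cut from this glass
    arrived_panels: int    # panels that reached the detection site
    abnormal_panels: int   # panels detected as abnormal

    @property
    def arrival_rate(self) -> float:
        return self.arrived_panels / self.total_panels

    @property
    def abnormality_rate(self) -> float:
        return self.abnormal_panels / self.total_panels


def filter_by_arrival_rate(samples: List[Glass], threshold: float = 0.9) -> List[Glass]:
    """Keep samples whose arrival rate is not lower than the threshold."""
    return [s for s in samples if s.arrival_rate >= threshold]


glasses = [
    Glass("G1", total_panels=100, arrived_panels=95, abnormal_panels=5),
    Glass("G2", total_panels=100, arrived_panels=60, abnormal_panels=2),
]
kept = filter_by_arrival_rate(glasses)
print([g.glass_id for g in kept])
```

  • in this made-up data, G2 has the lower abnormality rate only because 40 of its panels were never inspected, which is the situation the arrival rate threshold is meant to screen out.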
  • the user can set the production equipment threshold and the environmental parameter threshold, and the data processing device can filter the sample data based on the production equipment threshold and the environmental parameter threshold set by the user, filtering out the samples that do not meet the production equipment threshold and the environmental parameter threshold and keeping the samples that meet them.
  • the data processing device can improve the purity of diagnostic analysis data and improve the accuracy of data analysis by deleting samples that are useless for analysis.
  • in order to narrow the scope of sample analysis and improve the reliability of data analysis, the user can input a setting operation for the production equipment threshold and the environmental parameter threshold. In response to the user's setting operation, the display interface of the data processing device displays the filtered-out samples (the lightest gray dots in FIG. 8), and those samples are filtered out. The number and distribution of the samples change after filtering, so the focus threshold can be obtained again in conjunction with step S203.
  • the user can select the detection time threshold, and the data processing device deletes samples whose detection time meets the detection time threshold selected by the user based on the detection time threshold selected by the user.
  • samples whose detection time does not meet a user-selected detection time threshold can also be deleted.
  • the display interface of the data processing device displays the detection time threshold selected by the user, the samples whose detection time satisfies the user-selected detection time threshold are deleted, and the remaining samples are divided into positive and negative samples based on the focus threshold.
  • the user can select the generation time threshold, and the data processing device deletes, based on the generation time threshold selected by the user, the samples whose generation time meets that threshold.
  • samples whose generation time does not meet a user-selected generation time threshold can also be deleted.
  • the data processing device filters out the samples whose generation time does not meet the generation time threshold set by the user, displays the samples whose generation time meets it, and divides the filtered samples into positive and negative samples based on the focus threshold.
  • the data processing device may filter the sample data in turn based on multiple filtering thresholds set by the user.
  • the present disclosure does not limit the order in which the data processing device filters samples based on multiple filtering thresholds.
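  • filtering "in turn" with several thresholds can be sketched as a chain of keep-predicates. The helper `apply_filters`, the dictionary keys, and the threshold values are hypothetical. Because each filter here only removes samples independently of the others, the final result is the same in any order, which is consistent with the order not being limited.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Dict[str, Any]
Filter = Tuple[str, Callable[[Sample], bool]]


def apply_filters(samples: List[Sample], filters: List[Filter]) -> List[Sample]:
    """Apply each (name, keep-predicate) pair in turn and report what remains."""
    remaining = list(samples)
    for name, keep in filters:
        before = len(remaining)
        remaining = [s for s in remaining if keep(s)]
        print(f"{name}: kept {len(remaining)} of {before}")
    return remaining


samples = [
    {"id": "S1", "abnormality_rate": 0.20, "arrival_rate": 0.95, "equipment": "EQ1"},
    {"id": "S2", "abnormality_rate": 0.00, "arrival_rate": 0.99, "equipment": "EQ1"},
    {"id": "S3", "abnormality_rate": 0.15, "arrival_rate": 0.60, "equipment": "EQ1"},
    {"id": "S4", "abnormality_rate": 0.30, "arrival_rate": 0.97, "equipment": "EQ2"},
]
filters = [
    ("abnormality rate", lambda s: s["abnormality_rate"] >= 0.01),
    ("arrival rate", lambda s: s["arrival_rate"] >= 0.9),
    ("production equipment", lambda s: s["equipment"] == "EQ1"),
]
kept = apply_filters(samples, filters)
print([s["id"] for s in kept])
```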
  • the data processing device screens the sample data based on the filtering threshold, determines the focus threshold based on the detection data of the filtered samples, and displays, in the sample distribution diagram of the graphical interface, the data display effect that distinguishes positive and negative samples based on the focus threshold. That is, by screening the sample data, the embodiment of the present disclosure can filter out samples that have no reference value or that would affect the accuracy of the sample analysis results, which improves the reliability of the sample data and makes the sample analysis results more reliable. Moreover, by reasonably dividing the positive and negative samples, the sample data can be analyzed, or a model trained, more accurately according to the divided positive and negative samples, so that the determined cause of the sample abnormality or the model is more accurate.
  • FIG. 11 is another data processing method provided by an embodiment of the present disclosure. As shown in FIG. 11, the method includes the following steps:
  • the sample data includes feature data and detection data of the sample.
  • for step S1101, reference may be made to step S201, which will not be repeated here.
  • S1102. Determine a focus threshold based on the detection data of the sample.
  • the data processing device determines the focus threshold based on the detection data of the sample, which may include steps S11021-S11022.
  • the reference focus index may be a value corresponding to the reference focus threshold.
  • the reference focus index can be ⌈N/2⌉, where ⌈N/2⌉ indicates that N/2 is rounded up.
  • the reference focus index may be the index of the detection data closest to the mean value.
  • the embodiment of the present disclosure does not limit the specific method for determining the reference focus value.
  • the following embodiments are described by taking as an example the case where the reference focus index is ⌈N/2⌉ and the reference focus value is the ⌈N/2⌉-th detection data.
  • the data processing device may take the middle position of the total number of samples as the reference focus index FocusIndex. For example, if the total number of samples N is an even number, N/2 is taken as the reference focus index FocusIndex. If the total number of samples is odd, the middle position (N+1)/2, that is, N/2 rounded up, is taken as the reference focus index FocusIndex.
  • the AutoFocus algorithm may be used to determine the second focus threshold Focus.
  • in step S11022, determining the second focus threshold based on the reference focus value and the detection data of the N samples may include the following steps:
  • Step a: average the detection data that are less than or equal to the reference focus value among the detection data of the N samples to obtain the first mean Mean_l, and average the detection data that are greater than the reference focus value to obtain the second mean Mean_u.
  • the data processing device may divide the detection data of the N samples that are less than or equal to the reference focus value into a LowerGroup, and the detection data greater than the reference focus value into an UpperGroup.
  • the detection data in the LowerGroup are averaged to obtain the first mean value Mean_l.
  • the detection data in the UpperGroup are averaged to obtain the second mean value Mean_u.
  • Step c repeat step a and step b until the value of the reference focus index remains unchanged before and after the update, and determine the second focus threshold based on the detection data corresponding to the reference focus index among the detection data of N samples arranged in sequence.
  • the detection data corresponding to the reference focus index may be determined as the second focus threshold.
  • the detection data corresponding to the reference focus index and the previous detection data may also be averaged to obtain the second focus threshold.
  • the present disclosure does not limit the specific method for determining the second focus threshold based on the detection data corresponding to the reference focus index.
  • the data processing device may divide the samples into positive and negative samples based on the magnitude relationship between the detection data of the samples and the focus threshold.
  • the data processing device may, based on the magnitude relationship between the detection data of the samples and the focus threshold, classify samples with detection data greater than the focus threshold as negative samples, and classify samples with detection data less than the focus threshold as positive samples.
  • the data processing device may also classify samples with detection data greater than the focus threshold as positive samples, and classify samples with detection data smaller than the focus threshold as negative samples based on the magnitude relationship between the detection data of the samples and the focus threshold.
  • the data processing device may, based on whether the detection data of the samples is within the numerical range, classify samples with detection data within the numerical range as negative samples, and classify samples with detection data outside the numerical range as positive samples.
  • the data processing device may also classify samples with detection data within the numerical range as positive samples and samples with detection data outside the numerical range as negative samples based on whether the detection data of the samples is within the numerical range.
  • after the data processing device divides the positive and negative samples based on the focus threshold, it performs sample feature analysis or machine learning model training based on the abnormal samples among the positive and negative samples, so as to analyze the sample data or train the model more accurately.
  • determining the cause of the sample anomaly includes performing sample feature analysis based on the positive and negative samples: statistical analysis methods such as weight of evidence (WOE), Pearson correlation analysis, and decision tree algorithms are used to analyze the feature data of the samples against their abnormality detection results, so as to obtain the degree of influence of the feature data on the detection results.
  • determining the cause of sample anomalies also includes using the divided positive and negative samples as input data to train machine learning models such as logistic regression, random forest, LGBM, Xgboost, and CatBoost, so as to obtain a sample anomaly prediction model and an importance ranking of the sample feature data.
  • the data processing device determines the focus threshold based on the detection data of the sample. Based on the focus threshold, the positive and negative samples can be reasonably divided, so that the sample data can be analyzed, or a model trained, more accurately according to the divided positive and negative samples, and the determined cause of the sample anomaly or the model is more accurate.
  • FIG. 13 is another data processing method provided by the present disclosure.
  • the method may further include step S1105.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold.
  • step S1105 may be performed before step S1102, or may be performed after step S1102, which is not limited in the present disclosure.
  • FIG. 13 illustrates an example in which step S1105 is performed before step S1102. It can be understood that when step S1105 is performed before step S1102, the data processing device may filter the sample data based on the filter threshold, determine the focus threshold based on the detection data of the filtered samples, and divide the positive and negative samples based on the focus threshold, Based on the positive and negative samples, determine the cause of the abnormal sample.
  • the data processing device may filter the sample data based on the filter threshold, and re-determine the focus threshold based on the detection data of the filtered samples, and divide the positive and negative samples based on the re-determined focus threshold, Based on the positive and negative samples, the cause of the sample abnormality is determined.
  • for step S1105, reference may be made to step S205, which will not be repeated here.
  • the data processing device may filter the sample data in turn based on multiple filtering thresholds set by the user.
  • the present disclosure does not limit the order in which the data processing device filters samples based on multiple filtering thresholds.
  • the data processing device filters the sample data based on the filter threshold, determines the focus threshold based on the detection data of the filtered samples, and divides positive and negative samples based on the focus threshold. That is, by screening the sample data, the embodiment of the present disclosure can filter out samples that have no reference value or that would affect the accuracy of the sample analysis results, so as to improve the reliability of the sample data and make the sample analysis results more reliable. Moreover, by reasonably dividing the positive and negative samples, the sample data can be analyzed, or a model trained, more accurately according to the divided positive and negative samples, so that the determined cause of the sample abnormality or the model is more accurate.
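  • the order "filter, then determine the focus threshold, then divide" can be sketched end to end. This is an illustrative simplification, not the disclosed method: the median of the filtered detection data stands in for the focus threshold, samples above it are treated as negative (one of the two permitted directions), and the function name `run_pipeline` and the data are hypothetical.

```python
from statistics import median
from typing import Any, Dict, List, Tuple

Sample = Dict[str, Any]


def run_pipeline(samples: List[Sample],
                 min_arrival_rate: float) -> Tuple[float, List[str], List[str]]:
    """Filter by arrival rate, derive a focus threshold, then divide samples."""
    # Filtering step: drop samples whose arrival rate is below the threshold.
    kept = [s for s in samples if s["arrival_rate"] >= min_arrival_rate]
    # Focus threshold step (simplified): median of the remaining detection data.
    focus = median(s["abnormality_rate"] for s in kept)
    # Division step: assumed direction, greater than the threshold -> negative.
    positives = [s["id"] for s in kept if s["abnormality_rate"] <= focus]
    negatives = [s["id"] for s in kept if s["abnormality_rate"] > focus]
    return focus, positives, negatives


samples = [
    {"id": "a", "arrival_rate": 0.95, "abnormality_rate": 0.02},
    {"id": "b", "arrival_rate": 0.50, "abnormality_rate": 0.01},
    {"id": "c", "arrival_rate": 0.92, "abnormality_rate": 0.10},
    {"id": "d", "arrival_rate": 0.99, "abnormality_rate": 0.30},
]
print(run_pipeline(samples, min_arrival_rate=0.9))
```

  • because the threshold is computed after filtering, removing sample "b" changes the threshold itself, which is why the focus threshold is re-determined when filtering happens afterwards.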
  • the embodiment of the present disclosure also provides a data processing device.
  • FIG. 14 is a structural diagram of a data processing device provided by an embodiment of the present disclosure.
  • the data processing device 140 is configured to execute the data processing method described in any one of the above embodiments.
  • the data processing device 140 may include: an acquisition module 141 , a display module 142 , a determination module 143 and a screening module 144 .
  • the obtaining module 141 is configured to obtain sample data in response to user input operations on the graphical interface.
  • the sample data includes feature data and detection data of the sample.
  • the display module 142 is configured to display a sample distribution graph on a graphical interface based on the sample data.
  • the acquiring module 141 is further configured to acquire a focus threshold for dividing positive and negative samples.
  • the display module 142 is further configured to display a focus threshold mark in the sample distribution diagram of the graphical interface based on the focus threshold acquired by the acquisition module 141, and to distinguish the data display effect of positive and negative samples based on the focus threshold.
  • the focus threshold is determined based on the detection data of the sample.
  • the determination module 143 is configured to determine the cause of the sample abnormality based on the positive and negative samples.
  • the feature data of the sample includes at least one of product model, detection site, abnormal type, arrival rate, production equipment, environmental parameters, detection time, or generation time.
  • the detection data of the sample includes at least one of abnormality rate or measurement parameters.
  • the focus threshold includes a second focus threshold.
  • the acquisition module 141 is specifically configured to arrange the detection data of the N samples in order from small to large, and use the median or mean value of the detection data of the N samples as a reference focus value; and to determine a second focus threshold based on the reference focus value and the detection data of the N samples.
  • the display module 142 is specifically configured to display the second focus threshold mark in the sample distribution diagram of the graphical interface, and distinguish the data display effect of positive and negative samples based on the second focus threshold.
  • the acquisition module 141 is also specifically configured to perform the following steps: step a, average the detection data that are less than or equal to the reference focus value among the detection data of the N samples to obtain the first mean value Mean_l, and average the detection data that are greater than the reference focus value among the detection data of the N samples to obtain the second mean value Mean_u. Step b.
  • the focus threshold includes a first focus threshold, and there may be one or more first focus thresholds.
  • the acquiring module 141 is further specifically configured to receive a user's setting operation on the first focus threshold.
  • the display module 142 is further configured to display the first focus threshold mark in the sample distribution diagram of the graphical interface, and distinguish the data display effect of positive and negative samples based on the first focus threshold.
  • the first focus threshold includes a first value
  • the display module 142 is specifically configured to distinguish the data display effect of positive and negative samples based on the magnitude relationship between the detection data of the sample and the first value.
  • the first focus threshold includes a second value and a third value
  • the second value is smaller than the third value
  • the display module 142 is specifically configured to distinguish the data display effect of positive and negative samples based on whether the detection data of the sample is greater than the second value and smaller than the third value.
  • the filtering module 144 is configured to filter the sample data based on the user's filtering operation on the filtering threshold.
  • the display module 142 is also used to display the distribution diagram of the filtered samples on the graphical interface.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the ratio of the number of abnormal sub-samples in a sample to the total number of sub-samples in that sample; the arrival rate indicates the ratio of the number of sub-samples actually detected in a sample to the total number of sub-samples in that sample.
  • the filter operation includes a setting operation and a selection operation.
  • the data processing device 140 provided in the embodiment of the present disclosure includes but is not limited to the above modules.
  • the embodiment of the present disclosure also provides a data processing device.
  • FIG. 15 is a structural diagram of a data processing device provided by an embodiment of the present disclosure.
  • the data processing device 150 is configured to execute the data processing method in any one of the above embodiments.
  • the data processing device 150 may include: an acquisition module 151 , a determination module 152 , a division module 153 and a screening module 154 .
  • the acquiring module 151 is configured to acquire sample data, and the sample data includes characteristic data and detection data of the sample.
  • the determination module 152 is configured to determine a focus threshold based on the detection data of the sample.
  • the division module 153 is configured to divide the samples into positive and negative samples based on the focus threshold determined by the determination module.
  • the determination module 152 is further configured to determine the cause of the sample abnormality based on the positive and negative samples.
  • the feature data of the sample includes at least one of product model, detection site, abnormality type, arrival rate, production equipment, environmental parameters, detection time, or generation time.
  • the detection data of the sample includes at least one of abnormality rate or measurement parameters.
  • the focus threshold includes a second focus threshold, the number of samples is N, and the determination module 152 is specifically configured to: arrange the detection data of N samples in order from small to large, and arrange the detection data of N samples The median or mean value of is used as a reference focus value; based on the reference focus value and the detection data of N samples, a second focus threshold is determined.
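  • the determination module 152 is further specifically configured to perform the following steps. Step a: average the detection data of the N samples that are less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and average the detection data that are greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$. Step b: take the absolute difference between each of the N sorted detection data and $\mathrm{Mean}_l$ to obtain the first mean difference $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$, take the absolute difference between each of the N sorted detection data and $\mathrm{Mean}_u$ to obtain the second mean difference $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$, compare the two element by element, determine the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$, update the reference focus index to $k$, and update the reference focus value to the $k$-th of the sorted detection data. Step c: repeat step a and step b until the reference focus index no longer changes, then determine the second focus threshold based on the detection data at that reference focus index in the sorted data.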
  • the screening module 154 is configured to screen the sample data based on a filtering threshold.
  • the filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the ratio of the number of abnormal sub-samples in a sample to the total number of sub-samples in that sample; the arrival rate indicates the ratio of the number of sub-samples actually detected in a sample to the total number of sub-samples in that sample.
  • the data processing device 160 includes a memory 161 and a processor 162; the memory 161 and the processor 162 are coupled; the memory 161 is used to store computer program codes, and the computer program codes include computer instructions.
  • when the processor 162 executes the computer instructions, the data processing device 160 is caused to execute the steps performed by the data processing device in the method flows shown in the above method embodiments.
  • the acquisition module 141 , the display module 142 , the determination module 143 and the screening module 144 can be implemented by the processor 162 shown in FIG. 16 calling computer program codes in the memory 161 .
  • for the specific execution process, refer to the description of the data processing method shown in FIG. 2, FIG. 3, and FIG. 7, which is not repeated here.
  • the data processing device 170 includes a memory 171 and a processor 172; the memory 171 and the processor 172 are coupled; the memory 171 is used to store computer program codes, and the computer program codes include computer instructions.
  • when the processor 172 executes the computer instructions, the data processing device 170 is caused to execute the steps performed by the data processing device in the method flows shown in the above method embodiments.
  • the acquisition module 151 , the determination module 152 , the division module 153 and the screening module 154 can be realized by calling the computer program code in the memory 171 by the processor 172 shown in FIG. 17 .
  • for the specific execution process, refer to the description of the data processing method shown in FIG. 11, FIG. 12, and FIG. 13, which is not repeated here.
  • Some embodiments of the present disclosure provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium), where computer program instructions are stored in the computer-readable storage medium, and when the computer program instructions are run on a processor , so that the processor executes one or more steps in the data processing method described in any one of the above embodiments.
  • the above-mentioned computer-readable storage medium may include, but is not limited to: magnetic storage devices (for example, hard disks, floppy disks, or magnetic tapes), optical disks (for example, a CD (Compact Disk) or a DVD (Digital Versatile Disk)), smart cards, and flash memory devices (for example, an EPROM (Erasable Programmable Read-Only Memory), a card, a stick, or a key drive).
  • Various computer-readable storage media described in this disclosure can represent one or more devices and/or other machine-readable storage media for storing information.
  • the term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.
  • Some embodiments of the present disclosure also provide a computer program product.
  • the computer program product includes computer program instructions. When the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute one or more steps in the data processing method as described in the above-mentioned embodiments.
  • Some embodiments of the present disclosure also provide a computer program.
  • when the computer program is executed on a computer, it causes the computer to execute one or more steps in the data processing method described in the above-mentioned embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

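A data processing method, comprising: acquiring sample data in response to a user's input operation on a graphical interface (S201), the sample data including feature data and detection data of samples; displaying a sample distribution diagram on the graphical interface based on the sample data (S202); acquiring a focus threshold used to divide positive and negative samples, displaying a focus threshold mark in the sample distribution diagram on the graphical interface, and distinguishing the data display effect of positive and negative samples based on the focus threshold (S203), the focus threshold being determined based on the detection data of the samples; and determining the cause of sample abnormality based on the positive and negative samples (S204).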

Description

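Data processing method and device
Technical Field
The present disclosure relates to the field of data processing, and in particular to a data processing method and device.
Background
In data analysis, sample data is generally preprocessed and the sample distribution is labeled, so that subsequent sample feature analysis or machine learning model training is more efficient and more accurate.
Summary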
一方面,提供一种数据处理方法。该数据处理方法包括:首先,响应于用户在图形界面的输入操作,获取样本数据,该样本数据包括样本的特征数据和检测数据;然后,基于该样本数据,在图形界面显示样本分布图;再获取用于划分正负样本的聚焦阈值,在图形界面的样本分布图中显示该聚焦阈值标记,并基于该聚焦阈值区分正负样本的数据显示效果;其中,该聚焦阈值基于样本的检测数据确定;最后,基于该正负样本,确定样本异常的原因。
在一些实施例中,上述聚焦阈值包括第一聚焦阈值,该第一聚焦阈值为一个或多个,上述获取用于划分正负样本的聚焦阈值,在图形界面的样本分布图中显示聚焦阈值标记,并基于聚焦阈值区分正负样本的数据显示效果,包括:接收用户对第一聚焦阈值的设定操作,在图形界面的样本分布图中显示第一聚焦阈值标记,并基于该第一聚焦阈值区分正负样本的数据显示效果。
另一些实施例中,上述第一聚焦阈值包括第一数值,上述基于第一聚焦阈值区分正负样本的数据显示效果,包括:基于样本的检测数据与第一数值的大小关系区分正负样本的数据显示效果。
另一些实施例中,上述第一聚焦阈值包括第二数值和第三数值,该第二数值小于第三数值,上述基于第一聚焦阈值区分正负样本的数据显示效果,包括:基于样本的检测数据是否大于第二数值且小于第三数值区分正负样本的数据显示效果。
另一些实施例中,上述聚焦阈值还包括第二聚焦阈值,样本的数量为N,上述获取用于划分正负样本的聚焦阈值,在图形界面的样本分布图中显示聚焦阈值标记,并基于聚焦阈值区分正负样本的数据显示效果,包括:将N个样本的检测数据按照从小到大依次排列,并将N个样本的检测数据的中位数或均值作为参考聚焦值;基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值;在图形界面的样本分布图中显示第二聚焦阈值标记,并基于第 二聚焦阈值区分正负样本的数据显示效果。
另一些实施例中,上述基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值,包括以下步骤:步骤a、将N个样本的检测数据中小于或等于参考聚焦值的检测数据求平均得到第一均值Mean l,将N个样本的检测数据中大于参考聚焦值的检测数据求平均得到第二均值Mean u;步骤b、将依次排列的N个样本的检测数据逐个与第一均值Mean l作差并取绝对值,得到第一均差DiffLowerMean=[l 1,l 2,l 3…,l N],将依次排列的N个样本的检测数据逐个与第二均值Mean u作差并取绝对值,得到第二均差DiffUpperMean=[u 1,u 2,u 3…,u N],逐个比较第一均差和第二均差,确定l i<u i的数量k,i=1,2,3,…,N,将参考聚焦索引更新为k,并在依次排列的N个样本的检测数据中,将参考聚焦值更新为第k个检测数据的值;步骤c、重复步骤a和步骤b,直至更新前后参考聚焦索引的值不变,在依次排列的N个样本的检测数据中,基于该参考聚焦索引对应的检测数据确定第二聚焦阈值。
另一些实施例中,上述方法还包括:基于用户对过滤阈值的过滤操作,对样本数据进行筛选,并在图形界面显示筛选后的样本的分布图。
另一些实施例中,上述过滤阈值包括异常率阈值、到达率阈值、生产设备阈值、环境参数阈值、检测时间阈值、或生成时间阈值中的至少一种;上述样本包括多个子样本,异常率用于指示每个样本中异常子样本的数量占该样本包括的子样本总数量的比例;到达率用于指示每个样本中实际检测的子样本数量占该样本包括的子样本总数量的比例。
另一些实施例中,上述过滤操作包括设定操作和选择操作。
另一些实施例中,上述样本的特征数据包括产品型号、检测站点、异常类型、到达率、生产设备、环境参数、检测时间、或生成时间中的至少一种。
另一些实施例中,上述样本的检测数据包括异常率或测量参数中的至少一种。
另一方面,提供一种数据处理方法,该方法包括:首先,获取样本数据,该样本数据包括样本的特征数据和检测数据;然后,基于样本的检测数据,确定聚焦阈值;再基于该聚焦阈值,将样本划分为正负样本;最后,基于该正负样本,确定样本异常的原因。
在一些实施例中,上述聚焦阈值包括第二聚焦阈值,样本的数量为N,上述根据样本的检测数据,确定聚焦阈值,包括:将N个样本的检测数据按照从小到大依次排列,N个样本的检测数据的中位数或均值作为参考聚焦值;基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值。
另一些实施例中,上述基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值,包括以下步骤:步骤a、将N个样本的检测数据中小于或等于参考聚焦值的检测数据求平均得到第一均值Mean l,将N个样本的检测数据中大于参考聚焦值的检测数据求平均得到第二均值Mean u;步骤b、将依次排列的N个样本的检测数据逐个与第一均值Mean l作差并取绝对值,得到第一均差DiffLowerMean=[l 1,l 2,l 3…,l N],将依次排列的N个样本的检测数据逐个与第二均值Mean u作差并取绝对值,得到第二均差DiffUpperMean=[u 1,u 2,u 3…,u N],逐个比较第一均差和第二均差,确定l i<u i的数量k,i=1,2,3,…,N,将参考聚焦索引更新为k,并在依次排列的N个样本的检测数据中,将参考聚焦值更新为第k个检测数据的值;步骤c、重复步骤a和步骤b,直至更新前后参考聚焦索引的值不变,在依次排列的N个样本的检测数据中,基于该参考聚焦索引对应的检测数据确定第二聚焦阈值。
另一些实施例中,上述方法还包括:基于过滤阈值,对样本数据进行筛选。
另一些实施例中,上述过滤阈值包括异常率阈值、到达率阈值、生产设备阈值、环境参数阈值、检测时间阈值、或生成时间阈值中的至少一种;上述样本包括多个子样本,异常率用于指示每个样本中异常子样本的数量占该样本包括的子样本总数量的比例;到达率用于指示每个样本中实际检测的子样本数量占该样本包括的子样本总数量的比例。
另一些实施例中,上述样本的特征数据包括产品型号、检测站点、异常类型、到达率、生产设备、环境参数、检测时间、或生成时间中的至少一种。
另一些实施例中,上述样本的检测数据包括异常率或测量参数中的至少一种。
又一方面,提供一种数据处理装置,包括:获取模块,用于响应于用户在图形界面的输入操作,获取样本数据,样本数据包括样本的特征数据和检测数据;显示模块,用于基于获取模块获取的样本数据,在图形界面显示样本分布图;获取模块,还用于获取用于划分正负样本的聚焦阈值;显示模块,还用于基于获取模块获取的聚焦阈值,在图形界面的样本分布图中显示聚焦阈值标记,并基于聚焦阈值区分正负样本的数据显示效果;其中,聚焦阈值基于样本的检测数据确定;确定模块,用于基于正负样本,确定样本异常的原因。
在一些实施例中,上述聚焦阈值包括第一聚焦阈值,第一聚焦阈值为一个或多个;上述获取模块,具体还用于接收用户对第一聚焦阈值的设定操作; 上述显示模块,还用于在图形界面的样本分布图中显示第一聚焦阈值标记,并基于第一聚焦阈值区分正负样本的数据显示效果。
另一些实施例中,上述第一聚焦阈值包括第一数值,上述显示模块,具体用于:基于样本的检测数据与第一数值的大小关系区分正负样本的数据显示效果。
另一些实施例中,上述第一聚焦阈值包括第二数值和第三数值,第二数值小于第三数值,显示模块,具体还用于:基于样本的检测数据是否大于第二数值且小于第三数值区分正负样本的数据显示效果。
另一些实施例中,上述聚焦阈值包括第二聚焦阈值,样本的数量为N,获取模块,具体用于:将N个样本的检测数据按照从小到大依次排列,并将N个样本的检测数据的中位数或均值作为参考聚焦值;基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值;在图形界面的样本分布图中显示第二聚焦阈值标记,并基于第二聚焦阈值区分正负样本的数据显示效果。
另一些实施例中,上述获取模块,具体还用于执行以下步骤:步骤a、将N个样本的检测数据中小于或等于参考聚焦值的检测数据求平均得到第一均值Mean l,将N个样本的检测数据中大于参考聚焦值的检测数据求平均得到第二均值Mean u;步骤b、将依次排列的N个样本的检测数据逐个与第一均值Mean l作差并取绝对值,得到第一均差DiffLowerMean=[l 1,l 2,l 3…,l N],将依次排列的N个样本的检测数据逐个与第二均值Mean u作差并取绝对值,得到第二均差DiffUpperMean=[u 1,u 2,u 3…,u N],逐个比较第一均差和第二均差,确定l i<u i的数量k,i=1,2,3,…,N,将参考聚焦索引更新为k,并在依次排列的N个样本的检测数据中,将参考聚焦值更新为第k个检测数据的值;步骤c、重复步骤a和步骤b,直至更新前后参考聚焦索引的值不变,在依次排列的N个样本的检测数据中,基于该参考聚焦索引对应的检测数据确定第二聚焦阈值。
另一些实施例中,上述数据处理装置还包括筛选模块;该筛选模块,用于基于用户对过滤阈值的过滤操作,对样本数据进行筛选;显示模块,还用于在图形界面显示筛选后的样本的分布图。
另一些实施例中,上述过滤阈值包括异常率阈值、到达率阈值、生产设备阈值、环境参数阈值、检测时间阈值、或生成时间阈值中的至少一种;样本包括多个子样本,异常率用于指示每个样本中异常子样本的数量占该样本包括的子样本总数量的比例;到达率用于指示每个样本中实际检测的子样本数量占该样本包括的子样本总数量的比例。
另一些实施例中,上述过滤操作包括设定操作和选择操作。
另一些实施例中,上述样本的特征数据包括产品型号、检测站点、异常类型、到达率、生产设备、环境参数、检测时间、或生成时间中的至少一种。
另一些实施例中,上述样本的检测数据包括异常率或测量参数中的至少一种。
又一方面,提供一种数据处理装置,该装置包括:获取模块,用于获取样本数据,样本数据包括样本的特征数据和检测数据;确定模块,用于基于样本的检测数据,确定聚焦阈值;划分模块,用于基于聚焦阈值,将样本划分为正负样本;确定模块,还用于基于正负样本,确定样本异常的原因。
在一些实施例中,上述聚焦阈值包括第二聚焦阈值,样本的数量为N,确定模块,具体用于:将N个样本的检测数据按照从小到大依次排列,并将N个样本的检测数据的中位数或均值作为参考聚焦值;基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值。
另一些实施例中,上述确定模块,具体还用于执行以下步骤:步骤a、将N个样本的检测数据中小于或等于参考聚焦值的检测数据求平均得到第一均值Mean l,将N个样本的检测数据中大于参考聚焦值的检测数据求平均得到第二均值Mean u;步骤b、将依次排列的N个样本的检测数据逐个与第一均值Mean l作差并取绝对值,得到第一均差DiffLowerMean=[l 1,l 2,l 3…,l N],将依次排列的N个样本的检测数据逐个与第二均值Mean u作差并取绝对值,得到第二均差DiffUpperMean=[u 1,u 2,u 3…,u N],逐个比较第一均差和第二均差,确定l i<u i的数量k,i=1,2,3,…,N,将参考聚焦索引更新为k,并在依次排列的N个样本的检测数据中,将参考聚焦值更新为第k个检测数据的值;步骤c、重复步骤a和步骤b,直至更新前后参考聚焦索引的值不变,在依次排列的N个样本的检测数据中,基于该参考聚焦索引对应的检测数据确定第二聚焦阈值。
另一些实施例中,上述数据处理装置还包括筛选模块,该筛选模块,用于:基于过滤阈值,对样本数据进行筛选。
另一些实施例中,上述过滤阈值包括异常率阈值、到达率阈值、生产设备阈值、环境参数阈值、检测时间阈值、或生成时间阈值中的至少一种;样本包括多个子样本,异常率用于指示每个样本中异常子样本的数量占该样本包括的子样本总数量的比例;到达率用于指示每个样本中实际检测的子样本数量占该样本包括的子样本总数量的比例。
另一些实施例中,上述样本的特征数据包括产品型号、检测站点、异常类型、到达率、生产设备、环境参数、检测时间、或生成时间中的至少一种。
另一些实施例中,上述样本的检测数据包括异常率或测量参数中的至少 一种。
又一方面,提供一种数据处理装置,该装置包括存储器和处理器;存储器和处理器耦合;存储器用于存储计算机程序代码,计算机程序代码包括计算机指令;其中,当处理器执行所述计算机指令时,使得该装置执行如上述任一实施例所述的数据处理方法中的一个或多个步骤。
又一方面,提供一种非瞬态的计算机可读存储介质,所述计算机可读存储介质存储有计算机程序指令,所述计算机程序指令在处理器上运行时,使得所述处理器执行如上述任一实施例所述的数据处理方法中的一个或多个步骤。
又一方面,提供一种计算机程序产品,该计算机程序产品包括计算机程序,在计算机上执行所述计算机程序指令时,所述计算机程序指令使计算机执行如上述任一实施例所述的数据处理方法中的一个或多个步骤。
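Brief Description of the Drawings
To explain the technical solutions of the present disclosure more clearly, the drawings needed for some embodiments are briefly introduced below. The drawings described below are only drawings of some embodiments of the present disclosure, and those of ordinary skill in the art can derive other drawings from them. In addition, the drawings may be regarded as schematic and do not limit the actual size of the products, the actual flow of the methods, or the actual timing of the signals involved in the embodiments.
FIG. 1 is a structural diagram of a data processing device according to some embodiments;
FIG. 2 is a flowchart of a data processing method according to some embodiments;
FIG. 3 is a display effect diagram of a data processing method according to some embodiments;
FIG. 4 is another display effect diagram of a data processing method according to some embodiments;
FIG. 5 is another flowchart of a data processing method according to some embodiments;
FIG. 6 is yet another display effect diagram of a data processing method according to some embodiments;
FIG. 7 is yet another flowchart of a data processing method according to some embodiments;
FIG. 8 is yet another display effect diagram of a data processing method according to some embodiments;
FIG. 9 is yet another display effect diagram of a data processing method according to some embodiments;
FIG. 10 is yet another display effect diagram of a data processing method according to some embodiments;
FIG. 11 is yet another flowchart of a data processing method according to some embodiments;
FIG. 12 is yet another flowchart of a data processing method according to some embodiments;
FIG. 13 is yet another flowchart of a data processing method according to some embodiments;
FIG. 14 is another structural diagram of a data processing device according to some embodiments;
FIG. 15 is yet another structural diagram of a data processing device according to some embodiments;
FIG. 16 is yet another structural diagram of a data processing device according to some embodiments;
FIG. 17 is yet another structural diagram of a data processing device according to some embodiments.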
Detailed Description
下面将结合附图,对本公开一些实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。基于本公开所提供的实施例,本领域普通技术人员所获得的所有其他实施例,都属于本公开保护的范围。
除非上下文另有要求,否则,在整个说明书和权利要求书中,术语“包括(comprise)”及其其他形式例如第三人称单数形式“包括(comprises)”和现在分词形式“包括(comprising)”被解释为开放、包含的意思,即为“包含,但不限于”。在说明书的描述中,术语“一个实施例(one embodiment)”、“一些实施例(some embodiments)”、“示例性实施例(exemplary embodiments)”、“示例(example)”、“特定示例(specific example)”或“一些示例(some examples)”等旨在表明与该实施例或示例相关的特定特征、结构、材料或特性包括在本公开的至少一个实施例或示例中。上述术语的示意性表示不一定是指同一实施例或示例。此外,所述的特定特征、结构、材料或特点可以以任何适当方式包括在任何一个或多个实施例或示例中。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本公开实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
在描述一些实施例时,可能使用了“耦接”和“连接”及其衍伸的表达。例如,描述一些实施例时可能使用了术语“连接”以表明两个或两个以上部件彼此间有直接物理接触或电接触。又如,描述一些实施例时可能使用了术语“耦接”以表明两个或两个以上部件有直接物理接触或电接触。然而,术语“耦接”或“通信耦合(communicatively coupled)”也可能指两个或两个以上部件彼此间并无直接接触,但仍彼此协作或相互作用。这里所公开的实施例并不必然限制于本文内容。
“A、B和C中的至少一个”与“A、B或C中的至少一个”具有相同含义,均包括以下A、B和C的组合:仅A,仅B,仅C,A和B的组合,A和C的组合,B和C的组合,及A、B和C的组合。
“A和/或B”,包括以下三种组合:仅A,仅B,及A和B的组合。
如本文中所使用,根据上下文,术语“如果”任选地被解释为意思是“当……时”或“在……时”或“响应于确定”或“响应于检测到”。类似地,根据上下文,短语“如果确定……”或“如果检测到[所陈述的条件或事件]”任选地被解释为是指“在确定……时”或“响应于确定……”或“在检测到[所陈述的条件或事件]时”或“响应于检测到[所陈述的条件或事件]”。
本文中“适用于”或“被配置为”的使用意味着开放和包容性的语言,其不排除适用于或被配置为执行额外任务或步骤的设备。
另外,“基于”或“根据”的使用意味着开放和包容性,因为“基于”或“根据”一个或多个所述条件或值的过程、步骤、计算或其他动作在实践中可以基于额外条件或超出所述的值。
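At present, in fields such as semiconductors and display panels, products may have various defects owing to factors such as production processes and production equipment. To meet growing production demand and improve product yield, it is necessary to analyze why defective products become defective.
In data analysis, sample data is generally preprocessed and the sample distribution is labeled, so that subsequent sample feature analysis or machine learning model training is more efficient and more accurate.
Typically, sample data analysis relies mainly on manual effort to locate the cause of an abnormality, so both timeliness and accuracy are severely limited and can hardly meet growing production demand. To improve the efficiency and accuracy of data analysis, the cause of an abnormality can be determined with machine learning algorithms. However, if all samples are analyzed regardless of their abnormality rate, the data volume may be too large and slow the algorithm down, and a large number of samples with very low abnormality rates may reduce the accuracy of the identified cause.
To improve the accuracy of data analysis, embodiments of the present disclosure provide a data processing method. While a data analysis task runs, the method shows the distribution of the sample data on a graphical interface, screens the sample data through threshold settings, and divides positive and negative samples reasonably, which makes the data analysis more accurate.
The data processing method provided by the embodiments may be applied to a general-purpose data analysis platform (for example, a machine learning platform) or to a scenario-specific data analysis platform (for example, a production data analysis system).
The method is executed by a data processing device, which may be a terminal device or a server. The embodiments do not limit the specific form of the data processing device; this is only an example.
As shown in FIG. 1, the data processing device 100 includes at least one processor 101, a memory 102, a transceiver 103, and a communication bus 104.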
下面结合图1对该数据处理装置的各个构成部件进行具体的介绍:
处理器101是数据处理装置的控制中心,可以是一个处理器,也可以是多个处理元件的统称。例如,处理器101是一个中央处理器(central processing unit,CPU),也可以是特定集成电路(application specific integrated circuit,ASIC),或者是被配置成实施本公开实施例的一个或多个集成电路。
其中,处理器101可以通过运行或执行存储在存储器102内的软件程序,以及调用存储在存储器102内的数据,执行数据处理装置的各种功能。
在具体的实现中,作为一种实施例,处理器101可以包括一个或多个CPU,例如图1中所示的CPU0和CPU1。
在具体实现中,作为一种实施例,数据处理装置可以包括多个处理器,例如图1中所示的处理器101和处理器105。这些处理器中的每一个可以是一个单核处理器(single-CPU),也可以是一个多核处理器(multi-CPU)。这里的处理器可以指一个或多个检测设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。
存储器102可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器102可以是独立存在,通过通信总线104与处理器101相连接。存储器102也可以和处理器101集成在一起。
其中,所述存储器102用于存储执行本公开方案的软件程序,并由处理器101来控制执行。
收发器103,用于与其他通信装置之间进行通信。当然,收发器103还可以用于与通信网络通信,如以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。收发器103可以包括接收单元实现接收功能,以及发送单元实现发送功能。
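The communication bus 104 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in FIG. 1, which does not mean that there is only one bus or one type of bus.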
图1中示出的数据处理装置结构并不构成对数据处理装置的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
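With reference to FIG. 1, FIG. 2 shows a data processing method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method includes the following steps:
S201. Acquire sample data in response to a user's input operation on the graphical interface.
The sample data includes feature data and detection data of the samples.
Optionally, the detection data of each sample may be the degree of abnormality of some event. In a production process, the detection data of each sample may be the abnormality rate of the product, which indicates the ratio of the number of abnormal sub-samples in a sample to the total number of sub-samples in that sample. The detection data may also be a measurement parameter of the sample, such as its voltage, current, or power.
For example, take a sheet of glass (GLASS) as the sample: after the various processes, each glass can be cut into multiple panels, and each panel then enters a detection site for defect detection. The detection data may be the abnormality rate Ratio of the sample, that is, the ratio of the number of defective panels in a glass to the total number of panels cut from that glass.
Optionally, the feature data of a sample may include, but is not limited to, feature parameters such as product model, detection site, abnormality type, generation time, production equipment, environmental parameters, detection time, and arrival rate.
For example, each sample may include multiple sub-samples, and the arrival rate of a sample indicates the ratio of the number of sub-samples actually detected in the sample to the total number of sub-samples in that sample.
For example, take glass as the sample and panel as the sub-sample: after the various processes, each glass can be cut into multiple panels, and each panel then enters a detection site for defect detection. The arrival rate of each sample is the ratio of the number of panels of a glass that reach the detection site to the total number of panels cut from that glass.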
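The two ratios defined above can be written down directly. The following is a minimal sketch, assuming a hypothetical `GlassSample` record whose field names are not from the source; it only restates the definitions of abnormality rate and arrival rate for the glass/panel example.

```python
from dataclasses import dataclass


@dataclass
class GlassSample:
    """One sample (a glass) made of sub-samples (panels). Field names are hypothetical."""
    glass_id: str
    total_panels: int      # panels cut from the glass
    inspected_panels: int  # panels that reached the detection site
    defective_panels: int  # panels found to be defective

    @property
    def abnormality_rate(self) -> float:
        # defective panels / total panels cut from the glass
        return self.defective_panels / self.total_panels

    @property
    def arrival_rate(self) -> float:
        # panels that reached the detection site / total panels cut from the glass
        return self.inspected_panels / self.total_panels


sample = GlassSample("GlassID 1", total_panels=100, inspected_panels=90, defective_panels=9)
print(sample.abnormality_rate, sample.arrival_rate)
```

Both ratios share the same denominator (all panels cut from the glass), which is why a low arrival rate can make the abnormality rate look lower than it really is.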
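Optionally, the abnormality types of a sample include, but are not limited to, oil stains, corrosion, and bubbles. When analyzing the cause of sample abnormality, the embodiments may analyze samples of the same abnormality type.
Optionally, the generation time of a sample may be its production time or its factory-shipment time.
Optionally, the environmental parameters of a sample include the process parameters of sample processing and parameters such as the temperature and pressure of the environment in which the sample is processed.
For example, acquiring sample data in response to the user's input operation on the graphical interface may include: the data processing device receives the user's setting operations for feature data such as product model, detection site, generation time, production equipment, and environmental parameters entered on the graphical interface and, in response to these setting operations, acquires the sample data.
For example, acquiring sample data in response to the user's input operation on the graphical interface may also include: the data processing device receives the user's operation of uploading a file (such as a csv file) and, in response to that operation, acquires the sample data.
Optionally, the ways of acquiring sample data include manual import by the user, batch import, and real-time import. In manual import, the data processing device receives the user's operation of uploading a file (such as a csv file) and acquires the sample data in response, so users can use sample data they collected themselves as the sample set for abnormality diagnosis. Batch import may call the HDFS API or address to import data once or periodically in batches. Real-time import may use kafka and ETL tools to import data from the data source into the data processing device in real time. The embodiments do not limit the specific way in which the data processing device acquires sample data; this is only an example.
Optionally, the abnormality rate Ratio and the measurement parameter Qtest may serve as indicators for judging sample abnormality, and the production equipment, environmental parameters, and so on of the sample may be treated as possible causes of the abnormality.
For example, suppose the feature data of the samples includes the detection site Check Step, the defect type Defect_Name, the arrival rate Input_Ratio, and the generation time END_TIME, and the detection data includes the abnormality rate. Table 1 shows the sample set for defect type Defect_code1.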
Table 1
GlassID | Check Step | Defect_Name | Ratio | Input_Ratio | END_TIME
GlassID 1 | Check Step1 | Defect_code1 | 0.088 | 0.953 | 2021-01-24 08:25:03
GlassID 2 | Check Step1 | Defect_code1 | 0.264 | 0.924 | 2021-01-28 07:43:11
GlassID n | Check Step1 | Defect_code1 | 0.011 | 0.837 | 2021-02-11 20:37:45
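As an illustration of the manual csv import mentioned earlier, the sketch below parses rows shaped like Table 1 with Python's standard `csv` module. The column names are taken from Table 1; the function name `load_samples`, the dictionary keys, and the assumption that the uploaded file is comma-separated with a header row are hypothetical.

```python
import csv
import io
from datetime import datetime

# Two rows copied from Table 1; a real upload would be read from a file instead.
CSV_TEXT = """GlassID,Check Step,Defect_Name,Ratio,Input_Ratio,END_TIME
GlassID 1,Check Step1,Defect_code1,0.088,0.953,2021-01-24 08:25:03
GlassID 2,Check Step1,Defect_code1,0.264,0.924,2021-01-28 07:43:11
"""


def load_samples(text):
    """Parse csv text into a list of sample dictionaries (hypothetical helper)."""
    samples = []
    for row in csv.DictReader(io.StringIO(text)):
        samples.append({
            "glass_id": row["GlassID"],
            "check_step": row["Check Step"],
            "defect_name": row["Defect_Name"],
            "ratio": float(row["Ratio"]),              # detection data: abnormality rate
            "input_ratio": float(row["Input_Ratio"]),  # feature data: arrival rate
            "end_time": datetime.strptime(row["END_TIME"], "%Y-%m-%d %H:%M:%S"),
        })
    return samples


samples = load_samples(CSV_TEXT)
# Points for a distribution diagram: x = generation time, y = abnormality rate.
points = sorted((s["end_time"], s["ratio"]) for s in samples)
```

The `points` list is the data a plotting layer would need for a diagram with generation time on the horizontal axis and abnormality rate on the vertical axis.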
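S202. Display a sample distribution diagram on the graphical interface based on the sample data.
For example, for the sample data shown in Table 1, a distribution diagram of those samples may be displayed on the graphical interface, with generation time on the horizontal axis and abnormality rate on the vertical axis.
S203. Acquire a focus threshold used to divide positive and negative samples, display a focus threshold mark in the sample distribution diagram on the graphical interface, and distinguish the data display effect of positive and negative samples based on the focus threshold.
The focus threshold is determined based on the detection data of the samples and divides the samples into positive samples and negative samples. Optionally, positive samples may be called normal samples or non-abnormal samples, and negative samples may be called defective samples or abnormal samples.
Optionally, the focus threshold may be determined by the data processing device based on the detection data of the samples, or by the user based on the detection data of the samples. In the latter case, the user enters the chosen focus threshold into the data processing device; the device receives the user's setting operation for the focus threshold, displays the focus threshold mark in the sample distribution diagram on the graphical interface, and distinguishes the data display effect of positive and negative samples based on the focus threshold.
Optionally, the data processing device or the user may determine the focus threshold from the distribution of the sample detection data.
Optionally, the focus threshold may include a first focus threshold and a second focus threshold.
For example, acquiring the focus threshold may include: the data processing device receives the user's setting operation for the first focus threshold; or the data processing device determines the second focus threshold from the detection data of the samples. The two implementations are described in turn below.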
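In the first implementation, step S203 includes: receiving the user's setting operation for the first focus threshold, displaying a first focus threshold mark in the sample distribution diagram on the graphical interface, and distinguishing the data display effect of positive and negative samples based on the first focus threshold.
In the first implementation, the detection data of a sample may be a measurement parameter. Depending on the specific parameter, values above the threshold may be normal and values below it abnormal, or the reverse. Values inside a range may be normal and values outside it abnormal, or again the reverse. The user can set the threshold according to the specific parameter.
Optionally, the first focus threshold may be a single value set by the user or a range set by the user.
Optionally, taking a single-valued first focus threshold as an example, the first focus threshold includes a first value, and distinguishing the data display effect of positive and negative samples based on the first focus threshold includes: distinguishing the data display effect of positive and negative samples based on the magnitude relationship between the detection data of the sample and the first value.
For example, based on this relationship, the data processing device may classify samples whose detection data is greater than the first value as negative samples and samples whose detection data is less than the first value as positive samples.
For example, as shown in (a) of FIG. 3, samples whose abnormality rate is greater than the first value may be classified as negative samples; that is, the samples above the first value in (a) of FIG. 3 are negative samples, drawn as black dots. Samples whose abnormality rate is less than the first value are classified as positive samples; that is, the samples below the first value in (a) of FIG. 3 are positive samples, drawn as gray dots.
For example, based on the same relationship, the data processing device may instead classify samples whose detection data is greater than the first value as positive samples and samples whose detection data is less than the first value as negative samples.
As another example, as shown in (b) of FIG. 3, samples whose abnormality rate is greater than the first value may be classified as positive samples; that is, the samples above the first value in (b) of FIG. 3 are positive samples, drawn as gray dots. Samples whose abnormality rate is less than the first value are classified as negative samples; that is, the samples below the first value in (b) of FIG. 3 are negative samples, drawn as black dots.
It should be noted that the embodiments do not limit whether samples whose detection data is greater than the first value are classified as positive or negative samples. In practice, this can be decided according to the specific parameter type of the detection data.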
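The single-value split, with its configurable direction, can be sketched as follows. The function name `split_by_value`, the `above_is_negative` flag, and the treatment of a value exactly equal to the first value (it is handled as "not greater") are assumptions, since the source does not say how ties are treated. The dot colors follow FIG. 3 as described above.

```python
def split_by_value(values, first_value, above_is_negative=True):
    """Label each detection value against a single first value (hypothetical helper)."""
    labels = []
    for v in values:
        above = v > first_value  # assumption: v == first_value counts as "not above"
        is_negative = above if above_is_negative else not above
        labels.append("negative" if is_negative else "positive")
    return labels


# Display effect described for FIG. 3: black dots for negative, gray dots for positive.
DOT_COLOR = {"negative": "black", "positive": "gray"}

ratios = [0.088, 0.264, 0.011]
labels = split_by_value(ratios, first_value=0.1)
colors = [DOT_COLOR[label] for label in labels]
```

Passing `above_is_negative=False` reproduces the reversed convention of (b) in FIG. 3, which suits measurement parameters for which larger values are the normal ones.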
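Optionally, taking a first focus threshold that includes two values as an example, the first focus threshold may include a second value and a third value, and distinguishing the data display effect of positive and negative samples based on the first focus threshold includes: distinguishing the data display effect of positive and negative samples based on whether the detection data of the sample is greater than the second value and less than the third value.
For example, the second value and the third value form a range. Based on where the detection data lies relative to this range, the data processing device may classify samples whose detection data is greater than the second value and less than the third value as positive samples, and samples whose detection data is less than the second value or greater than the third value as negative samples.
For example, as shown in (a) of FIG. 4, samples whose abnormality rate is greater than the second value and less than the third value may be classified as positive samples; that is, the samples between the second value and the third value in (a) of FIG. 4 are positive samples, drawn as gray dots. Samples whose abnormality rate is less than the second value or greater than the third value are classified as negative samples; that is, the samples below the second value or above the third value in (a) of FIG. 4 are negative samples, drawn as black dots.
For example, with the same range, the data processing device may instead classify samples whose detection data is less than the second value or greater than the third value as positive samples, and samples whose detection data is greater than the second value and less than the third value as negative samples.
For example, as shown in (b) of FIG. 4, samples whose abnormality rate is greater than the second value and less than the third value may be classified as negative samples; that is, the samples between the second value and the third value in (b) of FIG. 4 are negative samples, drawn as black dots. Samples whose abnormality rate is less than the second value or greater than the third value are classified as positive samples; that is, the samples below the second value or above the third value in (b) of FIG. 4 are positive samples, drawn as gray dots.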
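The range-based split can be sketched the same way. The function name `split_by_range` and the `inside_is_positive` flag are hypothetical, and treating the two boundary values themselves as outside the range is an assumption that simply mirrors the strict "greater than the second value and less than the third value" wording.

```python
def split_by_range(values, second_value, third_value, inside_is_positive=True):
    """Label each detection value against the open range (second_value, third_value)."""
    if not second_value < third_value:
        raise ValueError("second_value must be smaller than third_value")
    labels = []
    for v in values:
        inside = second_value < v < third_value  # strict on both ends (assumption)
        is_positive = inside if inside_is_positive else not inside
        labels.append("positive" if is_positive else "negative")
    return labels


measurements = [1.0, 2.5, 4.0]
labels_a = split_by_range(measurements, 2.0, 3.0)                            # as in FIG. 4 (a)
labels_b = split_by_range(measurements, 2.0, 3.0, inside_is_positive=False)  # as in FIG. 4 (b)
```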
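In the second implementation, acquiring the focus threshold used to divide positive and negative samples in step S203 may include: the data processing device obtains the focus threshold from the distribution of the sample detection data. For example, the device may take a central-tendency statistic of the detection data, such as the median or the mean, as a reference focus threshold and use that reference focus threshold directly as the second focus threshold. As another example, the device may take such a statistic as the reference focus threshold and then further determine the second focus threshold from the distribution of the detection data as divided by that reference focus threshold.
For example, with N samples, as shown in FIG. 5, step S203 may include steps S2031 to S2033.
S2031. Arrange the detection data of the N samples in ascending order and take the median or mean of the detection data of the N samples as the reference focus value.
Optionally, the N samples may be samples that have been screened in step S205 described below, or samples that have not been screened in step S205; the present disclosure does not limit this.
Optionally, when the median or mean of the detection data of the N samples is used as the reference focus value, the reference focus index may be the index corresponding to that reference focus value.
For example, when the median of the detection data of the N samples is used as the reference focus value, the reference focus index may be $\lceil N/2 \rceil$, where $\lceil N/2 \rceil$ denotes N/2 rounded up. For instance, with N = 401, the reference focus value is the median of the 401 detection data and the reference focus index is 201.
As another example, when the mean of the detection data of the N samples is used as the reference focus value, the reference focus index may be the index of the detection data closest to that mean. For instance, if the detection data of the 600th sample in the sorted data is closest to the mean, the reference focus index may be set to 600.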
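The embodiments do not limit the specific method for determining the reference focus value. The following embodiments take a reference focus index of $\lceil N/2 \rceil$ and a reference focus value equal to the $\lceil N/2 \rceil$-th detection data as an example.
For example, with the abnormality rate Ratio as the detection data, the data processing device may arrange the abnormality rates of the N samples in ascending order to obtain the array $\mathrm{SortedData} = [x_1, x_2, x_3, \ldots, x_N]$, where $x_i$ denotes the $i$-th abnormality rate.
For example, taking the median as the reference focus value, the data processing device may take the middle position of the total number of samples as the reference focus index FocusIndex. If the total number of samples is even, FocusIndex is $N/2$. If the total number of samples is odd, FocusIndex is the middle position $(N+1)/2$.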
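The two ways of choosing the initial reference focus index can be sketched as below. Indices are 1-based, as in the text. The function names are hypothetical, and picking the first index when several values are equally close to the mean is an assumption the source does not address.

```python
from statistics import mean


def median_focus_index(n):
    """1-based index ceil(n / 2), computed with integer arithmetic."""
    return (n + 1) // 2


def mean_focus_index(sorted_data):
    """1-based index of the value closest to the mean (first one on ties; an assumption)."""
    m = mean(sorted_data)
    return min(range(1, len(sorted_data) + 1), key=lambda i: abs(sorted_data[i - 1] - m))


print(median_focus_index(401))   # odd N
print(median_focus_index(1000))  # even N
print(mean_focus_index([1, 2, 3, 10]))
```

For N = 401 the first function gives 201 and for N = 1000 it gives 500, matching the examples in the text. With the data `[1, 2, 3, 10]` the mean is 4, so the closest value is the third one.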
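For example, with N = 1000 samples and the abnormality rate Ratio as the detection data, the 1000 abnormality rates are arranged in ascending order to obtain the array $\mathrm{SortedData} = [x_1, x_2, x_3, \ldots, x_{1000}]$, the reference focus index FocusIndex is set to 500, and the 500th abnormality rate $x_{500}$ in SortedData is taken as the reference focus value, which divides SortedData into LowerGroup and UpperGroup.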
Figure PCTCN2021097480-appb-000006
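S2032. Determine the second focus threshold based on the reference focus value and the detection data of the N samples.
Optionally, based on the reference focus value and the detection data of the N samples, the AutoFocus algorithm may be used to determine the second focus threshold Focus.
For example, determining the second focus threshold in step S2032 based on the reference focus value and the detection data of the N samples may include the following steps:
Step a. Average the detection data of the N samples that are less than or equal to the reference focus value to obtain the first mean $\mathrm{Mean}_l$, and average the detection data of the N samples that are greater than the reference focus value to obtain the second mean $\mathrm{Mean}_u$.
For example, using the reference focus value, the data processing device may place the detection data that are less than or equal to the reference focus value into LowerGroup and the detection data that are greater than the reference focus value into UpperGroup, then average the detection data in LowerGroup to obtain $\mathrm{Mean}_l$ and average the detection data in UpperGroup to obtain $\mathrm{Mean}_u$.
For example, with N = 1000 samples and the abnormality rate Ratio as the detection data, the data processing device uses the reference focus value $x_{500}$, averages $x_1$ to $x_{500}$ in SortedData to obtain $\mathrm{Mean}_l$, and averages $x_{501}$ to $x_{1000}$ (the data greater than $x_{500}$) to obtain $\mathrm{Mean}_u$.
Step b. Take the absolute difference between each of the N sorted detection data and $\mathrm{Mean}_l$ to obtain the first mean difference $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$, and take the absolute difference between each of the N sorted detection data and $\mathrm{Mean}_u$ to obtain the second mean difference $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$. Compare the two element by element and determine the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$. Update the reference focus index to $k$ and update the reference focus value to the $k$-th of the sorted detection data.
For example, with N = 1000 samples and the abnormality rate Ratio as the detection data, the data processing device takes the absolute difference between each of the 1000 data in SortedData and $\mathrm{Mean}_l$ to obtain $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_{1000}]$, and between each of the 1000 data and $\mathrm{Mean}_u$ to obtain $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_{1000}]$. It then compares $l_i$ with $u_i$ for each $i$ in turn ($l_1$ with $u_1$, $l_2$ with $u_2$, $l_3$ with $u_3$, and so on), determines the number $k$ of cases with $l_i < u_i$ (say $k = 700$), sets the reference focus index FocusIndex to 700, and updates the reference focus value to the 700th abnormality rate $x_{700}$ in SortedData.
Step c. Repeat step a and step b until the reference focus index no longer changes, then determine the second focus threshold based on the detection data at that reference focus index in the sorted data.
For example, step a is executed again: with the reference focus value $x_{700}$, the data processing device averages $x_1$ to $x_{700}$ in SortedData to obtain $\mathrm{Mean}_l$ and averages $x_{701}$ to $x_{1000}$ to obtain $\mathrm{Mean}_u$. In step b, the device recomputes $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_{1000}]$ and $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_{1000}]$ from the new means, compares $l_i$ with $u_i$ one by one, determines the number $k$ of cases with $l_i < u_i$ (say $k = 750$), sets FocusIndex to 750, and updates the reference focus value to the 750th abnormality rate $x_{750}$. Steps a and b are then executed again with $x_{750}$; if $k$ is still 750, FocusIndex is fixed at 750 and the second focus threshold is determined based on $x_{750}$.
Optionally, when the data processing device determines the second focus threshold from the detection data at the unchanged reference focus index, it may use that detection data itself as the second focus threshold, or it may average that detection data with the preceding detection data to obtain the second focus threshold. The present disclosure does not limit the specific method.
For example, if the reference focus index is 750 both before and after the update, the data processing device may use the 750th abnormality rate $x_{750}$ in SortedData as the second focus threshold, or it may average the 749th abnormality rate $x_{749}$ and the 750th abnormality rate $x_{750}$ to obtain the second focus threshold.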
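Steps a to c above can be sketched in a few lines. This is a minimal sketch rather than the patent's own implementation: the function name `auto_focus`, the `max_iter` safety cap, and the early exit when no value exceeds the reference focus value (an empty upper group, which the source does not discuss) are all assumptions. The loop behaves like a one-dimensional two-group clustering in which each value is assigned to the nearer of the two group means.

```python
from statistics import mean


def auto_focus(values, max_iter=100):
    """Return (focus_index, focus_value); focus_index is 1-based into the sorted data."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two samples")
    index = (n + 1) // 2  # initial reference focus index, ceil(n / 2)
    for _ in range(max_iter):  # safety cap (assumption, not in the source)
        ref = data[index - 1]
        # Step a: means of the values <= ref and of the values > ref.
        lower = [x for x in data if x <= ref]
        upper = [x for x in data if x > ref]
        if not upper:  # no value above ref: nothing to separate (assumption)
            break
        mean_l, mean_u = mean(lower), mean(upper)
        # Step b: count the values that are strictly closer to mean_l than to mean_u.
        k = sum(1 for x in data if abs(x - mean_l) < abs(x - mean_u))
        # Step c: stop once the index no longer changes.
        if k == index:
            break
        index = k
    return index, data[index - 1]


focus_index, focus_value = auto_focus([1, 2, 3, 4, 5, 50, 55, 60])
```

For this input the index starts at 4, moves to 5 after one pass, and stays at 5 on the next pass, so the sketch returns index 5 and value 5. The alternative mentioned above, averaging the value at the final index with the one before it, could be applied to the returned index by the caller.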
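S2033. Display the second focus threshold mark in the sample distribution diagram on the graphical interface, and distinguish the data display effect of positive and negative samples based on the second focus threshold.
For example, with the abnormality rate Ratio as the detection data, as shown in FIG. 6, the graphical interface shows the mark of the second focus threshold, which separates positive and negative samples: the black dots above the second focus threshold are negative samples and the gray dots below it are positive samples.
In the second implementation, the data processing device determines the second focus threshold from the detection data of the samples and divides positive and negative samples by it, which makes data analysis based on those samples more accurate.
In the embodiments, the focus threshold is obtained either by the data processing device from the detection data of the samples or from the user's setting operation for the first focus threshold. Either way, the focus threshold allows positive and negative samples to be divided reasonably, which makes the data analysis more accurate.
S204. Determine the cause of sample abnormality based on the positive and negative samples.
For example, after dividing positive and negative samples by the focus threshold, the data processing device performs sample feature analysis or machine learning model training based on the abnormal samples among them, so that sample data can be analyzed or models trained more accurately.
In one embodiment of the present disclosure, determining the cause of sample abnormality based on the positive and negative samples includes performing sample feature analysis on them, using statistical analysis methods such as WOE (weight of evidence), Pearson correlation analysis, and decision tree algorithms to analyze the feature data that lead to abnormal detection results, thereby obtaining the degree to which each feature affects the detection result. In another embodiment, it further includes using the positive and negative division as input data to train machine learning models such as logistic regression, random forest, LGBM, Xgboost, and CatBoost, thereby obtaining a sample abnormality prediction model and an importance ranking of the sample feature data. The present disclosure does not limit the specific method for determining the cause of sample abnormality based on the positive and negative samples; this is only an example.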
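As one illustration of the statistical analysis named above, the sketch below computes a weight-of-evidence value per category of a single feature (here a hypothetical production-equipment label). The sign convention used, the logarithm of the category's share of negative samples over its share of positive samples, is an assumption; WOE is defined with either orientation in practice, and the source does not specify one. The function name and the optional `smoothing` parameter are likewise hypothetical.

```python
import math
from collections import Counter


def weight_of_evidence(categories, is_negative, smoothing=0.0):
    """WOE per category: ln(share of negatives / share of positives).

    With smoothing=0.0, a category that lacks either class raises an error.
    """
    neg_counts, pos_counts = Counter(), Counter()
    for cat, neg in zip(categories, is_negative):
        (neg_counts if neg else pos_counts)[cat] += 1
    cats = sorted(set(categories))
    neg_total = sum(neg_counts.values()) + smoothing * len(cats)
    pos_total = sum(pos_counts.values()) + smoothing * len(cats)
    woe = {}
    for cat in cats:
        neg_share = (neg_counts[cat] + smoothing) / neg_total
        pos_share = (pos_counts[cat] + smoothing) / pos_total
        woe[cat] = math.log(neg_share / pos_share)
    return woe


equipment = ["EQ1", "EQ1", "EQ1", "EQ2", "EQ2", "EQ2"]
negative = [True, True, False, False, False, True]
woe = weight_of_evidence(equipment, negative)
```

Under this convention a larger value means the category is over-represented among negative samples, so in the toy data EQ1 (two of three negatives) scores above EQ2.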
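In the data processing method provided by the embodiments, the data processing device determines the focus threshold from the detection data of the samples and shows, in the sample distribution diagram on the graphical interface, the display effect that distinguishes positive and negative samples by the focus threshold. Positive and negative samples are thus divided reasonably, so sample data can be analyzed or models trained more accurately, and the identified cause of abnormality or the resulting model is more accurate.
FIG. 7 shows another data processing method provided by the present disclosure. In addition to steps S201 to S204 above, the method may include step S205.
S205. Screen the sample data based on the user's filtering operation on a filtering threshold, and display the distribution diagram of the screened samples on the graphical interface.
The filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold.
Optionally, step S205 may be executed before or after step S203; the present disclosure does not limit this, and FIG. 7 shows step S205 before step S203 as an example. When step S205 is executed before step S203, the data processing device may screen the sample data based on the filtering threshold, determine the focus threshold from the detection data of the screened samples, divide positive and negative samples by that focus threshold, and then determine the cause of sample abnormality from them. When step S205 is executed after step S203, the device may screen the sample data based on the filtering threshold, re-determine the focus threshold from the detection data of the screened samples, divide positive and negative samples by the re-determined focus threshold, and then determine the cause of sample abnormality from them.
Optionally, the filtering operation may include a setting operation and a selection operation. The selection operation may include a box-selection operation.
Optionally, each of the filtering thresholds may include one value or multiple values; the present disclosure does not limit this.
For example, suppose the filtering threshold includes an abnormality rate threshold, where the abnormality rate indicates the ratio of the number of abnormal sub-samples in a sample to the total number of sub-samples in that sample. Because the data processing device acquires a large amount of sample data, the user can set an abnormality rate threshold, and the device screens the sample data accordingly, filtering out samples whose abnormality rate is below the threshold. Removing samples whose abnormality rate is too low to be informative improves the reliability of the sample analysis.
For example, suppose the filtering threshold includes an arrival rate threshold, where the arrival rate indicates the ratio of the number of sub-samples actually detected in a sample to the total number of sub-samples in that sample. Some sub-samples of a sample may never reach the detection site, so the number of sub-samples actually detected may be smaller than the total. A sample may therefore show a low abnormality rate simply because some of its sub-samples were not detected. In other words, when the arrival rate of a sample is low, most of its sub-samples did not reach the detection site, and its abnormality rate is unreliable. To improve the accuracy of the abnormality rates, samples with low arrival rates can be filtered out and samples with high arrival rates kept, so that the sample analysis is more reliable.
For example, take glass (GLASS) as the sample: after the various processes, each glass can be cut into multiple panels, and each panel then enters a detection site for defect detection. The arrival rate of each Glass is the ratio of the number of its Panels that reach the detection site to the total number of Panels cut, and the abnormality rate of each Glass is the ratio of the number of abnormal Panels detected to the total number of Panels cut. To improve the accuracy of the abnormality rate and avoid a Glass showing a low abnormality rate only because some Panels never reached the detection site, the user can set an arrival rate threshold based on experience (for example, 0.9). The data processing device then screens the sample data by this threshold and filters out samples whose arrival rate is below 0.9.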
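The arrival-rate and abnormality-rate screening can be sketched as a single pass over the samples. The function name `filter_by_rates`, its parameter names, and the dictionary keys are hypothetical; the values are the three rows of Table 1 and the 0.9 arrival rate threshold is the one used in the example above. Keeping samples whose rate equals the threshold exactly is an assumption.

```python
def filter_by_rates(samples, min_arrival_rate=None, min_abnormality_rate=None):
    """Keep samples that are not below the given thresholds (None disables a check)."""
    kept = []
    for s in samples:
        if min_arrival_rate is not None and s["input_ratio"] < min_arrival_rate:
            continue  # too few panels reached the detection site
        if min_abnormality_rate is not None and s["ratio"] < min_abnormality_rate:
            continue  # abnormality rate too low to be informative
        kept.append(s)
    return kept


table_1 = [
    {"glass_id": "GlassID 1", "ratio": 0.088, "input_ratio": 0.953},
    {"glass_id": "GlassID 2", "ratio": 0.264, "input_ratio": 0.924},
    {"glass_id": "GlassID n", "ratio": 0.011, "input_ratio": 0.837},
]
by_arrival = filter_by_rates(table_1, min_arrival_rate=0.9)
by_both = filter_by_rates(table_1, min_arrival_rate=0.9, min_abnormality_rate=0.1)
```

With the 0.9 threshold, the third row (arrival rate 0.837) is dropped; adding a hypothetical abnormality rate threshold of 0.1 leaves only the second row.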
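For example, suppose the filtering threshold includes a production equipment threshold and an environmental parameter threshold. To narrow the scope of sample analysis, the user can set these two thresholds, and the data processing device screens the sample data accordingly, filtering out samples that do not satisfy them and keeping samples that do. By removing samples that are useless for the analysis, the device improves the purity of the diagnostic data and the accuracy of the data analysis.
For example, as shown in FIG. 8, to narrow the scope of sample analysis and improve its reliability, the user enters setting operations for the production equipment threshold and the environmental parameter threshold. In response, the display interface of the data processing device shows the samples to be filtered out (the lightest gray dots in FIG. 8) and the device filters them out. The number and distribution of the remaining samples change, so the focus threshold can be acquired again as in step S203.
For example, suppose the filtering threshold includes a detection time threshold. The user can select a detection time threshold, and the data processing device deletes the samples whose detection time satisfies the selected threshold. Alternatively, it may delete the samples whose detection time does not satisfy the selected threshold.
For example, as shown in FIG. 9, after the user enters a box-selection operation for the detection time threshold, the display interface of the data processing device shows the box-selected detection time threshold in response, the device deletes the samples whose detection time falls within the selected detection time, and it divides the screened samples into positive and negative samples based on the focus threshold.
For example, suppose the filtering threshold includes a generation time threshold. The user can select a generation time threshold, and the data processing device deletes the samples whose generation time satisfies the selected threshold. Alternatively, it may delete the samples whose generation time does not satisfy the selected threshold.
For example, as shown in FIG. 10, after the user enters a setting operation for the generation time threshold, the data processing device filters out, in response, the samples whose generation time does not match the generation time set by the user, shows the samples whose generation time matches on the display interface, and divides the screened samples into positive and negative samples based on the focus threshold.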
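The two time-based variants, deleting the samples inside the selected window or deleting those outside it, differ only by a flag. The sketch below assumes a window given by a start and an end time with inclusive bounds; the function name `filter_by_time` and the `drop_inside` flag are hypothetical, and the timestamps are the END_TIME values of Table 1.

```python
from datetime import datetime


def filter_by_time(samples, start, end, time_key="end_time", drop_inside=True):
    """Drop samples inside [start, end] (drop_inside=True) or outside it (False)."""
    kept = []
    for s in samples:
        inside = start <= s[time_key] <= end  # inclusive bounds (assumption)
        if inside != drop_inside:
            kept.append(s)
    return kept


samples = [
    {"glass_id": "GlassID 1", "end_time": datetime(2021, 1, 24, 8, 25, 3)},
    {"glass_id": "GlassID 2", "end_time": datetime(2021, 1, 28, 7, 43, 11)},
    {"glass_id": "GlassID n", "end_time": datetime(2021, 2, 11, 20, 37, 45)},
]
window = (datetime(2021, 1, 25), datetime(2021, 2, 1))
outside_kept = filter_by_time(samples, *window)                    # delete box-selected samples
inside_kept = filter_by_time(samples, *window, drop_inside=False)  # keep only matching samples
```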
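When the filtering threshold includes several of the abnormality rate threshold, arrival rate threshold, production equipment threshold, environmental parameter threshold, detection time threshold, and generation time threshold, the data processing device may filter the sample data by the user-set thresholds one after another. The present disclosure does not limit the order in which the device applies multiple filtering thresholds.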
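Applying several filtering thresholds in turn can be sketched as a chain of keep/drop predicates. The helper `apply_filters`, the sample fields, and the specific conditions are hypothetical. For predicates like these, which each look at one sample in isolation, the final result does not depend on the order, which is consistent with the order being left open.

```python
def apply_filters(samples, filters):
    """Apply each predicate in turn, keeping only samples that satisfy it."""
    kept = list(samples)
    for keep in filters:
        kept = [s for s in kept if keep(s)]
    return kept


samples = [
    {"glass_id": "G1", "input_ratio": 0.95, "equipment": "EQ1"},
    {"glass_id": "G2", "input_ratio": 0.92, "equipment": "EQ2"},
    {"glass_id": "G3", "input_ratio": 0.80, "equipment": "EQ1"},
]
filters = [
    lambda s: s["input_ratio"] >= 0.9,   # arrival rate threshold
    lambda s: s["equipment"] == "EQ1",   # production equipment threshold
]
result = apply_filters(samples, filters)
result_reversed = apply_filters(samples, list(reversed(filters)))
```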
本公开实施例提供的数据处理方法,数据处理装置基于过滤阈值对样本数据进行筛选,并基于筛选以后的样本的检测数据确定聚焦阈值,并在图形界面的样本分布图中显示基于聚焦阈值区分正负样本的数据显示效果。即本公开实施例通过对样本数据进行筛选,能够过滤一部分没有参考价值或者影响样本分析结果的准确度的样本,能够提升样本数据的可靠性,使得样本分 析的结果更加可靠。而且通过合理的划分正负样本,从而根据划分的正负样本能更准确的分析样本数据或训练模型,以使得确定的样本异常原因或模型的准确度较高。
图11为本公开实施例提供的另一种数据处理方法,如图11所示,该方法包括以下步骤:
S1101、获取样本数据。
该样本数据包括样本的特征数据和检测数据。
可以理解的,步骤S1101的具体实现方式可以参考步骤S201,在此不再赘述。
S1102、基于样本的检测数据,确定聚焦阈值。
可选的,如图12所示,数据处理装置基于样本的检测数据,确定聚焦阈值,可以包括步骤S11021-S11022。
S11021、将N个样本的检测数据按照从小到大依次排列,并将N个样本的检测数据的中位数或均值作为参考聚焦值。
可选的,数据处理装置将N个样本的检测数据的中位数或均值作为参考聚焦值时,参考聚焦索引可以为该参考聚焦阈值相对应的值。
例如,数据处理装置将N个样本的检测数据的中位数作为参考聚焦值时,参考聚焦索引可以为
Figure PCTCN2021097480-appb-000007
其中,
Figure PCTCN2021097480-appb-000008
表示对N/2向上取整。
再例如,数据处理装置将N个样本的检测数据的均值作为参考聚焦值时,参考聚焦索引可以为与该均值最为接近的检测数据的索引。
本公开实施例对参考聚焦值的具体确定方法并不限定,下述实施例以参考聚焦索引为
Figure PCTCN2021097480-appb-000009
参考聚焦值为第
Figure PCTCN2021097480-appb-000010
个检测数据为例进行说明。
示例性的,以样本的检测数据为异常率Ratio为例,数据处理装置可以将N个样本的异常率按照从小到大的顺序依次排列,得到数组SortedData=[x 1,x 2,x 3,…,x N],其中x i表示第i个异常率。
示例性的,数据处理装置可以取样本总数的中间值作为参考聚焦索引FocusIndex。例如,如果样本总数为偶数,取N/2为参考聚焦索引FocusIndex。如果样本总数为奇数,取中间值
Figure PCTCN2021097480-appb-000011
为参考聚焦索引FocusIndex。
S11022、基于参考聚焦值以及N个样本的检测数据,确定第二聚焦阈值。
可选的,基于参考聚焦值和N个样本的检测数据,采用AutoFocus算法可以确定第二聚焦阈值Focus。
示例性的,步骤S11022中基于参考聚焦值以及N个样本的检测数据,确 定第二聚焦阈值,可以包括以下步骤:
步骤a、将N个样本的检测数据中小于或等于参考聚焦值的检测数据求平均得到第一均值Mean l,将N个样本的检测数据中大于参考聚焦值的检测数据求平均得到第二均值Mean u
示例性的,数据处理装置根据参考聚焦值可以将N个样本的检测数据中,小于或等于参考聚焦值的检测数据划分为LowerGroup,大于参考聚焦值的检测数据划分为UpperGroup。并将LowerGroup中的检测数据求平均得到第一均值Mean l,将UpperGroup中的检测数据求平均得到第二均值Mean u
步骤b、将依次排列的N个样本的检测数据逐个与第一均值Mean l作差并取绝对值,得到第一均差DiffLowerMean=[l 1,l 2,l 3…,l N],将依次排列的N个样本的检测数据逐个与第二均值Mean u作差并取绝对值,得到第二均差DiffUpperMean=[u 1,u 2,u 3…,u N],逐个比较第一均差和第二均差,确定l i<u i的数量k,i=1,2,3,…,N,将参考聚焦索引更新为k,并在依次排列的N个样本的检测数据中,将参考聚焦值更新为第k个检测数据的值。
步骤c、重复步骤a和步骤b,直至更新前后参考聚焦索引的值不变,在依次排列的N个样本的检测数据中,基于该参考聚焦索引对应的检测数据确定第二聚焦阈值。
可选的,数据处理装置基于不变的参考聚焦索引对应的检测数据确定第二聚焦阈值时,可以将该参考聚焦索引对应的检测数据确定为第二聚焦阈值。也可以将该参考聚焦索引对应的检测数据与其前一个检测数据求平均得到第二聚焦阈值。本公开对基于参考聚焦索引对应的检测数据确定第二聚焦阈值的具体方法并不限定。
S1103、基于聚焦阈值,将样本划分为正负样本。
可选的,数据处理装置可以基于样本的检测数据与聚焦阈值的大小关系,将样本划分为正负样本。
示例性的,在聚焦阈值可以为一个数值的情况下,数据处理装置可以基于样本的检测数据与聚焦阈值的大小关系,将检测数据大于该聚焦阈值的样本划分为负样本,将检测数据小于该聚焦阈值的样本划分为正样本。数据处理装置也可以基于样本的检测数据与聚焦阈值的大小关系,将检测数据大于该聚焦阈值的样本划分为正样本,将检测数据小于该聚焦阈值的样本划分为负样本。
示例性的,在聚焦阈值可以为一个数值范围的情况下,数据处理装置可 以基于样本的检测数据是否在该数值范围内,将检测数据在该数值范围内的样本划分为负样本,将检测数据在该数值范围外的样本划分为正样本。数据处理装置也可以基于样本的检测数据是否在该数值范围内,将检测数据在该数值范围内的样本划分为正样本,将检测数据在该数值范围外的样本划分为负样本。
S1104、基于正负样本,确定样本异常的原因。
示例性的,数据处理装置基于聚焦阈值划分正负样本后,基于该正负样本中的异常样本进行样本特征分析或机器学习模型的训练,能更准确的分析样本数据或训练模型。
在本公开的实施例中,基于正负样本,确定样本异常的原因包括基于正负样本进行样本特征分析,利用WOE、皮尔逊相关性分析、决策树算法等统计分析方法对引起样本异常检测结果的特征数据进行分析,从而得到特征数据对检测结果的影响程度。在本公开的另一个实施例中,基于正负样本,确定样本异常的原因还包括基于正负样本的划分,作为输入数据,利用逻辑回归、随机森林、LGBM、Xgboost、CatBoost等机器学习模型进行训练,从而获得样本异常预测模型以及样本特征数据的重要性排序。
本公开实施例提供的数据处理方法,数据处理装置基于样本的检测数据确定聚焦阈值,基于该聚焦阈值,能够合理的划分正负样本,从而根据划分的正负样本能更准确的分析样本数据或训练模型,以使得确定的样本异常原因或模型的准确度较高。
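FIG. 13 shows another data processing method provided by the present disclosure. In addition to steps S1101 to S1104 above, the method may include step S1105.
S1105. Screen the sample data based on a filtering threshold.
The filtering threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold.
Optionally, step S1105 may be executed before or after step S1102; the present disclosure does not limit this, and FIG. 13 shows step S1105 before step S1102 as an example. When step S1105 is executed before step S1102, the data processing device may screen the sample data based on the filtering threshold, determine the focus threshold from the detection data of the screened samples, divide positive and negative samples by that focus threshold, and then determine the cause of sample abnormality from them. When step S1105 is executed after step S1102, the device may screen the sample data based on the filtering threshold, re-determine the focus threshold from the detection data of the screened samples, divide positive and negative samples by the re-determined focus threshold, and then determine the cause of sample abnormality from them.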
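The order "screen first, then determine the threshold, then divide" can be sketched end to end. To keep the sketch short it uses the simplest option the description allows, taking the reference focus value (the lower median, i.e. the ceil(N/2)-th sorted value) directly as the focus threshold instead of iterating further. The function name `run_pipeline`, the sample fields, and the rule that a value above the threshold is negative are assumptions.

```python
from statistics import median_low


def run_pipeline(samples, min_arrival_rate):
    """Screen by arrival rate, determine a focus threshold, then divide the samples."""
    # S1105: screen the samples by the arrival rate threshold.
    screened = [s for s in samples if s["input_ratio"] >= min_arrival_rate]
    # S1102 (simplified): lower median of the screened abnormality rates.
    focus = median_low([s["ratio"] for s in screened])
    # S1103: samples above the threshold are treated as negative (assumption).
    labels = {
        s["glass_id"]: ("negative" if s["ratio"] > focus else "positive")
        for s in screened
    }
    return focus, labels


samples = [
    {"glass_id": "G1", "ratio": 0.01, "input_ratio": 0.95},
    {"glass_id": "G2", "ratio": 0.02, "input_ratio": 0.92},
    {"glass_id": "G3", "ratio": 0.30, "input_ratio": 0.97},
    {"glass_id": "G4", "ratio": 0.40, "input_ratio": 0.50},
    {"glass_id": "G5", "ratio": 0.25, "input_ratio": 0.99},
]
focus, labels = run_pipeline(samples, min_arrival_rate=0.9)
```

Sample G4 is removed before the threshold is computed, so it influences neither the threshold nor the division. This is the practical effect of running the screening step first.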
可以理解的,步骤S1105的具体实现方式可以参考步骤S205,在此不再赘述。
可以理解的,在过滤阈值包括异常率阈值、到达率阈值、生产设备阈值、环境参数阈值、检测时间阈值、或生成时间阈值中的多个阈值的情况下,数据处理装置可以基于用户设置的多个阈值,依次对样本数据进行过滤。本公开对于数据处理装置基于多个过滤阈值筛选样本的先后顺序并不限定。
本公开实施例提供的数据处理方法,数据处理装置基于过滤阈值对样本数据进行筛选,并基于筛选以后的样本的检测数据确定聚焦阈值,基于聚焦阈值划分正负样本。即本公开实施例通过对样本数据进行筛选,能够过滤一部分没有参考价值或者影响样本分析结果的准确度的样本,能够提升样本数据的可靠性,使得样本分析的结果更加可靠。而且通过合理的划分正负样本,从而根据划分的正负样本能更准确的分析样本数据或训练模型,以使得确定的样本异常原因或模型的准确度较高。
上述主要从方法的角度对本公开实施例提供的方案进行了介绍。为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本公开能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本公开的范围。
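An embodiment of the present disclosure further provides a data processing device. FIG. 14 is a structural diagram of a data processing device provided by an embodiment of the present disclosure. The data processing device 140 is configured to perform the data processing method of any of the above embodiments. The data processing device 140 may include an acquisition module 141, a display module 142, a determination module 143, and a screening module 144.
The acquisition module 141 is configured to obtain sample data in response to a user's input operation on the graphical interface. The sample data includes characteristic data and detection data of the samples. The display module 142 is configured to display a sample distribution diagram on the graphical interface based on the sample data. The acquisition module 141 is further configured to obtain a focus threshold for dividing positive and negative samples. The display module 142 is further configured to display, based on the focus threshold obtained by the acquisition module 141, a focus threshold mark in the sample distribution diagram of the graphical interface, and to distinguish the data display effects of the positive and negative samples based on the focus threshold. The focus threshold is determined based on the detection data of the samples. The determination module 143 is configured to determine the cause of the sample abnormality based on the positive and negative samples.
Optionally, the characteristic data of the samples includes at least one of product model, inspection station, abnormality type, arrival rate, production equipment, environmental parameter, detection time, or generation time.
Optionally, the detection data of the samples includes at least one of an abnormality rate or a measured parameter.
In some embodiments, the focus threshold includes a second focus threshold. Taking N as the number of samples, the acquisition module 141 is specifically configured to arrange the detection data of the N samples in ascending order and take the median or mean of the detection data of the N samples as a reference focus value, and to determine the second focus threshold based on the reference focus value and the detection data of the N samples. The display module 142 is specifically configured to display a second focus threshold mark in the sample distribution diagram of the graphical interface and to distinguish the data display effects of the positive and negative samples based on the second focus threshold.
In other embodiments, the acquisition module 141 is further specifically configured to perform the following steps. Step a: among the detection data of the N samples, average the detection data that is less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and average the detection data that is greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$. Step b: subtract the first mean $\mathrm{Mean}_l$ from each of the sorted detection data of the N samples and take the absolute value, obtaining the first mean deviation $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$; subtract the second mean $\mathrm{Mean}_u$ from each of the sorted detection data of the N samples and take the absolute value, obtaining the second mean deviation $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$; compare the first mean deviation and the second mean deviation element by element and determine the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$; update the reference focus index to $k$, and update the reference focus value to the value of the $k$-th detection data among the sorted detection data of the N samples. Step c: repeat step a and step b until the value of the reference focus index is the same before and after the update, and determine the second focus threshold based on the detection data corresponding to that reference focus index among the sorted detection data of the N samples.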
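Steps a to c can be sketched in Python as follows. This is a hedged illustration rather than the disclosed implementation: the function name, the `max_iter` safeguard, and the handling of the degenerate case in which no detection data exceeds the reference focus value are assumptions added here. The final threshold is either the detection data at the converged reference focus index or, optionally, its average with the preceding detection data, the two options the description mentions.

```python
from statistics import mean, median


def second_focus_threshold(detection_values, use_median=True,
                           average_with_previous=False, max_iter=100):
    """Return (threshold, reference_focus_index) following steps a to c.

    The reference focus index is 1-based, as in the description.
    """
    data = sorted(detection_values)
    if not data:
        raise ValueError("at least one detection value is required")

    ref_value = median(data) if use_median else mean(data)
    ref_index = None
    for _ in range(max_iter):  # max_iter is a safeguard, not part of the description
        # Step a: means of the values at or below, and above, the reference value.
        lower = [x for x in data if x <= ref_value]
        upper = [x for x in data if x > ref_value]
        if not upper:  # degenerate case (assumed handling): nothing above the reference
            break
        mean_l, mean_u = mean(lower), mean(upper)

        # Step b: count the values that are closer to the lower mean than to the upper mean.
        k = sum(1 for x in data if abs(x - mean_l) < abs(x - mean_u))

        # Step c: stop once the reference focus index no longer changes.
        if k == ref_index:
            break
        ref_index = k
        ref_value = data[k - 1]

    if ref_index is None:
        return ref_value, len(data)

    threshold = data[ref_index - 1]
    if average_with_previous and ref_index >= 2:
        threshold = (data[ref_index - 1] + data[ref_index - 2]) / 2
    return threshold, ref_index


threshold, index = second_focus_threshold([12, 1, 11, 2, 10, 3])
```

For the six values above, the sorted data is [1, 2, 3, 10, 11, 12]; the first pass gives means 2 and 11 and k = 3, the second pass gives k = 3 again, so the loop stops with the third value as the threshold. The procedure behaves like a two-cluster split of one-dimensional data, which is an observation about this sketch and not a statement made in the description.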
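In other embodiments, the focus threshold includes a first focus threshold, and there are one or more first focus thresholds. The acquisition module 141 is further specifically configured to receive a user's setting operation on the first focus threshold. The display module 142 is further configured to display a first focus threshold mark in the sample distribution diagram of the graphical interface and to distinguish the data display effects of the positive and negative samples based on the first focus threshold.
In other embodiments, the first focus threshold includes a first value, and the display module 142 is specifically configured to distinguish the data display effects of the positive and negative samples based on how the detection data of the samples compares with the first value.
In other embodiments, the first focus threshold includes a second value and a third value, the second value being less than the third value, and the display module 142 is further specifically configured to distinguish the data display effects of the positive and negative samples based on whether the detection data of the samples is greater than the second value and less than the third value.
In other embodiments, the screening module 144 is configured to screen the sample data based on a user's filtering operation on a filter threshold. The display module 142 is further configured to display a distribution diagram of the screened samples on the graphical interface.
The filter threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold. Each sample includes multiple sub-samples; the abnormality rate indicates the proportion of abnormal sub-samples in a sample relative to the total number of sub-samples the sample includes; the arrival rate indicates the proportion of sub-samples actually inspected in a sample relative to the total number of sub-samples the sample includes.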
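The two ratios defined above reduce to simple divisions over sub-sample counts. The sketch below shows them with hypothetical function and parameter names; the input validation is an assumption added for safety and is not part of the description.

```python
def _ratio(part, total):
    """Return part / total after basic sanity checks on the counts."""
    if total <= 0:
        raise ValueError("a sample must include at least one sub-sample")
    if not 0 <= part <= total:
        raise ValueError("count must lie between 0 and the total number of sub-samples")
    return part / total


def abnormality_rate(abnormal_subsamples, total_subsamples):
    """Proportion of abnormal sub-samples among all sub-samples of one sample."""
    return _ratio(abnormal_subsamples, total_subsamples)


def arrival_rate(inspected_subsamples, total_subsamples):
    """Proportion of sub-samples actually inspected among all sub-samples of one sample."""
    return _ratio(inspected_subsamples, total_subsamples)


# Hypothetical sample made of 60 sub-samples, 45 of which were inspected and 3 found abnormal.
rate_abnormal = abnormality_rate(3, 60)
rate_arrival = arrival_rate(45, 60)
```

A sample with a low arrival rate has had only a small share of its sub-samples inspected, so its abnormality rate rests on little evidence; that is one plausible reason for screening on an arrival rate threshold.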
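Optionally, the filtering operation includes a setting operation and a selection operation.
Of course, the data processing device 140 provided by the embodiments of the present disclosure includes, but is not limited to, the above modules.
An embodiment of the present disclosure further provides a data processing device. FIG. 15 is a structural diagram of a data processing device provided by an embodiment of the present disclosure. The data processing device 150 is configured to perform the data processing method of any of the above embodiments. The data processing device 150 may include an acquisition module 151, a determination module 152, a division module 153, and a screening module 154.
The acquisition module 151 is configured to obtain sample data, the sample data including characteristic data and detection data of the samples. The determination module 152 is configured to determine a focus threshold based on the detection data of the samples. The division module 153 is configured to divide the samples into positive and negative samples based on the focus threshold determined by the determination module. The determination module 152 is further configured to determine the cause of the sample abnormality based on the positive and negative samples.
Optionally, the characteristic data of the samples includes at least one of product model, inspection station, abnormality type, arrival rate, production equipment, environmental parameter, detection time, or generation time.
Optionally, the detection data of the samples includes at least one of an abnormality rate or a measured parameter.
In some embodiments, the focus threshold includes a second focus threshold, the number of samples is N, and the determination module 152 is specifically configured to: arrange the detection data of the N samples in ascending order and take the median or mean of the detection data of the N samples as a reference focus value; and determine the second focus threshold based on the reference focus value and the detection data of the N samples.
In other embodiments, the determination module 152 is further specifically configured to perform the following steps. Step a: among the detection data of the N samples, average the detection data that is less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and average the detection data that is greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$. Step b: subtract the first mean $\mathrm{Mean}_l$ from each of the sorted detection data of the N samples and take the absolute value, obtaining the first mean deviation $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$; subtract the second mean $\mathrm{Mean}_u$ from each of the sorted detection data of the N samples and take the absolute value, obtaining the second mean deviation $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$; compare the first mean deviation and the second mean deviation element by element and determine the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$; update the reference focus index to $k$, and update the reference focus value to the value of the $k$-th detection data among the sorted detection data of the N samples. Step c: repeat step a and step b until the value of the reference focus index is the same before and after the update, and determine the second focus threshold based on the detection data corresponding to that reference focus index among the sorted detection data of the N samples.
In other embodiments, the screening module 154 is configured to screen the sample data based on a filter threshold. The filter threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold. Each sample includes multiple sub-samples; the abnormality rate indicates the proportion of abnormal sub-samples in a sample relative to the total number of sub-samples the sample includes; the arrival rate indicates the proportion of sub-samples actually inspected in a sample relative to the total number of sub-samples the sample includes.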
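Another embodiment of the present disclosure further provides a data processing device. As shown in FIG. 16, the data processing device 160 includes a memory 161 and a processor 162; the memory 161 is coupled to the processor 162; the memory 161 is configured to store computer program code, and the computer program code includes computer instructions. When the processor 162 executes the computer instructions, the data processing device 160 is caused to perform each step performed by the data processing device in the method flows shown in the above method embodiments.
In an actual implementation, the acquisition module 141, the display module 142, the determination module 143, and the screening module 144 may be implemented by the processor 162 shown in FIG. 16 invoking the computer program code in the memory 161. For the specific execution process, reference may be made to the description of the data processing methods shown in FIG. 2, FIG. 3, and FIG. 7; details are not repeated here.
Another embodiment of the present disclosure further provides a data processing device. As shown in FIG. 17, the data processing device 170 includes a memory 171 and a processor 172; the memory 171 is coupled to the processor 172; the memory 171 is configured to store computer program code, and the computer program code includes computer instructions. When the processor 172 executes the computer instructions, the data processing device 170 is caused to perform each step performed by the data processing device in the method flows shown in the above method embodiments.
In an actual implementation, the acquisition module 151, the determination module 152, the division module 153, and the screening module 154 may be implemented by the processor 172 shown in FIG. 17 invoking the computer program code in the memory 171. For the specific execution process, reference may be made to the description of the data processing methods shown in FIG. 11, FIG. 12, and FIG. 13; details are not repeated here.
Some embodiments of the present disclosure provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium) storing computer program instructions that, when run on a processor, cause the processor to perform one or more steps of the data processing method of any of the above embodiments.
Exemplarily, the computer-readable storage medium may include, but is not limited to: magnetic storage devices (for example, hard disks, floppy disks, or magnetic tapes), optical discs (for example, CD (Compact Disk) or DVD (Digital Versatile Disk)), smart cards, and flash memory devices (for example, EPROM (Erasable Programmable Read-Only Memory), cards, sticks, or key drives). The various computer-readable storage media described in the present disclosure may represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
Some embodiments of the present disclosure further provide a computer program product. The computer program product includes computer program instructions that, when executed on a computer, cause the computer to perform one or more steps of the data processing method of the above embodiments.
Some embodiments of the present disclosure further provide a computer program. When executed on a computer, the computer program causes the computer to perform one or more steps of the data processing method of the above embodiments.
The beneficial effects of the above computer-readable storage medium, computer program product, and computer program are the same as those of the data processing method of some of the above embodiments; details are not repeated here.
The foregoing is merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.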

Claims (39)

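  1. A data processing method, the method comprising:
    obtaining sample data in response to a user's input operation on a graphical interface, the sample data including characteristic data and detection data of samples;
    displaying a sample distribution diagram on the graphical interface based on the sample data;
    obtaining a focus threshold for dividing positive and negative samples, displaying a mark of the focus threshold in the sample distribution diagram of the graphical interface, and distinguishing data display effects of the positive and negative samples based on the focus threshold, wherein the focus threshold is determined based on the detection data of the samples; and
    determining a cause of the sample abnormality based on the positive and negative samples.
  2. The method according to claim 1, wherein the focus threshold includes a first focus threshold, there being one or more first focus thresholds, and the obtaining a focus threshold for dividing positive and negative samples, displaying a mark of the focus threshold in the sample distribution diagram of the graphical interface, and distinguishing data display effects of the positive and negative samples based on the focus threshold comprises:
    receiving a user's setting operation on the first focus threshold, displaying a mark of the first focus threshold in the sample distribution diagram of the graphical interface, and distinguishing the data display effects of the positive and negative samples based on the first focus threshold.
  3. The method according to claim 2, wherein the first focus threshold includes a first value, and the distinguishing the data display effects of the positive and negative samples based on the first focus threshold comprises:
    distinguishing the data display effects of the positive and negative samples based on how the detection data of the samples compares with the first value.
  4. The method according to claim 2, wherein the first focus threshold includes a second value and a third value, the second value being less than the third value, and the distinguishing the data display effects of the positive and negative samples based on the first focus threshold comprises:
    distinguishing the data display effects of the positive and negative samples based on whether the detection data of the samples is greater than the second value and less than the third value.
  5. The method according to any one of claims 1-4, the method further comprising:
    screening the sample data based on a user's filtering operation on a filter threshold, and displaying a distribution diagram of the screened samples on the graphical interface.
  6. The method according to claim 5, wherein the filter threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the proportion of abnormal sub-samples in a sample relative to the total number of sub-samples the sample includes; and the arrival rate indicates the proportion of sub-samples actually inspected in a sample relative to the total number of sub-samples the sample includes.
  7. The method according to claim 5 or 6, wherein the filtering operation includes a setting operation and a selection operation.
  8. The method according to any one of claims 1-7, wherein the characteristic data of the samples includes at least one of product model, inspection station, abnormality type, arrival rate, production equipment, environmental parameter, detection time, or generation time.
  9. The method according to any one of claims 1-8, wherein the detection data of the samples includes at least one of an abnormality rate or a measured parameter.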
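  10. The method according to any one of claims 1-9, wherein the focus threshold further includes a second focus threshold, the number of samples is N, and the obtaining a focus threshold for dividing positive and negative samples, displaying a mark of the focus threshold in the sample distribution diagram of the graphical interface, and distinguishing data display effects of the positive and negative samples based on the focus threshold comprises:
    arranging the detection data of the N samples in ascending order, and taking the median or mean of the detection data of the N samples as a reference focus value;
    determining the second focus threshold based on the reference focus value and the detection data of the N samples; and
    displaying a mark of the second focus threshold in the sample distribution diagram of the graphical interface, and distinguishing the data display effects of the positive and negative samples based on the second focus threshold.
  11. The method according to claim 10, wherein the determining the second focus threshold based on the reference focus value and the detection data of the N samples comprises the following steps:
    step a: among the detection data of the N samples, averaging the detection data that is less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and averaging the detection data that is greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$;
    step b: subtracting the first mean $\mathrm{Mean}_l$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a first mean deviation $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$; subtracting the second mean $\mathrm{Mean}_u$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a second mean deviation $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$; comparing the first mean deviation and the second mean deviation element by element and determining the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$; updating a reference focus index to $k$; and updating the reference focus value to the value of the $k$-th detection data among the sorted detection data of the N samples;
    step c: repeating step a and step b until the value of the reference focus index is the same before and after the update, and determining the second focus threshold based on the detection data corresponding to that reference focus index among the sorted detection data of the N samples.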
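  12. A data processing method, the method comprising:
    obtaining sample data, the sample data including characteristic data and detection data of samples;
    determining a focus threshold based on the detection data of the samples;
    dividing the samples into positive and negative samples based on the focus threshold; and
    determining a cause of the sample abnormality based on the positive and negative samples.
  13. The method according to claim 12, wherein the focus threshold includes a second focus threshold, the number of samples is N, and the determining a focus threshold based on the detection data of the samples comprises:
    arranging the detection data of the N samples in ascending order, and taking the median or mean of the detection data of the N samples as a reference focus value; and
    determining the second focus threshold based on the reference focus value and the detection data of the N samples.
  14. The method according to claim 13, wherein the determining the second focus threshold based on the reference focus value and the detection data of the N samples comprises the following steps:
    step a: among the detection data of the N samples, averaging the detection data that is less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and averaging the detection data that is greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$;
    step b: subtracting the first mean $\mathrm{Mean}_l$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a first mean deviation $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$; subtracting the second mean $\mathrm{Mean}_u$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a second mean deviation $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$; comparing the first mean deviation and the second mean deviation element by element and determining the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$; updating a reference focus index to $k$; and updating the reference focus value to the value of the $k$-th detection data among the sorted detection data of the N samples;
    step c: repeating step a and step b until the value of the reference focus index is the same before and after the update, and determining the second focus threshold based on the detection data corresponding to that reference focus index among the sorted detection data of the N samples.
  15. The method according to any one of claims 12-14, the method further comprising:
    screening the sample data based on a filter threshold.
  16. The method according to claim 15, wherein the filter threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the proportion of abnormal sub-samples in a sample relative to the total number of sub-samples the sample includes; and the arrival rate indicates the proportion of sub-samples actually inspected in a sample relative to the total number of sub-samples the sample includes.
  17. The method according to any one of claims 12-16, wherein the characteristic data of the samples includes at least one of product model, inspection station, abnormality type, arrival rate, production equipment, environmental parameter, detection time, or generation time.
  18. The method according to any one of claims 12-17, wherein the detection data of the samples includes at least one of an abnormality rate or a measured parameter.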
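  19. A data processing device, the device comprising:
    an acquisition module configured to obtain sample data in response to a user's input operation on a graphical interface, the sample data including characteristic data and detection data of samples;
    a display module configured to display a sample distribution diagram on the graphical interface based on the sample data obtained by the acquisition module;
    the acquisition module being further configured to obtain a focus threshold for dividing positive and negative samples;
    the display module being further configured to display, based on the focus threshold obtained by the acquisition module, a mark of the focus threshold in the sample distribution diagram of the graphical interface, and to distinguish data display effects of the positive and negative samples based on the focus threshold, wherein the focus threshold is determined based on the detection data of the samples; and
    a determination module configured to determine a cause of the sample abnormality based on the positive and negative samples.
  20. The device according to claim 19, wherein the focus threshold includes a first focus threshold, there being one or more first focus thresholds;
    the acquisition module is further specifically configured to receive a user's setting operation on the first focus threshold; and
    the display module is further configured to display a mark of the first focus threshold in the sample distribution diagram of the graphical interface, and to distinguish the data display effects of the positive and negative samples based on the first focus threshold.
  21. The device according to claim 20, wherein the first focus threshold includes a first value, and the display module is specifically configured to:
    distinguish the data display effects of the positive and negative samples based on how the detection data of the samples compares with the first value.
  22. The device according to claim 20, wherein the first focus threshold includes a second value and a third value, the second value being less than the third value, and the display module is further specifically configured to:
    distinguish the data display effects of the positive and negative samples based on whether the detection data of the samples is greater than the second value and less than the third value.
  23. The device according to any one of claims 19-22, wherein the data processing device further comprises a screening module;
    the screening module is configured to screen the sample data based on a user's filtering operation on a filter threshold; and
    the display module is further configured to display a distribution diagram of the screened samples on the graphical interface.
  24. The device according to claim 23, wherein the filter threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the proportion of abnormal sub-samples in a sample relative to the total number of sub-samples the sample includes; and the arrival rate indicates the proportion of sub-samples actually inspected in a sample relative to the total number of sub-samples the sample includes.
  25. The device according to claim 23 or 24, wherein the filtering operation includes a setting operation and a selection operation.
  26. The device according to any one of claims 19-25, wherein the characteristic data of the samples includes at least one of product model, inspection station, abnormality type, arrival rate, production equipment, environmental parameter, detection time, or generation time.
  27. The device according to any one of claims 19-26, wherein the detection data of the samples includes at least one of an abnormality rate or a measured parameter.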
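  28. The device according to any one of claims 19-27, wherein the focus threshold further includes a second focus threshold, the number of samples is N, and the acquisition module is specifically configured to:
    arrange the detection data of the N samples in ascending order, and take the median or mean of the detection data of the N samples as a reference focus value;
    determine the second focus threshold based on the reference focus value and the detection data of the N samples; and
    display a mark of the second focus threshold in the sample distribution diagram of the graphical interface, and distinguish the data display effects of the positive and negative samples based on the second focus threshold.
  29. The device according to claim 28, wherein the acquisition module is further specifically configured to perform the following steps:
    step a: among the detection data of the N samples, averaging the detection data that is less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and averaging the detection data that is greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$;
    step b: subtracting the first mean $\mathrm{Mean}_l$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a first mean deviation $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$; subtracting the second mean $\mathrm{Mean}_u$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a second mean deviation $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$; comparing the first mean deviation and the second mean deviation element by element and determining the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$; updating a reference focus index to $k$; and updating the reference focus value to the value of the $k$-th detection data among the sorted detection data of the N samples;
    step c: repeating step a and step b until the value of the reference focus index is the same before and after the update, and determining the second focus threshold based on the detection data corresponding to that reference focus index among the sorted detection data of the N samples.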
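  30. A data processing device, the device comprising:
    an acquisition module configured to obtain sample data, the sample data including characteristic data and detection data of samples;
    a determination module configured to determine a focus threshold based on the detection data of the samples;
    a division module configured to divide the samples into positive and negative samples based on the focus threshold;
    the determination module being further configured to determine a cause of the sample abnormality based on the positive and negative samples.
  31. The device according to claim 30, wherein the focus threshold includes a second focus threshold, the number of samples is N, and the determination module is specifically configured to:
    arrange the detection data of the N samples in ascending order, and take the median or mean of the detection data of the N samples as a reference focus value; and
    determine the second focus threshold based on the reference focus value and the detection data of the N samples.
  32. The device according to claim 31, wherein the determination module is further specifically configured to perform the following steps:
    step a: among the detection data of the N samples, averaging the detection data that is less than or equal to the reference focus value to obtain a first mean $\mathrm{Mean}_l$, and averaging the detection data that is greater than the reference focus value to obtain a second mean $\mathrm{Mean}_u$;
    step b: subtracting the first mean $\mathrm{Mean}_l$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a first mean deviation $\mathrm{DiffLowerMean} = [l_1, l_2, l_3, \ldots, l_N]$; subtracting the second mean $\mathrm{Mean}_u$ from each of the sorted detection data of the N samples and taking the absolute value to obtain a second mean deviation $\mathrm{DiffUpperMean} = [u_1, u_2, u_3, \ldots, u_N]$; comparing the first mean deviation and the second mean deviation element by element and determining the number $k$ of indices $i$ ($i = 1, 2, 3, \ldots, N$) for which $l_i < u_i$; updating a reference focus index to $k$; and updating the reference focus value to the value of the $k$-th detection data among the sorted detection data of the N samples;
    step c: repeating step a and step b until the value of the reference focus index is the same before and after the update, and determining the second focus threshold based on the detection data corresponding to that reference focus index among the sorted detection data of the N samples.
  33. The device according to any one of claims 30-32, wherein the data processing device further comprises a screening module, the screening module being configured to:
    screen the sample data based on a filter threshold.
  34. The device according to claim 33, wherein the filter threshold includes at least one of an abnormality rate threshold, an arrival rate threshold, a production equipment threshold, an environmental parameter threshold, a detection time threshold, or a generation time threshold; each sample includes multiple sub-samples; the abnormality rate indicates the proportion of abnormal sub-samples in a sample relative to the total number of sub-samples the sample includes; and the arrival rate indicates the proportion of sub-samples actually inspected in a sample relative to the total number of sub-samples the sample includes.
  35. The device according to any one of claims 30-34, wherein the characteristic data of the samples includes at least one of product model, inspection station, abnormality type, arrival rate, production equipment, environmental parameter, detection time, or generation time.
  36. The device according to any one of claims 30-35, wherein the detection data of the samples includes at least one of an abnormality rate or a measured parameter.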
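  37. A data processing device, the device comprising a memory and a processor; the memory is coupled to the processor; the memory is configured to store computer program code, and the computer program code includes computer instructions;
    wherein, when the processor executes the computer instructions, the device is caused to perform the data processing method according to any one of claims 1-18.
  38. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when run on a data processing device, causes the data processing device to perform the data processing method according to any one of claims 1-18.
  39. A computer program product comprising a computer program, wherein the computer program, when executed on a data processing device, causes the data processing device to perform the data processing method according to any one of claims 1-18.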
PCT/CN2021/097480 2021-05-31 2021-05-31 Data processing method and device WO2022252079A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/097480 WO2022252079A1 (zh) 2021-05-31 2021-05-31 Data processing method and device
CN202180001379.6A CN115943372A (zh) 2021-05-31 2021-05-31 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/097480 WO2022252079A1 (zh) 2021-05-31 2021-05-31 Data processing method and device

Publications (1)

Publication Number Publication Date
WO2022252079A1 true WO2022252079A1 (zh) 2022-12-08

Family

ID=84322687

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097480 WO2022252079A1 (zh) 2021-05-31 2021-05-31 Data processing method and device

Country Status (2)

Country Link
CN (1) CN115943372A (zh)
WO (1) WO2022252079A1 (zh)

Also Published As

Publication number Publication date
CN115943372A (zh) 2023-04-07

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 17908478

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21943455

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE