WO2023060954A1 - Data processing and data quality inspection method, apparatus, and readable storage medium - Google Patents
- Publication number
- WO2023060954A1 (application PCT/CN2022/105122)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- processed
- prediction
- forgetting
- result
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Definitions
- The present disclosure relates to the field of computer technology, in particular to artificial intelligence technologies such as cloud services and deep learning, and specifically provides a data processing and data quality inspection method, device, electronic equipment, and readable storage medium.
- Data annotation quality inspection refers to the inspection of the quality of the labeled data.
- AI (artificial intelligence) companies and individual developers have a growing demand for data.
- the quality of data labeling has a huge impact on the performance of AI algorithms. Only a large amount of objective and accurate labeled data can help improve the performance of AI algorithms.
- manual quality inspection is usually used to observe and judge whether there is a labeling error in the labeled data.
- this method of manual quality inspection may cause errors due to a large amount of data, manual negligence, etc., and the cost of manual quality inspection is relatively high.
- This disclosure proposes a data processing method and a data quality inspection method that automatically screen the data to be quality-inspected out of the data to be processed, which can reduce the cost of obtaining the data to be inspected and improve the efficiency and accuracy of obtaining it.
- A data processing method, including: acquiring at least one piece of data to be processed, where the at least one piece of data to be processed is labeled data; using a target neural network model to predict the at least one piece of data to be processed a preset number of times, to obtain the prediction result of the at least one piece of data to be processed in each prediction; generating a comparison result sequence of the at least one piece of data to be processed according to its labeling result and its prediction result in each prediction; and determining the data to be quality-inspected among the at least one piece of data to be processed according to the comparison result sequence.
- a data quality inspection method including: acquiring data to be inspected according to a data processing method; performing quality inspection on the data to be inspected to obtain a quality inspection result.
- A data processing device, including: a first acquiring unit, configured to acquire at least one piece of data to be processed, where the at least one piece of data to be processed is labeled data; a prediction unit, configured to use the target neural network model to perform a preset number of predictions on the at least one piece of data to be processed, to obtain its prediction result in each prediction; a generating unit, configured to generate a comparison result sequence of the at least one piece of data to be processed according to its labeling result and its prediction result in each prediction; and a processing unit, configured to determine, according to the comparison result sequence, the data to be quality-inspected among the at least one piece of data to be processed.
- A data quality inspection device, including: a second acquiring unit, configured to acquire data to be inspected by means of the data processing device; and a quality inspection unit, configured to perform quality inspection on the data to be inspected to obtain a quality inspection result.
- An electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
- a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the above method.
- a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
- the present disclosure achieves the purpose of automatically screening the data to be processed to obtain the data to be inspected, can reduce the cost of obtaining the data to be inspected, and improve the efficiency and accuracy of obtaining the data to be inspected.
- FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure.
- FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure.
- FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure.
- FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure.
- FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure.
- FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure.
- FIG. 7 is a block diagram of an electronic device used to implement the data processing or data quality inspection method of the embodiments of the present disclosure.
- FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in Figure 1, the data processing method of the present embodiment specifically includes the following steps:
- In the data processing method of this embodiment, the target neural network model predicts the acquired at least one piece of data to be processed a preset number of times; a comparison result sequence is then generated for the at least one piece of data to be processed from its labeling result and its prediction result in each prediction; finally, the data to be quality-inspected is determined from the at least one piece of data to be processed according to the comparison result sequence. This embodiment thereby screens the data to be inspected out of the data to be processed automatically, which can reduce the cost of obtaining the data to be inspected and improve the efficiency and accuracy of obtaining it.
- the execution subject of the data processing method in this embodiment may be a cloud server or a terminal device.
- the at least one piece of data to be processed acquired by executing S101 is data such as images, texts, and audios that have been manually or automatically marked, that is, besides the original data, it also includes the marking results of the original data.
- the labeling result of the data to be processed in this embodiment may be a category recognition result, an object recognition result, a text recognition result, and the like.
- In this embodiment, at least one piece of data entered at the input end may be used as the at least one piece of data to be processed; alternatively, at least one piece of data corresponding to a search request may be used as the at least one piece of data to be processed.
- An optional implementation is: obtain a quality inspection request, which is sent by the input end and includes data identification information, such as the ID of a data set; then use at least one piece of data corresponding to the data identification information as the at least one piece of data to be processed, for example, all the data in the data set corresponding to the data identification information.
- That is, different data is pre-stored in a database; after the quality inspection request sent by the input end is obtained, the data in the database corresponding to the request is used as the data to be processed. The input end does not need to enter the data itself, which simplifies the operation steps at the input end and improves the efficiency of obtaining the data to be inspected.
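The request-driven acquisition described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `LabeledItem` type, the in-memory `DATABASE` dictionary, and the `data_id` field name are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LabeledItem:
    raw: str    # original data, e.g. a path to an image, text, or audio file
    label: str  # labeling result attached to the original data


# Hypothetical pre-stored database: data set ID -> labeled items.
DATABASE: Dict[str, List[LabeledItem]] = {
    "dataset-001": [
        LabeledItem("img_0.png", "cat"),
        LabeledItem("img_1.png", "dog"),
    ],
}


def acquire_data_to_be_processed(request: dict) -> List[LabeledItem]:
    """Return all items of the data set named by the request's identification info."""
    dataset_id = request["data_id"]  # assumed field name
    return list(DATABASE.get(dataset_id, []))


items = acquire_data_to_be_processed({"data_id": "dataset-001"})
```

Because the request carries only an identifier, the input end never has to upload the data itself, which is the simplification the text describes.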
- the at least one piece of data to be processed acquired by executing S101 has the same data type, for example, the data type of the at least one piece of data to be processed acquired is one of image, text, and audio.
- When executing S102, this embodiment first determines the target neural network model and then uses it to obtain the prediction results of the at least one piece of data to be processed.
- When executing S102, the target neural network model can be determined according to the obtained quality inspection request: in addition to the data identification information, the request further includes model type information, such as a target detection type, a text recognition type, or an image classification type, and the neural network model corresponding to the model type information is used as the target neural network model.
- When executing S102, the target neural network model can also be determined as follows: according to the labeling result of the at least one piece of data to be processed, determine task information that characterizes the training task of the neural network model, where the training task may be a target detection task, a text recognition task, an image classification task, and so on; then use the neural network model corresponding to the determined task information as the target neural network model.
- different neural network models in this embodiment are used to complete different training tasks.
- That is, this embodiment can determine the target neural network model from the labeling results of the data to be processed, without obtaining model type information from the quality inspection request sent by the input end, which further improves the intelligence and efficiency of obtaining the data to be inspected.
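One way to picture the label-driven model selection is the sketch below. The text does not say how task information is derived from a labeling result, so the rules in `infer_task` (a `bbox` key means target detection, a `text` key means text recognition, anything else is image classification) and the `MODEL_REGISTRY` names are purely hypothetical.

```python
from typing import Any, Iterable

# Hypothetical registry: task information -> identifier of a neural network model.
MODEL_REGISTRY = {
    "target_detection": "detector-v1",
    "text_recognition": "ocr-v1",
    "image_classification": "classifier-v1",
}


def infer_task(label: Any) -> str:
    """Guess the training task from the shape of one labeling result (assumed rules)."""
    if isinstance(label, dict) and "bbox" in label:
        return "target_detection"
    if isinstance(label, dict) and "text" in label:
        return "text_recognition"
    return "image_classification"  # e.g. a plain category name


def select_target_model(labels: Iterable[Any]) -> str:
    """Pick the model whose task matches the labeling results of the data set."""
    tasks = {infer_task(label) for label in labels}
    if len(tasks) != 1:
        raise ValueError(f"expected exactly one task, got {sorted(tasks)}")
    return MODEL_REGISTRY[tasks.pop()]
```

For example, `select_target_model(["cat", "dog"])` resolves to the classification entry of the registry without any model type information in the request.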
- After the target neural network model is determined in S102, it can be trained on the at least one piece of data to be processed a preset number of times, so as to obtain the model's prediction result for the at least one piece of data to be processed in each prediction.
- The preset number of times used in S102 may be a fixed, preconfigured value; alternatively, according to a correspondence between task information and numbers of training times, the number of training times corresponding to the task information of the at least one piece of data to be processed may be used as the preset number of times.
- When executing S102 to predict the at least one piece of data to be processed a preset number of times with the target neural network model, a distributed training method can be used: multiple nodes use the target neural network model to predict different pieces of data to be processed, and each node saves the prediction results of its data while also recording the number of training times and the node serial number.
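The bookkeeping in the distributed case can be sketched as below. This only illustrates what each node might record; the round-robin sharding, the record layout, and the `toy_predict` stand-in for the model are assumptions, and the nodes are simulated sequentially in one process.

```python
from typing import Callable, Dict, List


def shard(item_ids: List[str], num_nodes: int) -> List[List[str]]:
    """Split the data round-robin so that each node predicts different items."""
    return [item_ids[i::num_nodes] for i in range(num_nodes)]


def run_node(node_id: int, item_ids: List[str],
             predict: Callable[[str, int], str], rounds: int) -> List[Dict]:
    """Predict every assigned item once per round and keep a record of each result."""
    records = []
    for round_no in range(1, rounds + 1):
        for item_id in item_ids:
            records.append({
                "node": node_id,    # node serial number
                "round": round_no,  # number of training times so far
                "item": item_id,
                "prediction": predict(item_id, round_no),
            })
    return records


# Toy stand-in for the model: predicts "cat" on odd rounds, "dog" on even rounds.
def toy_predict(item_id: str, round_no: int) -> str:
    return "cat" if round_no % 2 else "dog"


shards = shard(["a", "b", "c"], num_nodes=2)
records = [rec for node_id, ids in enumerate(shards)
           for rec in run_node(node_id, ids, toy_predict, rounds=2)]
```

Keeping the round number and node serial number in every record lets the per-item predictions be reassembled in order later, whichever node produced them.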
- An optional implementation is: compare the labeling result of the at least one piece of data to be processed with its prediction result in each prediction, to obtain, for each prediction, a comparison result indicating whether the prediction was correct or wrong; if the prediction result is consistent with the labeling result, the comparison result indicates a correct prediction, otherwise it indicates a wrong prediction; then generate the comparison result sequence of the at least one piece of data to be processed from the comparison results of all predictions.
- The comparison result sequence generated in this embodiment reflects how the at least one piece of data to be processed was predicted while the target neural network model was trained a preset number of times, so that the labeling quality of the data to be processed can be determined from the comparison result sequence.
- The comparison result sequence of data 1 generated by executing S103 in this embodiment is {prediction correct, prediction wrong, prediction wrong, prediction correct, prediction wrong, prediction wrong}.
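The comparison step can be sketched in a few lines. Encoding a correct prediction as `True` and a wrong prediction as `False` is a choice made for this sketch, and the label and prediction values are illustrative only.

```python
from typing import List, Sequence


def comparison_sequence(label: str, predictions: Sequence[str]) -> List[bool]:
    """True marks a correct prediction, False a wrong prediction."""
    return [prediction == label for prediction in predictions]


# Six predictions for one item labeled "cat" (illustrative values).
sequence = comparison_sequence("cat", ["cat", "dog", "dog", "cat", "dog", "dog"])
```

With these values, `sequence` has the same correct/wrong pattern as the example sequence given for data 1.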
- the number of data to be inspected determined by executing S104 may be one or multiple.
- The number of prediction errors in the comparison result sequence can be determined, and the data to be processed whose number of prediction errors exceeds a preset threshold is then used as the data to be quality-inspected.
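A minimal sketch of this error-count screening, assuming comparison results are stored as booleans (`False` for a wrong prediction) and keyed by a data ID; the threshold value used here is arbitrary.

```python
from typing import Dict, List


def select_by_error_count(sequences: Dict[str, List[bool]], threshold: int) -> List[str]:
    """Return the IDs whose number of wrong predictions exceeds the threshold."""
    return [item_id for item_id, seq in sequences.items()
            if seq.count(False) > threshold]


sequences = {
    "data 1": [True, False, False, True, False, False],  # 4 errors
    "data 2": [True, True, True, True, False, True],     # 1 error
}
to_inspect = select_by_error_count(sequences, threshold=3)
```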
- That is, this embodiment determines the labeling quality of the data to be processed from the generated comparison result sequence and uses the data with poor labeling quality (multiple prediction errors) as the data to be quality-inspected, thereby screening the data to be inspected out of the at least one piece of data to be processed. The determined data to be inspected is then returned to the input end, so that the input end can confirm or relabel it.
- FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in Figure 2, the present embodiment specifically includes the following steps when performing "S104 determining the data to be quality checked in the at least one data to be processed according to the comparison result sequence of the at least one data to be processed":
- the "number of times of forgetting" of the data to be processed refers to the number of times the preset sequence of comparison results appears in the sequence of comparison results obtained after the data to be processed has undergone multiple predictions by the target neural network model .
- The number of times of forgetting of the data to be processed is obtained from its comparison result sequence and is then used to determine the data to be quality-inspected from the at least one piece of data to be processed; deriving the number of times of forgetting from the comparison result sequence can improve the accuracy of the determined data to be inspected.
- An optional implementation is: count the number of times a preset comparison result sequence appears in the comparison result sequence of the at least one piece of data to be processed, and use the counted number as the number of times of forgetting of that data.
- For example, if the comparison result sequence of the data to be processed is {correct prediction, wrong prediction, correct prediction, correct prediction, wrong prediction, wrong prediction} and the preset comparison result sequence is "correct prediction, wrong prediction", the number of times of forgetting obtained by executing S201 in this embodiment is 2; ... the number of times of forgetting is 1.
- An optional implementation is: when it is determined that the comparison result sequence of the at least one piece of data to be processed contains no comparison result indicating a correct prediction, mark the number of times of forgetting of that data as a preset number of times of forgetting; in this embodiment the preset number of times of forgetting may be -1.
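Counting the number of times of forgetting, including the special value for data that was never predicted correctly, can be sketched as follows. The boolean encoding (`True` = correct prediction) and the sliding-window match are assumptions of this sketch; the default pattern corresponds to "correct prediction, wrong prediction".

```python
from typing import Sequence

NEVER_CORRECT = -1  # preset value for data that was never predicted correctly


def forgetting_count(seq: Sequence[bool],
                     pattern: Sequence[bool] = (True, False)) -> int:
    """Count occurrences of `pattern` in `seq` (True = correct, False = wrong)."""
    if True not in seq:
        return NEVER_CORRECT
    width = len(pattern)
    return sum(
        1
        for start in range(len(seq) - width + 1)
        if tuple(seq[start:start + width]) == tuple(pattern)
    )
```

For the sequence {correct, wrong, correct, correct, wrong, wrong}, encoded as `[True, False, True, True, False, False]`, the pattern matches at the first and fourth positions, giving a count of 2.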
- When executing S202 to determine the data to be quality-inspected according to the obtained numbers of times of forgetting, the at least one piece of data to be processed can be sorted by number of times of forgetting from high to low, and the first N pieces of data are used as the data to be inspected, where N is a positive integer greater than or equal to 1.
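The top-N selection can be sketched as below, assuming the numbers of times of forgetting are held in a dictionary keyed by data ID. The text does not state how data marked with the preset value -1 should be ranked; this sketch simply sorts numerically, so such data ends up last.

```python
from typing import Dict, List


def top_n_by_forgetting(counts: Dict[str, int], n: int) -> List[str]:
    """Sort by number of times of forgetting, high to low, and keep the first n IDs."""
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in ranked[:n]]


counts = {"a": 2, "b": 0, "c": 3}
to_inspect = top_n_by_forgetting(counts, n=2)
```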
- An optional implementation is: for each piece of data to be processed, obtain the output result of the target neural network model when it predicted that data for the last time, for example the highest prediction probability output in that prediction; then determine the data to be quality-inspected according to the output result and the number of times of forgetting of the at least one piece of data to be processed.
- An optional implementation is: according to the output result and the number of times of forgetting, obtain a probability score that the data to be processed is labeled incorrectly, for example by adding or multiplying the output result and the number of times of forgetting; then sort the data to be processed by probability score and use the top M pieces as the data to be inspected, where M is a positive integer greater than or equal to 1.
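A sketch of the score-based ranking follows. The text only says the score comes from adding or multiplying the output result and the number of times of forgetting; sorting in descending order of score, and the tuple layout `(top_probability, forgetting_count)`, are assumptions made here.

```python
from typing import Dict, List, Tuple


def mislabel_score(top_probability: float, forgetting: int,
                   combine: str = "multiply") -> float:
    """Combine the last-round top probability with the number of times of forgetting."""
    if combine == "add":
        return top_probability + forgetting
    if combine == "multiply":
        return top_probability * forgetting
    raise ValueError(f"unknown combine mode: {combine}")


def top_m(items: Dict[str, Tuple[float, int]], m: int,
          combine: str = "multiply") -> List[str]:
    """Rank item IDs by score (assumed: highest first) and keep the first m."""
    scores = {item_id: mislabel_score(prob, count, combine)
              for item_id, (prob, count) in items.items()}
    return sorted(scores, key=scores.get, reverse=True)[:m]


# item ID -> (highest probability in the last prediction, number of times of forgetting)
items = {"a": (0.9, 2), "b": (0.6, 1), "c": (0.5, 3)}
```

The two combination modes can rank the same data differently: with these illustrative values, multiplying puts `a` ahead of `c`, while adding puts `c` ahead of `a`.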
- The following method can also be used: determine the labeling accuracy rate of the at least one piece of data to be processed according to its number of times of forgetting, and generate a statistical chart, such as a histogram, from the at least one piece of data to be processed and its labeling accuracy rate.
- the correct rate of labeling of at least one piece of data to be processed may be determined according to the corresponding relationship between the number of times of forgetting and the correct rate of labeling, and the number of times of forgetting of at least one piece of data to be processed.
- The labeling accuracy rate of data to be processed with a number of times of forgetting of -1 is between 0 and 0.2; the labeling accuracy rate of data with a number of times of forgetting of less than 2 is between 0.8 and 1; and the labeling accuracy rate of data with a number of times of forgetting at or above that value is between 0.2 and 0.8. The data to be processed can be divided into 4 equal parts by number of times of forgetting from high to low, with the labeling accuracy rates of the parts being 0 to 0.2, 0.2 to 0.4, 0.4 to 0.6, and 0.6 to 0.8 respectively.
- FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in Figure 3, the data quality inspection method of this embodiment specifically includes the following steps:
- In the data quality inspection method of this embodiment, the data to be inspected is obtained according to the data processing method disclosed in the first and second embodiments of the present disclosure. Because the data to be inspected is screened automatically, the efficiency and accuracy of data quality inspection can be improved and its cost reduced.
- When executing S302 to perform quality inspection on the acquired data to be inspected and obtain the quality inspection result, the data to be inspected can be sent to the input end, and the labeling result produced when the input end relabels that data is then used as the quality inspection result of the data to be quality-inspected.
- FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure.
- FIG. 4 shows the operation flowchart of the data processing system of this embodiment. The data processing system includes an interactive display layer, a business layer, a service layer, a task scheduling layer, and a data layer. The interactive display layer obtains the quality inspection request entered at the input end and displays the data to be inspected sent by the task scheduling layer; the business layer initiates a request to the service layer according to the quality inspection request obtained by the interactive display layer; the service layer obtains the data to be processed from the data layer according to the request and detects the number of times of forgetting of the data to be processed; the data to be quality-inspected is then sent to the interactive display layer.
- FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. As shown in Figure 5, the data processing device 500 of this embodiment includes:
- the first acquiring unit 501 is configured to acquire at least one piece of data to be processed, where the at least one piece of data to be processed is labeled data to be processed;
- the prediction unit 502 is configured to use the target neural network model to perform a preset number of predictions on the at least one data to be processed, and obtain a prediction result of each prediction of the at least one data to be processed;
- a generation unit 503 configured to generate a comparison result sequence of the at least one data to be processed according to the labeling result of the at least one data to be processed and the prediction result of the at least one data to be processed in each prediction;
- the processing unit 504 is configured to determine the data to be quality checked in the at least one data to be processed according to the comparison result sequence of the at least one data to be processed.
- the data processing apparatus in this embodiment may be located in a cloud server, or may be located in a terminal device.
- the at least one piece of data to be processed acquired by the first acquisition unit 501 is data such as images, texts, and audios that have been manually or automatically labeled, that is, besides the original data, it also includes the labeling results of the original data.
- the labeling results of the data to be processed acquired by the first acquiring unit 501 may be category recognition results, object recognition results, text recognition results, and the like.
- The first acquisition unit 501 may use at least one piece of data entered at the input end as the at least one piece of data to be processed; alternatively, it may use at least one piece of data corresponding to a search request as the at least one piece of data to be processed.
- An optional implementation is: acquire a quality inspection request, which is sent by the input end and includes data identification information; then use at least one piece of data corresponding to the data identification information as the at least one piece of data to be processed.
- That is, the first acquisition unit 501 pre-stores different data in a database and, after acquiring the quality inspection request sent by the input end, uses the data in the database corresponding to the request as the data to be processed. The input end does not need to enter the data itself, which simplifies the operation steps at the input end and improves the efficiency of obtaining the data to be inspected.
- At least one piece of data to be processed acquired by the first acquiring unit 501 has the same data type.
- The prediction unit 502 uses the target neural network model to perform a preset number of predictions on the at least one piece of data to be processed, to obtain the prediction result of the at least one piece of data to be processed in each prediction.
- the prediction unit 502 first determines a target neural network model, and then uses the target neural network model to obtain a prediction result of at least one data to be processed.
- The prediction unit 502 can determine the target neural network model according to the obtained quality inspection request: in addition to the data identification information, the request further includes model type information, and the neural network model corresponding to the model type information is used as the target neural network model.
- The data processing device 500 of this embodiment may further include a determination unit 505, configured to determine the target neural network model as follows: according to the labeling result of the at least one piece of data to be processed, determine task information that characterizes the training task of the neural network model; then use the neural network model corresponding to the determined task information as the target neural network model.
- That is, the determination unit 505 can determine the target neural network model according to the labeling result of the data to be processed, thereby further improving the intelligence and efficiency of obtaining the data to be inspected.
- After the prediction unit 502 determines the target neural network model, it can train the target neural network model on the at least one piece of data to be processed a preset number of times, so as to obtain the model's prediction result for the at least one piece of data to be processed in each prediction.
- The preset number of times in the prediction unit 502 may be a fixed, preconfigured value; alternatively, according to a correspondence between task information and numbers of training times, the number of training times corresponding to the task information of the at least one piece of data to be processed may be used as the preset number of times.
- When the prediction unit 502 uses the target neural network model to predict the at least one piece of data to be processed a preset number of times, it may adopt a distributed training method: multiple nodes use the target neural network model to predict different pieces of data to be processed, and each node saves the prediction results of its data while also recording the number of training times and the node serial number.
- The generation unit 503 generates the comparison result sequence of the at least one piece of data to be processed according to its labeling result and its prediction result in each prediction.
- An optional implementation is: compare the labeling result of the at least one piece of data to be processed with its prediction result in each prediction, to obtain, for each prediction, a comparison result indicating whether the prediction was correct or wrong; then generate the comparison result sequence of the at least one piece of data to be processed from the comparison results of all predictions.
- The comparison result sequence generated by the generation unit 503 reflects how the at least one piece of data to be processed was predicted while the target neural network model was trained a preset number of times, so that the labeling quality of the data to be processed can be determined from the comparison result sequence.
- the processing unit 504 determines the data to be quality checked in the at least one data to be processed according to the comparison result sequence of the at least one data to be processed.
- the number of data to be inspected determined by the processing unit 504 may be one or multiple.
- When the processing unit 504 determines the data to be quality inspected according to the comparison result sequence of the data to be processed, it can count the prediction errors in the comparison result sequence and then take the data to be processed whose number of prediction errors exceeds a preset threshold as the data to be quality inspected.
- In other words, the processing unit 504 determines the labeling quality of the data to be processed from the generated comparison result sequence and takes the data with poor labeling quality (prediction errors occurring multiple times) as the data to be quality inspected. This screens the data to be quality inspected out of the at least one data to be processed; the determined data is then returned to the input end so that the input end can confirm or relabel it.
- When the processing unit 504 determines the data to be quality inspected among the at least one data to be processed according to the comparison result sequence, it may also do the following: obtain the forgetting count of the at least one data to be processed according to its comparison result sequence; and determine the data to be quality inspected among the at least one data to be processed according to the forgetting count.
- In other words, the processing unit 504 obtains the forgetting count of the data to be processed from its comparison result sequence and then uses the forgetting count to determine the data to be quality inspected. Deriving the forgetting count from the comparison result sequence can improve the accuracy of the determined data to be quality inspected.
- Specifically, when the processing unit 504 obtains the forgetting count of the at least one data to be processed according to its comparison result sequence, an optional implementation is: count the number of times a preset order of comparison results appears in the comparison result sequence, and take the counted number as the forgetting count of the at least one data to be processed.
- Another optional implementation is: when it is determined that the comparison result sequence of the at least one data to be processed contains no comparison result indicating a correct prediction, mark the forgetting count of the at least one data to be processed as a preset forgetting count. The preset forgetting count in this embodiment may be -1.
- When the processing unit 504 determines the data to be quality inspected according to the obtained forgetting counts, it can sort the at least one data to be processed by forgetting count from high to low and take the top N data to be processed as the data to be quality inspected, where N is a positive integer greater than or equal to 1.
- Another optional implementation is: for each data to be processed, obtain the output result of the target neural network model at its last prediction of that data; then determine the data to be quality inspected among the at least one data to be processed according to the output results and the forgetting counts.
- When the data to be quality inspected is determined from the output results and the forgetting counts, an optional implementation is: obtain, from the output result and the forgetting count, a probability score that the data to be processed is labeled incorrectly; sort the at least one data to be processed by probability score from high to low; and take the top M data to be processed as the data to be quality inspected, where M is a positive integer greater than or equal to 1.
- When the processing unit 504 determines the data to be quality inspected according to the forgetting count of the at least one data to be processed, it may also do the following: determine the labeling accuracy of the at least one data to be processed according to its forgetting count; and use the at least one data to be processed together with its labeling accuracy to generate a statistical chart, such as a histogram.
- The processing unit 504 may determine the labeling accuracy of the at least one data to be processed from its forgetting count, according to a correspondence between forgetting counts and labeling accuracy.
- FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in FIG. 6, the data quality inspection device 600 of this embodiment includes:
- the second acquiring unit 602 is configured to acquire the data to be inspected
- the quality inspection unit 603 is configured to perform quality inspection on the data to be inspected to obtain a quality inspection result.
- The second acquisition unit 602 acquires the data to be quality inspected by means of the data processing device 500 of the fifth embodiment of the present disclosure. Because the data to be quality inspected is screened automatically, the efficiency and accuracy of data quality inspection can be improved and its cost reduced.
- When the quality inspection unit 603 performs quality inspection on the acquired data to be quality inspected, it can send the data to the input end and then take the result of the input end relabeling the sent data as the quality inspection result of the data to be quality inspected.
- the acquisition, storage and application of the user's personal information involved are in compliance with relevant laws and regulations, and do not violate public order and good customs.
- the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
- FIG. 7 is a block diagram of an electronic device for the data processing or data quality inspection method according to an embodiment of the present disclosure.
- The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
- The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
- As shown in FIG. 7, the device 700 includes a computing unit 701 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random-access memory (RAM) 703. The RAM 703 can also store various programs and data necessary for the operation of the device 700.
- the computing unit 701 , ROM 702 and RAM 703 are connected to each other through a bus 704 .
- An input/output (I/O) interface 705 is also connected to the bus 704 .
- Multiple components of the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver.
- the communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
- The computing unit 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller.
- the computing unit 701 executes various methods and processes described above, such as data processing or data quality inspection methods.
- data processing or data quality inspection methods may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708 .
- part or all of the computer program may be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709 .
- When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing or data quality inspection method described above can be performed.
- the computing unit 701 may be configured in any other appropriate way (for example, by means of firmware) to execute data processing or data quality inspection methods.
- Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
- These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
- Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
- The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
- a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
- To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
- Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
- The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, eg, a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.
- a computer system may include clients and servers.
- Clients and servers are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
- The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS ("Virtual Private Server") services.
- the server can also be a server of a distributed system, or a server combined with a blockchain.
- steps may be reordered, added or deleted using the various forms of flow shown above.
- each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.
Abstract
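A data processing and data quality inspection method, device, and readable storage medium are provided, relating to artificial intelligence fields such as cloud services and deep learning. The data processing method includes: acquiring at least one data to be processed, the at least one data to be processed being labeled data (S101); using a target neural network model to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction (S102); generating a comparison result sequence of the at least one data to be processed according to its labeling result and its prediction result at each prediction (S103); and determining, according to the comparison result sequence, the data to be quality inspected among the at least one data to be processed (S104). The data quality inspection method includes: acquiring the data to be quality inspected according to the data processing method; and performing quality inspection on the data to be quality inspected to obtain a quality inspection result.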
Description
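This application claims priority to the Chinese patent application filed on October 14, 2021, with application number 202111197122.0 and entitled "Data processing and data quality inspection method, device and readable storage medium".
The present disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as cloud services and deep learning, and specifically provides a data processing and data quality inspection method, device, electronic device, and readable storage medium.
Data labeling quality inspection refers to checking the quality of labeled data. With the rapid development of artificial intelligence (AI) technology, AI companies and individual developers have a growing demand for data. The labeling quality of data has a large impact on the performance of AI algorithms; only a large amount of objective, accurately labeled data helps improve that performance.
In the related art, manual quality inspection is usually used to observe and judge whether labeled data contains labeling errors. However, manual quality inspection may make mistakes because of large data volumes, human oversight, and similar causes, and its cost is high.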
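Summary of the Invention
To solve the technical problems of high cost and poor accuracy that arise in the related art when data to be quality inspected is obtained and inspected manually, the present disclosure proposes a data processing and data quality inspection method that automatically screens the data to be quality inspected from the data to be processed. This can reduce the cost of obtaining the data to be quality inspected and improve the efficiency and accuracy of obtaining it.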
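According to a first aspect of the present disclosure, a data processing method is provided, including: acquiring at least one data to be processed, the at least one data to be processed being labeled data; using a target neural network model to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction; generating a comparison result sequence of the at least one data to be processed according to the labeling result of the at least one data to be processed and its prediction result at each prediction; and determining, according to the comparison result sequence of the at least one data to be processed, the data to be quality inspected among the at least one data to be processed.
According to a second aspect of the present disclosure, a data quality inspection method is provided, including: acquiring data to be quality inspected according to the data processing method; and performing quality inspection on the data to be quality inspected to obtain a quality inspection result.
According to a third aspect of the present disclosure, a data processing device is provided, including: a first acquisition unit configured to acquire at least one data to be processed, the at least one data to be processed being labeled data; a prediction unit configured to use a target neural network model to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction; a generation unit configured to generate a comparison result sequence of the at least one data to be processed according to its labeling result and its prediction result at each prediction; and a processing unit configured to determine, according to the comparison result sequence of the at least one data to be processed, the data to be quality inspected among the at least one data to be processed.
According to a fourth aspect of the present disclosure, a data quality inspection device is provided, including: a second acquisition unit configured to acquire data to be quality inspected according to the data processing device; and a quality inspection unit configured to perform quality inspection on the data to be quality inspected to obtain a quality inspection result.
According to a fifth aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
According to a sixth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause the computer to perform the method described above.
According to a seventh aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described above.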
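As can be seen from the above technical solutions, the present disclosure automatically screens the data to be quality inspected from the data to be processed, which can reduce the cost of obtaining the data to be quality inspected and improve the efficiency and accuracy of obtaining it.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
The accompanying drawings are provided for a better understanding of the solution and do not limit the present disclosure. In the drawings: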
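FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device used to implement the data processing or data quality inspection method of the embodiments of the present disclosure.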
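Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in FIG. 1, the data processing method of this embodiment includes the following steps: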
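S101: acquire at least one data to be processed, the at least one data to be processed being labeled data;
S102: use a target neural network model to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction;
S103: generate a comparison result sequence of the at least one data to be processed according to the labeling result of the at least one data to be processed and its prediction result at each prediction;
S104: determine, according to the comparison result sequence of the at least one data to be processed, the data to be quality inspected among the at least one data to be processed.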
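In the data processing method of this embodiment, after the target neural network model has predicted the acquired at least one data to be processed a preset number of times, a comparison result sequence is generated for the at least one data to be processed from its labeling result and its prediction result at each prediction, and the data to be quality inspected is then determined from the at least one data to be processed according to that sequence. This embodiment thus automatically screens the data to be quality inspected from the data to be processed, which can reduce the cost of obtaining the data to be quality inspected and improve the efficiency and accuracy of obtaining it.
The data processing method of this embodiment may be executed by a cloud server or by a terminal device.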
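The at least one data to be processed acquired in S101 is data such as images, text, or audio that has been labeled manually or automatically; that is, in addition to the raw data, it also contains the labeling result of the raw data. The labeling result of the data to be processed in this embodiment may be a category recognition result, a target recognition result, a text recognition result, and so on.
When acquiring the at least one data to be processed in S101, this embodiment may take at least one piece of data input by the input end as the at least one data to be processed; it may also, according to a quality inspection request sent by the input end, take at least one piece of data in a database that corresponds to the received request as the at least one data to be processed.
An optional implementation of S101 is: acquire a quality inspection request that is sent by the input end and contains data identification information, which may be, for example, the ID of a dataset; and take the at least one piece of data corresponding to the acquired data identification information as the at least one data to be processed, for example by taking all data in the dataset corresponding to the data identification information.
In other words, this embodiment stores different data in the database in advance and, after acquiring the quality inspection request sent by the input end, takes the data in the database corresponding to the request as the data to be processed. The input end does not need to input the data itself, which simplifies its operation and improves the efficiency of obtaining the data to be quality inspected.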
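The request-driven acquisition in S101 can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `LabeledItem` type, the in-memory `DATABASE` dictionary, and the `data_id` request field are hypothetical names, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class LabeledItem:
    item_id: str
    raw: str    # raw data, e.g. an image path or a text snippet
    label: str  # labeling result attached to the raw data


# Hypothetical stand-in for the database: dataset ID -> labeled items.
DATABASE = {
    "dataset-001": [
        LabeledItem("a", "cat.jpg", "cat"),
        LabeledItem("b", "dog.jpg", "dog"),
    ],
}


def get_data_to_process(request: dict) -> list:
    """Return every item of the dataset named by the request's data identification info."""
    dataset_id = request["data_id"]
    return list(DATABASE.get(dataset_id, []))


items = get_data_to_process({"data_id": "dataset-001"})
```

The input end only sends an identifier; the labeled items themselves never travel with the request, which is the simplification the paragraph above describes.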
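It can be understood that the at least one data to be processed acquired in S101 has the same data type; for example, the data type of the acquired data is one of image, text, audio, and so on.
After acquiring the at least one data to be processed in S101, this embodiment executes S102: the target neural network model is used to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction.
When executing S102, this embodiment first determines the target neural network model and then uses it to obtain the prediction results of the at least one data to be processed.
When executing S102, this embodiment may determine the target neural network model according to the acquired quality inspection request. That is, in addition to the data identification information, the request further contains model type information, such as an object detection type, a text recognition type, or an image classification type, and the neural network model corresponding to the model type information is taken as the target neural network model.
In addition, when executing S102, this embodiment may also determine the target neural network model as follows: according to the labeling result of the at least one data to be processed, determine task information characterizing the training task of a neural network model, where the training task may include an object detection task, a text recognition task, an image classification task, and so on; then take the neural network model corresponding to the determined task information as the target neural network model. Different neural network models in this embodiment are used to complete different training tasks.
In other words, when no model type information is obtained from the quality inspection request sent by the input end, this embodiment can determine the target neural network model from the labeling results of the data to be processed, which further improves the intelligence and efficiency of obtaining the data to be quality inspected.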
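After the target neural network model is determined in S102, it is run on the at least one data to be processed for the preset number of training rounds, thereby obtaining the prediction result of the at least one data to be processed at each prediction.
The preset number of times used in S102 may be a number set in advance; alternatively, according to a correspondence between task information and training times, the training times corresponding to the task information of the at least one data to be processed may be used as the preset number of times.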
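The two lookups described above, from labeling results to a task and from a task to a model and a preset number of times, could be sketched as follows. The label encoding checked in `infer_task`, the registry contents, and the numbers are all hypothetical; the disclosure does not specify them.

```python
# Hypothetical registries; names and numbers are illustrative only.
MODEL_BY_TASK = {
    "object_detection": "detector-model",
    "text_recognition": "ocr-model",
    "image_classification": "classifier-model",
}
TRAINING_TIMES_BY_TASK = {
    "object_detection": 12,
    "text_recognition": 8,
    "image_classification": 6,
}


def infer_task(label) -> str:
    """Guess the training task from the shape of one labeling result (assumed encoding)."""
    if isinstance(label, dict) and "boxes" in label:
        return "object_detection"
    if isinstance(label, dict) and "text" in label:
        return "text_recognition"
    return "image_classification"  # a plain category label


def select_model_and_rounds(labels):
    """Pick the target model and the preset number of times for one batch of labels."""
    tasks = {infer_task(label) for label in labels}
    if len(tasks) != 1:
        raise ValueError("labels do not agree on a single task")
    task = tasks.pop()
    return MODEL_BY_TASK[task], TRAINING_TIMES_BY_TASK[task]


model, rounds = select_model_and_rounds(["cat", "dog"])
```

Rejecting mixed batches is a design choice of this sketch; it follows from the statement that all data to be processed in one request share a data type.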
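When using the target neural network model in S102 to predict the at least one data to be processed a preset number of times, this embodiment may adopt a distributed training approach: multiple nodes each use the target neural network model to predict different data to be processed, and each node saves the prediction results of its data while also recording the training count and the node serial number.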
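A rough single-machine sketch of that bookkeeping is shown below. Threads stand in for nodes, and `stub_predict` is a hypothetical placeholder for the target neural network model; a real deployment would use an actual distributed training framework, which the disclosure does not name.

```python
from concurrent.futures import ThreadPoolExecutor


def stub_predict(item: str, round_no: int) -> str:
    """Hypothetical stand-in for one prediction of the target model."""
    return f"{item}-pred-{round_no}"


def run_node(node_id: int, shard: list, num_rounds: int) -> list:
    """One node predicts its own shard and records round number and node serial number."""
    records = []
    for round_no in range(1, num_rounds + 1):
        for item in shard:
            records.append({
                "node": node_id,
                "round": round_no,
                "item": item,
                "prediction": stub_predict(item, round_no),
            })
    return records


def distributed_predict(items: list, num_nodes: int, num_rounds: int) -> list:
    # Each node receives a different slice of the data to be processed.
    shards = [items[i::num_nodes] for i in range(num_nodes)]
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        per_node = pool.map(run_node, range(num_nodes), shards, [num_rounds] * num_nodes)
        return [record for node_records in per_node for record in node_records]


records = distributed_predict(["x1", "x2", "x3"], num_nodes=2, num_rounds=2)
```

Keeping the node serial number and round number on every record lets the per-item prediction history be reassembled later, whichever node produced it.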
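After obtaining the prediction result of the at least one data to be processed at each prediction in S102, this embodiment executes S103: a comparison result sequence of the at least one data to be processed is generated according to its labeling result and its prediction result at each prediction.
Specifically, an optional implementation of S103 is: compare the labeling result of the at least one data to be processed with its prediction result at each prediction to obtain, for each prediction, a comparison result indicating whether the prediction is correct or wrong. If the prediction result is consistent with the labeling result, the comparison result indicates a correct prediction; otherwise it indicates a wrong prediction. The comparison result sequence of the at least one data to be processed is then generated from these per-prediction comparison results.
In other words, the comparison result sequence generated in this embodiment reflects how the target neural network model predicted the at least one data to be processed over the preset number of training rounds, so that the labeling quality of the data to be processed can be determined from the sequence.
For example, suppose the data to be processed is data 1, the preset number of times is 6, and the prediction results obtained by using the target neural network model to predict data 1 are result 1, result 2, result 3, result 4, result 5, and result 6. If only result 1 and result 4 are consistent with the labeling result, the comparison result sequence of data 1 generated in S103 is {correct, wrong, wrong, correct, wrong, wrong}.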
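The example above can be reproduced with a few lines of Python. This is only a sketch: encoding "correct" as `True` and "wrong" as `False`, and the concrete label strings, are assumptions for illustration.

```python
def comparison_sequence(label, predictions):
    """Compare the labeling result with each prediction: True = correct, False = wrong."""
    return [prediction == label for prediction in predictions]


# Six predictions for "data 1"; only the 1st and 4th agree with the label.
label = "cat"
predictions = ["cat", "dog", "bird", "cat", "dog", "fish"]
sequence = comparison_sequence(label, predictions)
```

Here `sequence` is `[True, False, False, True, False, False]`, matching the {correct, wrong, wrong, correct, wrong, wrong} sequence in the example.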
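After generating the comparison result sequence of the at least one data to be processed in S103, this embodiment executes S104: the data to be quality inspected among the at least one data to be processed is determined according to the comparison result sequence. The number of data to be quality inspected determined in S104 may be one or more.
When determining the data to be quality inspected from the comparison result sequence in S104, this embodiment can count the prediction errors in the comparison result sequence and then take the data to be processed whose number of prediction errors exceeds a preset threshold as the data to be quality inspected.
In other words, this embodiment determines the labeling quality of the data to be processed from the generated comparison result sequence and takes the data with poor labeling quality (prediction errors occurring multiple times) as the data to be quality inspected. This screens the data to be quality inspected out of the at least one data to be processed; the determined data is then returned to the input end so that the input end can confirm or relabel it.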
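The threshold-based screening in S104 might look like the sketch below, again assuming `True`/`False` comparison results. The item names and the threshold value are made up for the example.

```python
def select_by_error_count(sequences: dict, threshold: int) -> list:
    """Return the items whose number of wrong predictions exceeds the threshold."""
    return [
        item
        for item, sequence in sequences.items()
        if sequence.count(False) > threshold
    ]


sequences = {
    "data1": [True, False, False, True, False, False],  # 4 wrong predictions
    "data2": [True, True, True, True, False, True],     # 1 wrong prediction
}
to_inspect = select_by_error_count(sequences, threshold=2)
```

With a threshold of 2, only `data1` is selected for quality inspection.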
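FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in FIG. 2, when executing "S104: determine, according to the comparison result sequence of the at least one data to be processed, the data to be quality inspected among the at least one data to be processed", this embodiment includes the following steps:
S201: obtain the forgetting count of the at least one data to be processed according to its comparison result sequence;
S202: determine the data to be quality inspected among the at least one data to be processed according to the forgetting count of the at least one data to be processed.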
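In this embodiment, the "forgetting count" (number of times of forgetting) of a data to be processed is the number of times a preset order of comparison results appears in the comparison result sequence obtained after the target neural network model has predicted that data multiple times.
This embodiment obtains the forgetting count of the data to be processed from its comparison result sequence and then uses the forgetting count to determine the data to be quality inspected among the at least one data to be processed. Deriving the forgetting count from the comparison result sequence can improve the accuracy of the determined data to be quality inspected.
Specifically, an optional implementation of S201 is: count the number of times the preset order of comparison results appears in the comparison result sequence of the at least one data to be processed, and take the counted number as the forgetting count of the at least one data to be processed.
For example, suppose the comparison result sequence of a data to be processed is {correct, wrong, correct, correct, wrong, wrong}. If the preset order of comparison results is "correct, wrong", the forgetting count obtained in S201 is 2; if the preset order is "correct, wrong, wrong", the forgetting count obtained in S201 is 1.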
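A sliding-window count reproduces both numbers in the example. This is a sketch under two assumptions: comparison results are encoded as booleans, and overlapping occurrences of the pattern are counted (the text does not say how overlaps should be handled).

```python
def count_forgetting(sequence, pattern):
    """Count how often `pattern` appears as a contiguous run inside `sequence`."""
    sequence, pattern = list(sequence), list(pattern)
    width = len(pattern)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == pattern
    )


# {correct, wrong, correct, correct, wrong, wrong}
sequence = [True, False, True, True, False, False]
two_step = count_forgetting(sequence, [True, False])           # "correct, wrong"
three_step = count_forgetting(sequence, [True, False, False])  # "correct, wrong, wrong"
```

`two_step` is 2 and `three_step` is 1, as in the example: each match marks a point where the model had the item right and then got it wrong again.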
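When the labeling quality of a data to be processed is poor, every prediction result the target neural network model produces for it may be wrong, so the preset order of comparison results may never appear and no forgetting count can be obtained from it.
To ensure that a forgetting count can always be obtained and to improve its accuracy, an optional implementation of S201 is: when it is determined that the comparison result sequence of the at least one data to be processed contains no comparison result indicating a correct prediction, mark the forgetting count of the at least one data to be processed as a preset forgetting count. The preset forgetting count in this embodiment may be -1.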
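The sentinel rule can be layered on top of the pattern count, as in the sketch below. The default pattern "correct, wrong" and the boolean encoding are assumptions carried over for illustration; -1 is the preset value the text suggests.

```python
PRESET_FORGETTING_COUNT = -1  # value suggested in the text for never-correct items


def forgetting_count(sequence, pattern=(True, False)):
    """Forgetting count with the -1 sentinel for sequences without any correct prediction."""
    sequence, pattern = list(sequence), list(pattern)
    if not any(sequence):
        # No comparison result indicates a correct prediction.
        return PRESET_FORGETTING_COUNT
    width = len(pattern)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == pattern
    )


never_learned = forgetting_count([False, False, False, False])
sometimes_forgotten = forgetting_count([True, False, True, True, False, False])
always_correct = forgetting_count([True, True, True])
```

The sentinel keeps never-learned items distinguishable from items that were learned and never forgotten, which would otherwise both count 0.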
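After obtaining the forgetting count of the at least one data to be processed in S201, this embodiment executes S202: the data to be quality inspected among the at least one data to be processed is determined according to the obtained forgetting counts.
When determining the data to be quality inspected from the forgetting counts in S202, this embodiment can sort the at least one data to be processed by forgetting count from high to low and take the top N data to be processed as the data to be quality inspected, where N is a positive integer greater than or equal to 1.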
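The top-N selection is a plain descending sort, sketched here with made-up item names and counts.

```python
def top_n_by_forgetting(counts: dict, n: int) -> list:
    """Sort items by forgetting count, high to low, and keep the first n."""
    ranked = sorted(counts, key=lambda item: counts[item], reverse=True)
    return ranked[:n]


counts = {"a": 2, "b": 0, "c": 3, "d": 1}
to_inspect = top_n_by_forgetting(counts, n=2)
```

Here `to_inspect` is `["c", "a"]`. Python's sort is stable, so items with equal counts keep their original relative order; the text addresses such ties separately through the model's output result.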
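In practice, several data to be processed may have the same forgetting count while differing in importance, so determining the data to be quality inspected directly from the forgetting counts may miss some of the more important data to be processed.
To avoid missing the more important data to be processed and to improve the accuracy of the determined data to be quality inspected, an optional implementation of S202 is: for each data to be processed, obtain the output result of the target neural network model at its last prediction of that data, for example the highest prediction probability output by the model at the last prediction; then determine the data to be quality inspected among the at least one data to be processed according to the output results and the forgetting counts.
When determining the data to be quality inspected from the output results and the forgetting counts in S202, an optional implementation is: obtain, from the output result and the forgetting count, a probability score that the data to be processed is labeled incorrectly, where this embodiment may add or multiply the output result and the forgetting count to obtain the score; sort the at least one data to be processed by probability score from high to low; and take the top M data to be processed as the data to be quality inspected, where M is a positive integer greater than or equal to 1.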
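A small sketch of the scoring step, covering both combinations the text mentions (addition and multiplication). The `mode` parameter, the tuple layout, and the sample values are illustrative assumptions.

```python
def mislabel_score(output: float, forgetting: int, mode: str = "add") -> float:
    """Combine the last-prediction output with the forgetting count."""
    if mode == "add":
        return output + forgetting
    if mode == "multiply":
        return output * forgetting
    raise ValueError(f"unknown mode: {mode}")


def top_m_by_score(items: dict, m: int, mode: str = "add") -> list:
    """items maps a name to (highest probability at last prediction, forgetting count)."""
    scores = {
        name: mislabel_score(output, forgetting, mode)
        for name, (output, forgetting) in items.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:m]


items = {"a": (0.9, 2), "b": (0.6, 2), "c": (0.8, 1)}
to_inspect = top_m_by_score(items, m=2)
```

Items `a` and `b` share a forgetting count of 2, and the model's output separates them, which is the tie-breaking effect the passage is after.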
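To give the input end more intuitive feedback on how the data to be processed is labeled, and to help the input end select the data to be quality inspected accurately, this embodiment may also do the following in S202: determine the labeling accuracy of the at least one data to be processed according to its forgetting count; and use the at least one data to be processed together with its labeling accuracy to generate a statistical chart, such as a histogram.
When executing S202, this embodiment may determine the labeling accuracy of the at least one data to be processed from its forgetting count, according to a correspondence between forgetting counts and labeling accuracy.
For example, in this embodiment, data to be processed with a forgetting count of -1 has a labeling accuracy between 0 and 0.2; data with a forgetting count below 2 has a labeling accuracy between 0.8 and 1; and data with a forgetting count of 2 or more has a labeling accuracy between 0.2 and 0.8. Alternatively, the data to be processed may be sorted by forgetting count from high to low and divided into 4 equal parts, with the labeling accuracy of the parts being 0 to 0.2, 0.2 to 0.4, 0.4 to 0.6, and 0.6 to 0.8 respectively.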
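The first mapping in the example can be written down directly; the sketch below also tallies items per accuracy range, which is the data a histogram would be drawn from. It assumes the -1 sentinel is checked before the "below 2" rule, since -1 would otherwise satisfy both.

```python
from collections import Counter


def accuracy_range(forgetting: int) -> tuple:
    """Map a forgetting count to the labeling-accuracy range given in the example."""
    if forgetting == -1:
        return (0.0, 0.2)  # never predicted correctly
    if forgetting < 2:
        return (0.8, 1.0)  # rarely or never forgotten
    return (0.2, 0.8)      # forgotten two or more times


def accuracy_histogram(counts: dict) -> Counter:
    """Tally how many items fall into each labeling-accuracy range."""
    return Counter(accuracy_range(value) for value in counts.values())


histogram = accuracy_histogram({"a": -1, "b": 0, "c": 1, "d": 4})
```

The resulting counter holds one item in the 0 to 0.2 range, two in the 0.8 to 1 range, and one in the 0.2 to 0.8 range; rendering it as a chart is left to whatever plotting tool is in use.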
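FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in FIG. 3, the data quality inspection method of this embodiment includes the following steps:
S301: acquire data to be quality inspected;
S302: perform quality inspection on the data to be quality inspected to obtain a quality inspection result.
When executing S301, this embodiment acquires the data to be quality inspected according to the data processing methods disclosed in the first and second embodiments of the present disclosure. Because the data to be quality inspected is screened automatically, the efficiency and accuracy of data quality inspection can be improved and its cost reduced.
When performing quality inspection on the acquired data in S302, this embodiment can send the data to be quality inspected to the input end and then take the result of the input end relabeling the sent data as the quality inspection result of the data to be quality inspected.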
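The relabeling round trip in S302 can be sketched with a callback standing in for the input end. The `relabel` callable, the result fields, and the sample corrections are all hypothetical.

```python
def quality_inspect(items: dict, relabel) -> dict:
    """Send each item to the input end (here: a callback) and collect the relabeled result."""
    results = {}
    for item_id, old_label in items.items():
        new_label = relabel(item_id, old_label)
        results[item_id] = {
            "old": old_label,
            "new": new_label,
            "changed": new_label != old_label,
        }
    return results


# Hypothetical reviewer: corrects one label and confirms the rest.
corrections = {"img-7": "dog"}
results = quality_inspect(
    {"img-7": "cat", "img-9": "bird"},
    relabel=lambda item_id, old: corrections.get(item_id, old),
)
```

Recording whether a label changed is an addition of this sketch; it makes it easy to see afterwards how many screened items were genuinely mislabeled.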
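FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. FIG. 4 shows the operation flow of the data processing system of this embodiment. The system includes an interactive display layer, a business layer, a service layer, a task scheduling layer, and a data layer. The interactive display layer acquires the quality inspection request input by the input end and displays the data to be quality inspected screened by the task scheduling layer; the business layer initiates a request to the service layer according to the quality inspection request acquired by the interactive display layer; the service layer acquires the data to be processed from the data layer according to the quality inspection request and detects the forgetting count of the data to be processed; the task scheduling layer obtains the forgetting counts detected by the service layer, determines the data to be quality inspected among the data to be processed, and sends the determined data to the interactive display layer.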
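FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. As shown in FIG. 5, the data processing device 500 of this embodiment includes:
a first acquisition unit 501, configured to acquire at least one data to be processed, the at least one data to be processed being labeled data;
a prediction unit 502, configured to use a target neural network model to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction;
a generation unit 503, configured to generate a comparison result sequence of the at least one data to be processed according to the labeling result of the at least one data to be processed and its prediction result at each prediction;
a processing unit 504, configured to determine, according to the comparison result sequence of the at least one data to be processed, the data to be quality inspected among the at least one data to be processed.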
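The data processing device of this embodiment may be located in a cloud server or in a terminal device.
The at least one data to be processed acquired by the first acquisition unit 501 is data such as images, text, or audio that has been labeled manually or automatically; that is, in addition to the raw data, it also contains the labeling result of the raw data. The labeling result of the data acquired by the first acquisition unit 501 may be a category recognition result, a target recognition result, a text recognition result, and so on.
When acquiring the at least one data to be processed, the first acquisition unit 501 may take at least one piece of data input by the input end as the at least one data to be processed; it may also, according to a quality inspection request sent by the input end, take at least one piece of data in a database that corresponds to the received request as the at least one data to be processed.
An optional implementation for the first acquisition unit 501 is: acquire a quality inspection request that is sent by the input end and contains data identification information; and take the at least one piece of data corresponding to the acquired data identification information as the at least one data to be processed.
In other words, the first acquisition unit 501 stores different data in the database in advance and, after acquiring the quality inspection request sent by the input end, takes the data in the database corresponding to the request as the data to be processed. The input end does not need to input the data itself, which simplifies its operation and improves the efficiency of obtaining the data to be quality inspected.
It can be understood that the at least one data to be processed acquired by the first acquisition unit 501 has the same data type.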
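After the first acquisition unit 501 acquires the at least one data to be processed, the prediction unit 502 uses the target neural network model to predict the at least one data to be processed a preset number of times, to obtain a prediction result of the at least one data to be processed at each prediction.
The prediction unit 502 first determines the target neural network model and then uses it to obtain the prediction results of the at least one data to be processed.
The prediction unit 502 may determine the target neural network model according to the acquired quality inspection request. That is, in addition to the data identification information, the request further contains model type information, and the neural network model corresponding to the model type information is taken as the target neural network model.
In addition, the data processing device 500 of this embodiment may further include a determination unit 505, configured to determine the target neural network model as follows: according to the labeling result of the at least one data to be processed, determine task information characterizing the training task of a neural network model; then take the neural network model corresponding to the determined task information as the target neural network model.
In other words, when the prediction unit 502 does not obtain model type information from the quality inspection request sent by the input end, the determination unit 505 can determine the target neural network model from the labeling results of the data to be processed, which further improves the intelligence and efficiency of obtaining the data to be quality inspected.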
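After determining the target neural network model, the prediction unit 502 runs it on the at least one data to be processed for the preset number of training rounds, thereby obtaining the prediction result of the at least one data to be processed at each prediction.
The preset number of times used by the prediction unit 502 may be a number set in advance; alternatively, according to a correspondence between task information and training times, the training times corresponding to the task information of the at least one data to be processed may be used as the preset number of times.
When the prediction unit 502 uses the target neural network model to predict the at least one data to be processed a preset number of times, it may adopt a distributed training approach: multiple nodes each use the target neural network model to predict different data to be processed, and each node saves the prediction results of its data while also recording the training count and the node serial number.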
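After the prediction unit 502 obtains the prediction result of the at least one data to be processed at each prediction, the generation unit 503 generates a comparison result sequence of the at least one data to be processed according to its labeling result and its prediction result at each prediction.
Specifically, an optional implementation for the generation unit 503 is: compare the labeling result of the at least one data to be processed with its prediction result at each prediction to obtain, for each prediction, a comparison result indicating whether the prediction is correct or wrong; then generate the comparison result sequence of the at least one data to be processed from these per-prediction comparison results.
In other words, the comparison result sequence generated by the generation unit 503 reflects how the target neural network model predicted the at least one data to be processed over the preset number of training rounds, so that the labeling quality of the data to be processed can be determined from the sequence.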
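After the generation unit 503 generates the comparison result sequence of the at least one data to be processed, the processing unit 504 determines the data to be quality inspected among the at least one data to be processed according to the comparison result sequence. The number of data to be quality inspected determined by the processing unit 504 may be one or more.
When determining the data to be quality inspected from the comparison result sequence, the processing unit 504 can count the prediction errors in the comparison result sequence and then take the data to be processed whose number of prediction errors exceeds a preset threshold as the data to be quality inspected.
In other words, the processing unit 504 determines the labeling quality of the data to be processed from the generated comparison result sequence and takes the data with poor labeling quality (prediction errors occurring multiple times) as the data to be quality inspected. This screens the data to be quality inspected out of the at least one data to be processed; the determined data is then returned to the input end so that the input end can confirm or relabel it.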
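When determining the data to be quality inspected according to the comparison result sequence of the at least one data to be processed, the processing unit 504 may also do the following: obtain the forgetting count of the at least one data to be processed according to its comparison result sequence; and determine the data to be quality inspected among the at least one data to be processed according to the forgetting count.
In other words, the processing unit 504 obtains the forgetting count of the data to be processed from its comparison result sequence and then uses the forgetting count to determine the data to be quality inspected. Deriving the forgetting count from the comparison result sequence can improve the accuracy of the determined data to be quality inspected.
Specifically, an optional implementation for obtaining the forgetting count is: count the number of times a preset order of comparison results appears in the comparison result sequence of the at least one data to be processed, and take the counted number as the forgetting count of the at least one data to be processed.
When the labeling quality of a data to be processed is poor, every prediction result the target neural network model produces for it may be wrong, so the preset order of comparison results may never appear and no forgetting count can be obtained from it.
To ensure that a forgetting count can always be obtained and to improve its accuracy, an optional implementation for the processing unit 504 is: when it is determined that the comparison result sequence of the at least one data to be processed contains no comparison result indicating a correct prediction, mark the forgetting count of the at least one data to be processed as a preset forgetting count. The preset forgetting count in this embodiment may be -1.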
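When determining the data to be quality inspected from the obtained forgetting counts, the processing unit 504 can sort the at least one data to be processed by forgetting count from high to low and take the top N data to be processed as the data to be quality inspected, where N is a positive integer greater than or equal to 1.
In practice, several data to be processed may have the same forgetting count while differing in importance, so determining the data to be quality inspected directly from the forgetting counts may miss some of the more important data to be processed.
An optional implementation for the processing unit 504 is therefore: for each data to be processed, obtain the output result of the target neural network model at its last prediction of that data; then determine the data to be quality inspected among the at least one data to be processed according to the output results and the forgetting counts.
When determining the data to be quality inspected from the output results and the forgetting counts, an optional implementation for the processing unit 504 is: obtain, from the output result and the forgetting count, a probability score that the data to be processed is labeled incorrectly; sort the at least one data to be processed by probability score from high to low; and take the top M data to be processed as the data to be quality inspected, where M is a positive integer greater than or equal to 1.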
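To give the input end more intuitive feedback on how the data to be processed is labeled, and to help the input end select the data to be quality inspected accurately, the processing unit 504 may also do the following: determine the labeling accuracy of the at least one data to be processed according to its forgetting count; and use the at least one data to be processed together with its labeling accuracy to generate a statistical chart, such as a histogram.
The processing unit 504 may determine the labeling accuracy of the at least one data to be processed from its forgetting count, according to a correspondence between forgetting counts and labeling accuracy.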
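FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in FIG. 6, the data quality inspection device 600 of this embodiment includes:
a second acquisition unit 602, configured to acquire data to be quality inspected;
a quality inspection unit 603, configured to perform quality inspection on the data to be quality inspected to obtain a quality inspection result.
The second acquisition unit 602 acquires the data to be quality inspected by means of the data processing device 500 of the fifth embodiment of the present disclosure. Because the data to be quality inspected is screened automatically, the efficiency and accuracy of data quality inspection can be improved and its cost reduced.
When performing quality inspection on the acquired data, the quality inspection unit 603 can send the data to be quality inspected to the input end and then take the result of the input end relabeling the sent data as the quality inspection result of the data to be quality inspected.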
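In the technical solutions of the present disclosure, the acquisition, storage, and application of any user personal information involved comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 7 is a block diagram of an electronic device for the data processing or data quality inspection method according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.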
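As shown in FIG. 7, the device 700 includes a computing unit 701 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random-access memory (RAM) 703. The RAM 703 can also store various programs and data necessary for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Multiple components of the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.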
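The computing unit 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller. The computing unit 701 executes the methods and processing described above, such as the data processing or data quality inspection method. For example, in some embodiments, the data processing or data quality inspection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708.
In some embodiments, part or all of the computer program may be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the data processing or data quality inspection method described above can be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other appropriate way (for example, by means of firmware) to execute the data processing or data quality inspection method.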
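Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.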
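In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).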
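The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server can also be a server of a distributed system, or a server combined with a blockchain.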
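It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.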
Claims (21)
- 一种数据处理方法,包括:获取至少一个待处理数据,所述至少一个待处理数据为经过标注的待处理数据;使用目标神经网络模型对所述至少一个待处理数据进行预设次数的预测,得到所述至少一个待处理数据在每次预测时的预测结果;根据所述至少一个待处理数据的标注结果与所述至少一个待处理数据在每次预测时的预测结果,生成所述至少一个待处理数据的比对结果序列;根据所述至少一个待处理数据的比对结果序列,确定所述至少一个待处理数据中的待质检数据。
- 根据权利要求1所述的方法,其中,所述获取至少一个待处理数据包括:获取质检请求,所述质检请求中包含数据标识信息;将与所述数据标识信息对应的至少一个数据,作为所述至少一个待处理数据。
- 根据权利要求1所述的方法,还包括:通过以下方式确定所述目标神经网络:根据所述至少一个待处理数据的标注结果,确定用于表征神经网络模型的训练任务的任务信息;将与所述任务信息对应的神经网络模型,作为所述目标神经网络模型。
- 根据权利要求1所述的方法,其中,所述根据所述至少一个待处理数据的标注结果与所述至少一个待处理数据在每次预测时的预测结果,生成所述至少一个待处理数据的比对结果序列包括:将所述至少一个待处理数据的标注结果分别与所述至少一个待处理数据在每次预测时的预测结果进行比对,得到所述至少一个待处理数据在每次预测时的用于表征预测正确或者预测错误的比对结果;根据所述至少一个待处理数据在每次预测时的用于表征预测正确或者预测错误的比对结果,生成所述至少一个待处理数据的比对结果序列。
- 根据权利要求1-4中任一项所述的方法,其中,所述根据所述至 少一个待处理数据的比对结果序列,确定所述至少一个待处理数据中的待质检数据包括:根据所述至少一个待处理数据的比对结果序列,得到所述至少一个待处理数据的遗忘次数;根据所述至少一个待处理数据的遗忘次数,确定所述至少一个待处理数据中的待质检数据。
- 根据权利要求5所述的方法,其中,所述根据所述至少一个待处理数据的比对结果序列,得到所述至少一个待处理数据的遗忘次数包括:统计所述至少一个待处理数据的比对结果序列中,出现预设的比对结果顺序的次数;将统计得到的次数,作为所述至少一个待处理数据的遗忘次数。
- 根据权利要求5所述的方法,其中,所述根据所述至少一个待处理数据的比对结果序列,得到所述至少一个待处理数据的遗忘次数包括:在确定所述至少一个待处理数据的比对结果序列中不存在用于表征预测正确的比对结果的情况下,将所述至少一个待处理数据的遗忘次数标记为预设遗忘次数。
- 根据权利要求5所述的方法,其中,所述根据所述至少一个待处理数据的遗忘次数,确定所述至少一个待处理数据中的待质检数据包括:针对每个待处理数据,获取所述目标神经网络模型在最后一次预测该待处理数据时的输出结果;根据所述至少一个待处理数据的输出结果与所述遗忘次数,确定所述至少一个待处理数据中的待质检数据。
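- A data quality inspection method, comprising: acquiring data to be quality-inspected by the method according to any one of claims 1-8; and performing quality inspection on the data to be quality-inspected to obtain a quality inspection result.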
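- A data processing apparatus, comprising: a first acquiring unit configured to acquire at least one piece of data to be processed, the at least one piece of data to be processed being labeled data; a prediction unit configured to use a target neural network model to perform a preset number of predictions on the at least one piece of data to be processed, to obtain a prediction result of the at least one piece of data to be processed at each prediction; a generating unit configured to generate a comparison result sequence of the at least one piece of data to be processed according to a labeling result of the at least one piece of data to be processed and the prediction result of the at least one piece of data to be processed at each prediction; and a processing unit configured to determine, according to the comparison result sequence of the at least one piece of data to be processed, data to be quality-inspected among the at least one piece of data to be processed.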
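- The apparatus according to claim 10, wherein the first acquiring unit, when acquiring the at least one piece of data to be processed, performs: acquiring a quality inspection request, the quality inspection request containing data identification information; and using at least one piece of data corresponding to the data identification information as the at least one piece of data to be processed.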
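- The apparatus according to claim 10, further comprising a determining unit configured to determine the target neural network model by: determining, according to the labeling result of the at least one piece of data to be processed, task information characterizing a training task of a neural network model; and using the neural network model corresponding to the task information as the target neural network model.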
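- The apparatus according to claim 10, wherein the generating unit, when generating the comparison result sequence of the at least one piece of data to be processed according to the labeling result of the at least one piece of data to be processed and the prediction result of the at least one piece of data to be processed at each prediction, performs: comparing the labeling result of the at least one piece of data to be processed with the prediction result of the at least one piece of data to be processed at each prediction, to obtain, for each prediction, a comparison result of the at least one piece of data to be processed indicating a correct prediction or an incorrect prediction; and generating the comparison result sequence of the at least one piece of data to be processed according to the comparison results indicating a correct prediction or an incorrect prediction at each prediction.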
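- The apparatus according to any one of claims 10-13, wherein the processing unit, when determining, according to the comparison result sequence of the at least one piece of data to be processed, the data to be quality-inspected among the at least one piece of data to be processed, performs: obtaining a forgetting count of the at least one piece of data to be processed according to the comparison result sequence of the at least one piece of data to be processed; and determining, according to the forgetting count of the at least one piece of data to be processed, the data to be quality-inspected among the at least one piece of data to be processed.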
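- The apparatus according to claim 14, wherein the processing unit, when obtaining the forgetting count of the at least one piece of data to be processed according to the comparison result sequence of the at least one piece of data to be processed, performs: counting the number of times a preset order of comparison results occurs in the comparison result sequence of the at least one piece of data to be processed; and using the counted number as the forgetting count of the at least one piece of data to be processed.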
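- The apparatus according to claim 14, wherein the processing unit, when obtaining the forgetting count of the at least one piece of data to be processed according to the comparison result sequence of the at least one piece of data to be processed, performs: in a case where it is determined that the comparison result sequence of the at least one piece of data to be processed contains no comparison result indicating a correct prediction, marking the forgetting count of the at least one piece of data to be processed as a preset forgetting count.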
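- The apparatus according to claim 14, wherein the processing unit, when determining, according to the forgetting count of the at least one piece of data to be processed, the data to be quality-inspected among the at least one piece of data to be processed, performs: for each piece of data to be processed, acquiring an output result of the target neural network model at the last prediction of that piece of data to be processed; and determining, according to the output result and the forgetting count of the at least one piece of data to be processed, the data to be quality-inspected among the at least one piece of data to be processed.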
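- A data quality inspection apparatus, comprising: a second acquiring unit configured to acquire data to be quality-inspected by means of the apparatus according to any one of claims 10-17; and a quality inspection unit configured to perform quality inspection on the data to be quality-inspected to obtain a quality inspection result.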
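- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-9.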
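- A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to perform the method according to any one of claims 1-9.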
- A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
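The selection logic recited in the method claims (comparison result sequence, forgetting count, preset count for never-learned samples, and use of the last output) can be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the claims do not fix the "preset order of comparison results", so a correct prediction followed by an incorrect one is assumed here; the preset forgetting count `UNLEARNED = -1`, the `min_forgets` threshold, the treatment of the "output result" as the final predicted label, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical preset forgetting count for samples that were never predicted correctly.
UNLEARNED = -1


def comparison_sequence(label, predictions: Sequence) -> List[int]:
    """1 where the prediction matches the labeling result, 0 otherwise."""
    return [1 if p == label else 0 for p in predictions]


def forgetting_count(seq: Sequence[int], unlearned: int = UNLEARNED) -> int:
    """Count correct-then-incorrect transitions (assumed 'preset order')."""
    if 1 not in seq:
        # No correct prediction at all: mark with the preset forgetting count.
        return unlearned
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev == 1 and cur == 0)


def select_for_inspection(
    records: Dict[str, Tuple[object, Sequence]],
    min_forgets: int = 2,
) -> List[str]:
    """Return ids of samples whose labels look suspicious.

    A sample is flagged if it was never learned, was forgotten at least
    `min_forgets` times, or the final prediction disagrees with its label
    (the last check is an assumed reading of 'output result').
    """
    flagged = []
    for sample_id, (label, predictions) in records.items():
        seq = comparison_sequence(label, predictions)
        count = forgetting_count(seq)
        if count == UNLEARNED or count >= min_forgets or seq[-1] == 0:
            flagged.append(sample_id)
    return flagged


# Hypothetical data: id -> (labeling result, prediction at each of 4 rounds).
records = {
    "a": ("cat", ["cat", "cat", "cat", "cat"]),  # always correct
    "b": ("cat", ["dog", "cat", "dog", "cat"]),  # forgotten once, ends correct
    "c": ("dog", ["cat", "cat", "cat", "cat"]),  # never learned
    "d": ("dog", ["dog", "cat", "dog", "cat"]),  # forgotten twice
    "e": ("cat", ["cat", "cat", "cat", "dog"]),  # final prediction wrong
}
flagged = select_for_inspection(records)
```

With these assumptions, samples `c`, `d`, and `e` are selected as data to be quality-inspected, while `a` and `b` are not. A real system would likely tune the threshold and could use the model's confidence rather than the predicted label as the final output.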
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111197122.0A CN114116688B (zh) | 2021-10-14 | 2021-10-14 | Data processing and data quality inspection method and apparatus, and readable storage medium |
CN202111197122.0 | 2021-10-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023060954A1 (zh) | 2023-04-20 |
Family
ID=80376115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/105122 WO2023060954A1 (zh) | Data processing and data quality inspection method and apparatus, and readable storage medium | 2021-10-14 | 2022-07-12 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114116688B (zh) |
WO (1) | WO2023060954A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114116688B (zh) * | 2021-10-14 | 2024-05-28 | 北京百度网讯科技有限公司 | Data processing and data quality inspection method and apparatus, and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009027404A (ja) * | 2007-07-19 | 2009-02-05 | Fuji Xerox Co Ltd | Job management device and program |
CN110009090A (zh) * | 2019-04-02 | 2019-07-12 | 北京市商汤科技开发有限公司 | Neural network training and image processing method and apparatus |
US20190354836A1 (en) * | 2018-05-17 | 2019-11-21 | International Business Machines Corporation | Dynamic discovery of dependencies among time series data using neural networks |
CN113222149A (zh) * | 2021-05-31 | 2021-08-06 | 联仁健康医疗大数据科技股份有限公司 | Model training method, apparatus, device, and storage medium |
CN113343695A (zh) * | 2021-05-27 | 2021-09-03 | 镁佳(北京)科技有限公司 | Text labeling noise detection method and apparatus, storage medium, and electronic device |
CN114116688A (zh) * | 2021-10-14 | 2022-03-01 | 北京百度网讯科技有限公司 | Data processing and data quality inspection method and apparatus, and readable storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
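CN110738264A (zh) * | 2019-10-18 | 2020-01-31 | 上海眼控科技股份有限公司 | Abnormal sample screening, cleaning, and training method, apparatus, device, and storage medium |
CN111046959A (zh) * | 2019-12-12 | 2020-04-21 | 上海眼控科技股份有限公司 | Model training method, apparatus, device, and storage medium |
CN111325260B (zh) * | 2020-02-14 | 2023-10-27 | 北京百度网讯科技有限公司 | Data processing method and apparatus, electronic device, and computer-readable medium |
CN113010571B (zh) * | 2021-03-12 | 2024-08-06 | 北京百度网讯科技有限公司 | Data detection method, apparatus, electronic device, storage medium, and program product |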
2021
- 2021-10-14 CN CN202111197122.0A patent/CN114116688B/zh active Active
2022
- 2022-07-12 WO PCT/CN2022/105122 patent/WO2023060954A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN114116688B (zh) | 2024-05-28 |
CN114116688A (zh) | 2022-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP4006909B1 (en) | Method, apparatus and device for quality control and storage medium | |
US10067983B2 (en) | Analyzing tickets using discourse cues in communication logs | |
WO2020082673A1 (zh) | Invoice verification method, apparatus, computing device, and storage medium | |
US12118770B2 (en) | Image recognition method and apparatus, electronic device and readable storage medium | |
WO2023240878A1 (zh) | Resource identification method, apparatus, device, and storage medium | |
EP4216079A1 (en) | Product recognition method, model training method, device and electronic device | |
US11301355B2 (en) | Method, electronic device, and computer program product for analyzing log file | |
US11488579B2 (en) | Evaluating language models using negative data | |
US20240221404A1 (en) | Method of training text quality assessment model and method of determining text quality | |
US20220350690A1 (en) | Training method and apparatus for fault recognition model, fault recognition method and apparatus, and electronic device | |
US8938405B2 (en) | Classifying activity using probabilistic models | |
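WO2023236405A1 (zh) | Training method for end-to-end sensitive text recall model, and sensitive text recall method | |
WO2023060954A1 (zh) | Data processing and data quality inspection method and apparatus, and readable storage medium | |
CN114692778A (zh) | Multimodal sample set generation method, training method, and apparatus for intelligent inspection | |
CN112699671B (zh) | Language labeling method, apparatus, computer device, and storage medium | |
CN113076939B (zh) | Contextualized character recognition system | |
CN111738290B (zh) | Image detection method, model construction and training method, apparatus, device, and medium | |
CN117668192A (zh) | Data processing method, apparatus, device, and storage medium | |
CN110826616B (zh) | Information processing method and apparatus, electronic device, and storage medium | |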
US20230342561A1 (en) | Machine translation method and apparatus, device and storage medium | |
US20220327450A1 (en) | Method for increasing or decreasing number of workers and inspectors in crowdsourcing-based project for creating artificial intelligence learning data | |
US11886320B2 (en) | Diagnosing application problems by learning from fault injections | |
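CN114443493A (zh) | Test case generation method, apparatus, electronic device, and storage medium | |
CN111240652A (zh) | Data processing method and apparatus, computer storage medium, and electronic device | |
WO2021017288A1 (zh) | Method and apparatus for identifying duplicate system errors, electronic device, and computer-readable storage medium |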
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22879901; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: PCT application non-entry in European phase | Ref document number: 22879901; Country of ref document: EP; Kind code of ref document: A1 |