CN114548192A - Sample data processing method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN114548192A
Authority
CN
China
Prior art keywords
processed
sample data
sample
labeling
processing result
Prior art date
Legal status
Pending
Application number
CN202011324242.8A
Other languages
Chinese (zh)
Inventor
赵一欣
李雨朋
金砺耀
Current Assignee
Qianxun Spatial Intelligence Inc
Original Assignee
Qianxun Spatial Intelligence Inc
Priority date
Filing date
Publication date
Application filed by Qianxun Spatial Intelligence Inc filed Critical Qianxun Spatial Intelligence Inc
Priority to CN202011324242.8A
Publication of CN114548192A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present disclosure provide a sample data processing method, a sample data processing apparatus, an electronic device, and a computer-readable medium, relating to the technical field of big data processing. The sample data processing method includes the following steps. Step S1: training a screening model using a training data set containing annotations. Step S2: acquiring sample data to be processed and performing a primary labeling on the sample data to be processed. Step S3: outputting, through the screening model, a processing result for the sample data to be processed, the processing result including a confidence level, and extracting available samples from the samples to be processed according to the processing result, where the confidence level output by the screening model for an available sample is smaller than a first threshold. Step S4: performing a secondary labeling on the available samples using the primary labeling, and expanding the available samples containing the secondary labeling into the training data set. The technical solution of the embodiments of the present disclosure can improve the efficiency and accuracy of sample labeling.

Description

Sample data processing method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of big data processing technologies, and in particular, to a sample data processing method, a sample data processing apparatus, an electronic device, and a computer-readable medium.
Background
In recent years, machine learning has developed rapidly and has gradually become a core technique in many fields, such as image recognition, natural language processing, and automatic driving. Machine learning requires sample data to be obtained and then labeled, so that a mapping relationship between input and output can be established and a model can be trained; the labeling of sample data is therefore very important.
At present, sample data is labeled mainly by hand. Although manual labeling is accurate, it is inefficient and costly, so improving the efficiency of sample labeling has become a focus of research.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a sample data processing method, a sample data processing apparatus, an electronic device, and a computer-readable medium that can automatically perform a primary labeling of sample data, select higher-quality available samples based on the primary labeling, and then perform a secondary labeling to obtain labeled samples of higher accuracy, thereby improving both the efficiency and the accuracy of sample labeling.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the embodiments of the present disclosure, a sample data processing method is provided, including: step S1: training by using a training data set containing labels to obtain a screening model; step S2: acquiring sample data to be processed, and carrying out primary labeling on the sample data to be processed; step S3: outputting a processing result of the sample data to be processed through the screening model, wherein the processing result comprises a confidence level, extracting an available sample from the sample to be processed according to the processing result, and the confidence level of the screening model for the output of the available sample is smaller than a first threshold value; step S4: and carrying out secondary labeling on the available samples by utilizing the primary labeling, and expanding the available samples containing the secondary labeling into the training data set.
In an exemplary embodiment of the present disclosure, the method further comprises: step S5: updating the screening model through the expanded training data set, and updating the sample data to be processed; step S6: and circularly executing the steps S3-S5 through the updated screening model and the updated sample data to be processed.
In an exemplary embodiment of the present disclosure, when the evaluation index of the screening model satisfies a preset condition, the execution of the loop is stopped.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a position of a target object in the sample data to be processed, and when the processing result includes the position of the target object in the sample data to be processed, the extracting, according to the processing result, an available sample from the samples to be processed includes: extracting, as the available sample, sample data to be processed for which the deviation between the position in the processing result output by the screening model and the position of the primary labeling is greater than a second threshold.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a contour of a target object in the sample data to be processed, and when the processing result includes the contour of the target object in the sample data to be processed, the extracting, according to the processing result, an available sample from the samples to be processed includes: extracting, as the available sample, sample data to be processed for which the deviation between the contour in the processing result output by the screening model and the contour of the primary labeling is greater than a third threshold.
In an exemplary embodiment of the present disclosure, the processing result includes a category of a target object in the sample data to be processed; the secondarily labeling the available samples and expanding the available samples containing the secondary labels to the training data set comprise:
obtaining a template corresponding to the target object according to the category of the target object; carrying out secondary labeling on the available samples to obtain secondary labeling results; and calculating a similarity evaluation index of the template and the secondary labeling result, and expanding the training data set by using available samples of which the similarity evaluation index is greater than a fourth threshold value.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a category of a target object in the sample data to be processed, and extracting an available sample from the samples to be processed according to the processing result includes: when the category of the target object in the processing result is inconsistent with the category in the primary labeling, taking the corresponding sample to be processed as the available sample.
According to a second aspect of the embodiments of the present disclosure, a sample data processing apparatus is provided, which may include a model training module, a sample labeling module, a data screening module, and a sample data determining module.
The model training module is used for training by using a training data set containing labels to obtain a screening model; the sample marking module is used for acquiring sample data to be processed and marking the sample data to be processed for the first time; the data screening module is used for outputting a processing result of the sample data to be processed through the screening model, wherein the processing result comprises a confidence coefficient, and an available sample in the sample to be processed is extracted according to the processing result, wherein the confidence coefficient output by the screening model on the available sample is smaller than a first threshold value; and the sample data determining module is used for carrying out secondary labeling on the available samples by utilizing the primary labeling, and expanding the available samples containing the secondary labeling into the training data set.
In an exemplary embodiment of the present disclosure, the sample data processing apparatus further includes a data updating module and a loop module.
And the data updating module is used for updating the screening model through the expanded training data set and updating the sample data to be processed.
And the circulating module is used for circulating the model training module, the sample labeling module, the data screening module, the sample data determining module and the data updating module through the updated screening model and the updated sample data to be processed.
In an exemplary embodiment of the present disclosure, the loop module may be configured to: and when the evaluation index of the screening model meets a preset condition, stopping executing the circulation.
In an exemplary embodiment of the present disclosure, the processing result further includes a position or a contour of a target object in the sample data to be processed.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a position of a target object in the sample data to be processed, and when the processing result includes the position of the target object in the sample data to be processed, the data filtering module is configured to: and extracting sample data to be processed, of which the deviation between the position of the processing result output by the screening model and the position marked for the first time is greater than a second threshold value, as the available sample.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a contour of a target object in the sample data to be processed, and when the processing result includes the contour of the target object in the sample data to be processed, the data filtering module is configured to: and extracting the sample data to be processed, of which the deviation between the contour of the processing result output by the screening model and the contour marked for the first time is larger than a third threshold value, as the available sample.
In an exemplary embodiment of the present disclosure, the processing result includes a category of a target object in the sample data to be processed; the sample data determination module may be to: obtaining a template corresponding to the target object according to the category of the target object; carrying out secondary labeling on the available samples to obtain a secondary labeling result; and calculating a similarity evaluation index of the template and the secondary labeling result, and expanding the training data set by using available samples of which the similarity evaluation index is greater than a fourth threshold value.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a category of a target object in the sample data to be processed, and the data filtering module is configured to: and when the category of the target object in the processing result is inconsistent with the category of the primary labeling target object, taking the corresponding sample to be processed as the available sample.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a sample data processing method as described in the first aspect of the embodiments above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the sample data processing method as described in the first aspect of the embodiments above.
According to the sample data processing method, the sample data processing apparatus, the electronic device, and the computer-readable medium described above, the sample data to be processed is labeled a first time, available samples meeting the requirements are selected from the sample data to be processed using the screening model, and the available samples are then labeled a second time to expand the training data set. Labeled samples are thus obtained without manual labeling, which improves labeling efficiency; at the same time the samples are screened to obtain high-quality sample data meeting the requirements, which improves the training efficiency and accuracy of the machine learning model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 schematically shows an exemplary system architecture diagram of a sample data processing method or a sample data processing apparatus applied to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a sample data processing method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow diagram of a sample data processing method according to another embodiment of the present disclosure;
fig. 4 schematically shows a block diagram of a sample data processing apparatus according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In this specification, the terms "a", "an", "the", "said" and "at least one" are used to indicate the presence of one or more elements/components/etc.; the terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first," "second," "third," and the like are used merely as labels, and are not limiting as to the number of their objects.
The following detailed description of exemplary embodiments of the disclosure refers to the accompanying drawings.
Fig. 1 is a schematic diagram showing a system architecture of an exemplary application environment to which a sample data processing method or a sample data processing apparatus according to an embodiment of the present disclosure can be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to desktop computers, portable computers, smart phones and tablets, wearable devices, virtual reality devices, smart homes, and the like.
The server 105 may be a server that provides various services, such as a background management server that provides support for devices operated by users using the terminal apparatuses 101, 102, 103. The background management server can analyze and process the received data such as the request and feed back the processing result to the terminal equipment.
For example, the server 105 may train a screening model using a training data set that includes annotations; acquire sample data to be processed and perform a primary labeling on it; output, through the screening model, a processing result for the samples to be processed and screen the samples according to the processing result to obtain the available samples; and perform a secondary labeling on the available samples using the primary labeling and expand the available samples containing the secondary labeling into the training data set.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, and the like.
The sample data processing method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the sample data processing apparatus is generally disposed in the server 105. However, it is easily understood by those skilled in the art that the sample data processing method provided in the present disclosure may also be executed by the terminal devices 101, 102, and 103, and accordingly, the sample data processing apparatus may also be disposed in the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment.
Based on this, the embodiment of the present disclosure provides a technical solution of a sample data processing method, which can automatically label sample data, screen labels by using a screening model, obtain available samples meeting requirements, and perform secondary labeling on the available samples, so as to improve the accuracy of labeling, thereby improving the efficiency and accuracy of model training.
As shown in fig. 2, the sample data processing method provided in the embodiment of the present disclosure may include the following steps:
step S1: and training by using the training data set containing the labels to obtain the screening model.
Step S2: obtaining sample data to be processed, and carrying out primary labeling on the sample data to be processed.
Step S3: and outputting a processing result of the sample data to be processed through the screening model, wherein the processing result comprises a confidence coefficient, and extracting an available sample in the sample to be processed according to the processing result, wherein the confidence coefficient output by the screening model on the available sample is smaller than a first threshold value.
Step S4: and carrying out secondary labeling on the available samples, and expanding the available samples containing the secondary labeling into the training data set.
Specific embodiments of the respective steps in this embodiment are described in detail below.
In step S1, a screening model is obtained by training using a training data set containing labels.
The training data set may contain a variety of categories of data, such as images, text, audio, etc., for example. The training dataset may include hundreds of thousands, millions, and the like of images, and each image may include a target object and an annotation of the target object. The target object may include a road sign, such as a traffic light, a directional arrow of a road surface drawing, or the like; but may also include buildings, vehicles, roads, etc.; the label may be, for example, a category label of the target object, a position label of the target object in the image, or an outline label of the target object, which is not limited in this embodiment.
The screening model can be obtained by training a machine learning algorithm with the training data set containing labels. The screening model can be used to recognize images, to recognize the position of a specific target in an image, or to recognize the contour of a specific target in an image; it may also be used for other recognition tasks, e.g., recognizing text in an image, depending on the actual training objective. The screening model may be trained using clustering, random forests, convolutional neural networks, and so on, which is not limited in this embodiment. For example, images containing road signs are collected as the training data set, with the type of road sign as the label; a clustering model is trained using this training data set, and the screening model is obtained once training is complete.
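As a purely illustrative sketch (not part of the patent disclosure), training a screening model on an annotated data set could look like the following; the use of a random forest, the feature representation, and the function name are assumptions made only for illustration.

```python
# Illustrative sketch only: train a screening model on a labeled training data set.
# The feature representation and choice of classifier are assumptions, not details
# taken from the patent.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_screening_model(features: np.ndarray, labels: np.ndarray):
    """Fit a screening model on the labeled training data set.

    features: (n_samples, n_features) array extracted from e.g. road-sign images.
    labels:   (n_samples,) array of category annotations.
    """
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(features, labels)
    return model
```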
In step S2, sample data to be processed is acquired, and the sample data to be processed is primarily labeled.
The sample data to be processed refers to data that does not yet contain an annotation, such as landmark images that have not been labeled. For the primary labeling, for example, the map element corresponding to the target object may first be obtained from a high-precision map, and the obtained map element may then be used to reversely identify the target object in the sample data to be processed, so that the target object is labeled and the automatic primary labeling is completed. After this labeling, samples to be processed that contain the primary labels are obtained.
Next, in step S3, the processing result of the sample data to be processed is output through the screening model, and the available samples in the sample data to be processed are then extracted according to the processing result.
In this embodiment, the sample data to be processed is used as the input of the screening model, and the samples to be processed are recognized by the screening model to obtain the processing result of the sample data to be processed. The processing result includes the confidence level output for each sample to be processed. The confidence level can be understood as the probability that the screening model has recognized the sample data correctly; the higher the confidence level, the more reliable the processing result of the screening model for that sample. After the processing result is obtained, the samples to be processed can be screened using the confidence level contained in the processing result, and the sample data whose confidence level is smaller than the first threshold is extracted as available samples. For example, if the processing result output by the screening model is the category of the target object in the sample data to be processed, then the confidence level in the processing result is the probability that the target object belongs to that specific category, and the higher the confidence level, the more accurate the processing result. For a given sample to be processed, a higher confidence level from the screening model indicates that the model already captures the mapping relationship for that sample well, so the sample has little value for further optimizing the model; a lower confidence level indicates that the sample may differ significantly from the data in the model's training data set, so it has higher optimization value and can be used as an available sample. The first threshold may be 0.5, 0.4, 0.3, 0.2, 0.35, or the like, which is not limited in this embodiment.
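A minimal sketch of this confidence-based screening is shown below, assuming a scikit-learn-style classifier with a `predict_proba` method; the threshold value and function names are illustrative assumptions, not specifics from the patent.

```python
# Illustrative sketch: extract "available samples" whose confidence output by the
# screening model falls below the first threshold. The threshold value and the use
# of predict_proba are assumptions made for illustration.
import numpy as np

FIRST_THRESHOLD = 0.5  # example value from the description (0.2 to 0.5)

def extract_available_samples(model, candidate_features: np.ndarray):
    probabilities = model.predict_proba(candidate_features)  # per-class probabilities
    confidence = probabilities.max(axis=1)                   # confidence of the predicted class
    available_mask = confidence < FIRST_THRESHOLD            # low confidence = high optimization value
    return np.where(available_mask)[0], confidence
```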
In an exemplary embodiment, the processing result output by the screening model may also include a position or a contour of the target object in the sample data to be processed. Correspondingly, the primary labeling of the sample data to be processed may include: marking the position of a target object in sample data to be processed; or labeling the contour of the target object in the sample data to be processed.
When the primary labeling labels the position of the target object in the sample data to be processed, the processing result output by the screening model may include the position of the target object. In that case, extracting the available samples from the samples to be processed may include: extracting, as available samples, the sample data to be processed for which the deviation between the position of the target object in the processing result and the position in the primary labeling is greater than a second threshold. If the position recognized by the screening model deviates significantly from the position of the primary label, the sample can be regarded as data with optimization value for the screening model, so samples with a large deviation between the processing result and the primary labeling can be used as available samples. The second threshold may be 0.5, 0.4, or another value such as 0.6, 0.55, or 0.45, which is not limited in this embodiment.
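The patent does not fix a specific deviation metric for positions; the sketch below assumes bounding-box positions and uses 1 minus the intersection-over-union (IoU) as the deviation, purely for illustration.

```python
# Illustrative sketch: select samples whose predicted position deviates from the
# primary position annotation by more than the second threshold. Using 1 - IoU of
# bounding boxes as the deviation measure is an assumption, not a patent detail.
SECOND_THRESHOLD = 0.5  # example value from the description

def box_iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def position_deviation_exceeds(predicted_box, labeled_box, threshold=SECOND_THRESHOLD):
    deviation = 1.0 - box_iou(predicted_box, labeled_box)
    return deviation > threshold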
When the initial labeling is performed to label the contour of the target object in the sample data to be processed, the processing result may include the contour output by the screening model on the target object in the sample data to be processed. When the processing result includes the contour of the target object, the manner of extracting the available sample in the sample to be processed may include: and extracting sample data of which the deviation between the contour in the processing result of the sample data to be processed and the contour of the target object in the primary annotation is larger than a third threshold value as an available sample. The third threshold may include 0.4, 0.3, etc., and may also be determined according to actual requirements, for example, 0.45, 0.55, etc. It is to be understood that the second threshold and the third threshold may be the same, for example, both 0.5; it may also be different, for example, the second threshold value is 0.55, the third threshold value is 0.5, and so on; this embodiment is not limited to this.
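For the contour case, the deviation can likewise be measured in several ways; the sketch below assumes binary masks and uses 1 minus the Dice coefficient as the deviation, which is an illustrative assumption rather than a metric specified in the patent.

```python
# Illustrative sketch: contour deviation between the screening model's output and the
# primary annotation, expressed on binary masks. Using 1 - Dice coefficient is an
# assumption made for illustration only.
import numpy as np

THIRD_THRESHOLD = 0.4  # example value from the description

def contour_deviation(pred_mask: np.ndarray, labeled_mask: np.ndarray) -> float:
    pred = pred_mask.astype(bool)
    ref = labeled_mask.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    dice = 2.0 * intersection / (pred.sum() + ref.sum() + 1e-9)
    return 1.0 - dice

def is_available_by_contour(pred_mask, labeled_mask, threshold=THIRD_THRESHOLD):
    return contour_deviation(pred_mask, labeled_mask) > threshold
```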
With continued reference to fig. 2, in step S4, the available samples are secondarily labeled with the primary labels, and the available samples containing the secondary labels are extended to the training data set.
The available samples are the sample data for which the screening model outputs a low confidence level, and can be understood as data that is under-represented in the screening model's training data set. Labeling these samples again and adding them to the training data set therefore improves the training data set, and a more accurate model can be obtained by training on the expanded data set.
The primary labeling can be used and refined when the available samples are labeled a second time, so that the labeling accuracy is improved. Take the case where the primary labeling marks the contour of the target object as an example: the contour obtained from the primary labeling is used as the initialization region, and curve evolution is performed with a level set segmentation model, so that the primarily labeled contour is optimized and the secondary labeling is obtained. When the label is a position, level set evolution can likewise be performed starting from the position in the primary labeling to obtain an accurate contour of the target object in the available sample, from which the position of the target object is then calculated.
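A minimal sketch of such a level-set refinement is given below. The patent only specifies "curve evolution with a level set segmentation model"; the use of scikit-image's morphological Chan-Vese variant and the iteration and smoothing settings are assumptions for illustration.

```python
# Illustrative sketch: refine the primary contour annotation by level-set evolution
# to obtain the secondary annotation. Library choice and parameter values are
# assumptions; they are not prescribed by the patent.
import numpy as np
from skimage.segmentation import morphological_chan_vese

def refine_contour(gray_image: np.ndarray, initial_mask: np.ndarray) -> np.ndarray:
    """gray_image: 2-D float image; initial_mask: binary mask from the primary labeling."""
    refined_mask = morphological_chan_vese(
        gray_image, 200, init_level_set=initial_mask.astype(np.int8), smoothing=2
    )
    return refined_mask.astype(bool)
```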
After the primary labeling is optimized to obtain the secondary labeling, the available samples containing the secondary labeling can be added into the training data set, so that more comprehensive sample data can be obtained.
In an exemplary embodiment, after the available samples are labeled for the second time, the available samples can be screened again, and the available samples meeting the conditions are extracted and expanded into the training data set. Because the primary labeling can label the category of the target object, based on this, the secondary labeling of the available samples to expand the training data set may include the following ways: firstly, a secondary labeling result obtained by secondary labeling can be obtained; obtaining a template corresponding to the type according to the type of the target object in the primary labeling; and then calculating similarity evaluation indexes of the template and the secondary labeling result, and extracting available samples with the similarity evaluation indexes larger than a fourth threshold value and expanding the available samples to a training data set.
The template may be preset according to each category of the target object, for example, if the category of the target object is a car, an image of the car may be obtained in advance as the template, and if the category of the target object is a truck, an image of the truck may be saved as the template, so as to obtain a template corresponding to each category. Similarity evaluation indexes between the template and the secondary labeling result can be calculated by using a cosine similarity calculation method, an Euclidean distance and other similarity calculation algorithms, and in addition, other algorithms can be adopted for calculating the similarity evaluation indexes, such as the Mahalanobis distance and the like, and the embodiment is not limited to the above.
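As a hedged sketch of the similarity check, the code below compares the secondary labeling result against the stored template for the labeled category using cosine similarity (one of the algorithms the description mentions); the flattening of masks into vectors and the threshold value are illustrative assumptions.

```python
# Illustrative sketch: keep an available sample only when the similarity evaluation
# index between its secondary labeling result and the category template exceeds the
# fourth threshold. The threshold value is an assumption; the patent gives none.
import numpy as np

FOURTH_THRESHOLD = 0.8  # assumed example value

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def keep_for_training(template: np.ndarray, secondary_result: np.ndarray,
                      threshold: float = FOURTH_THRESHOLD) -> bool:
    return cosine_similarity(template, secondary_result) > threshold
```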
In an exemplary embodiment, after performing the secondary labeling expansion on the available samples to the training data set, step S5 and step S6 may be further included, as shown in fig. 3. Wherein:
step S5: and updating the screening model through the expanded training data set, and updating the sample to be processed.
Step S6: and circularly executing the steps S1 to S5 through the updated screening model and the updated to-be-processed sample.
Since the newly added data in the expanded training data set are samples that have undergone secondary labeling, retraining the screening model (or other machine learning models) on the updated training data set improves both training efficiency and the accuracy of the trained model. At the same time, a new batch of data can be acquired as the sample data to be processed, thereby updating the samples to be processed; alternatively, the available samples can be removed from the current sample data to be processed to update it. Steps S1 to S5 are then executed in a loop with the updated sample data and the updated screening model, so that the screening model is continuously updated and available samples are continuously screened out of the updated sample data and expanded into the training data set; the data in the training data set thus becomes more and more comprehensive, and the effectiveness of the training data set improves. During machine learning, various models can be trained with this training data set according to actual requirements, for example image recognition models, character recognition models, road recognition models, and so on. Because the training data set contains sample data whose labeling quality has been raised by the secondary labeling, it provides a good sample basis for these machine learning models, which shortens the training period and improves training efficiency and model accuracy.
An evaluation index of the screening model can be calculated each time the model is updated, and the evaluation index determines whether the loop should be stopped. For example, the accuracy of the screening model is calculated and used to evaluate the model; if the accuracy exceeds a certain threshold, the screening model can be considered to satisfy the condition, the loop is exited, and the final training data set is obtained and saved. Once the screening model satisfies the preset condition, the data in the training data set can be regarded as meeting the labeling requirements, so data whose labeling precision meets the requirements is obtained, the time and cost of manual labeling are saved, and the accuracy of automatic labeling is improved.
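For orientation only, the following self-contained sketch runs the loop of steps S1 to S6 on synthetic feature data: train, compute confidences, extract low-confidence available samples, extend the training set, update the pool, and stop once an accuracy-based evaluation index meets the condition. The synthetic data, the random-forest choice, and the accuracy criterion are assumptions; the patent describes the loop at the level of labeling, screening, secondary labeling and data-set expansion, not this exact code.

```python
# Illustrative end-to-end sketch of the loop in steps S1-S6 on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 8)), rng.integers(0, 3, size=200)  # labeled training set
X_pool, y_pool = rng.normal(size=(1000, 8)), rng.integers(0, 3, size=1000)  # pool; y_pool stands in for the primary labeling
X_eval, y_eval = rng.normal(size=(300, 8)), rng.integers(0, 3, size=300)    # evaluation set

FIRST_THRESHOLD, TARGET_ACCURACY = 0.5, 0.9

for iteration in range(10):                                    # steps S3-S5 executed in a loop (step S6)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)  # S1 / S5
    confidence = model.predict_proba(X_pool).max(axis=1)        # S3: confidence in the processing result
    available = confidence < FIRST_THRESHOLD                    # S3: extract available samples
    # S4: "secondary labeling" of the available samples; here the primary labels are simply reused
    X_train = np.vstack([X_train, X_pool[available]])
    y_train = np.concatenate([y_train, y_pool[available]])
    X_pool, y_pool = X_pool[~available], y_pool[~available]     # S5: update the samples to be processed
    if accuracy_score(y_eval, model.predict(X_eval)) >= TARGET_ACCURACY or not len(X_pool):
        break                                                   # stop when the evaluation index meets the condition
```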
Further, the present embodiment also provides a sample data processing apparatus, which can be used to execute the sample data processing method of the present disclosure. Referring to fig. 4, the sample data processing apparatus 40 provided in the embodiment of the present disclosure may include: a model training module 41, a sample labeling module 42, a data screening module 43, and a sample data determination module 44.
The model training module 41 is configured to train to obtain a screening model by using a training data set including labels.
And the sample labeling module 42 is configured to obtain sample data to be processed, and label the sample data to be processed for the first time.
A data screening module 43, configured to output a processing result of the sample data to be processed through the screening model, where the processing result includes a confidence level, and extract an available sample in the sample to be processed according to the processing result, where the confidence level output by the screening model for the available sample is smaller than a first threshold.
And a sample data determining module 44, configured to perform secondary labeling on the available sample by using the primary labeling, and expand the available sample including the secondary labeling into the training data set.
In an exemplary embodiment of the present disclosure, the sample data processing apparatus further includes a data updating module 45 and a loop module 46.
The data updating module 45 is configured to update the screening model through the expanded training data set, and update the sample data to be processed.
And a circulation module 46, configured to circulate the model training module 41 to the data updating module 45 through the updated screening model and the updated to-be-processed sample data.
In an exemplary embodiment of the present disclosure, the loop module 46 may be configured to: and when the evaluation index of the screening model meets a preset condition, stopping executing the circulation.
In an exemplary embodiment of the present disclosure, the processing result further includes a position or a contour of a target object in the sample data to be processed.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a position of a target object in the sample data to be processed, and when the processing result includes the position of the target object in the sample data to be processed, the data filtering module 43 is configured to: and extracting sample data to be processed, of which the deviation between the position of the processing result output by the screening model and the position marked for the first time is greater than a second threshold value, as the available sample.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a contour of a target object in the sample data to be processed, and when the processing result includes the contour of the target object in the sample data to be processed, the data filtering module 43 is configured to: and extracting the sample data to be processed, of which the deviation between the contour of the processing result output by the screening model and the contour marked for the first time is larger than a third threshold value, as the available sample.
In an exemplary embodiment of the present disclosure, the processing result includes a category of a target object in the sample data to be processed; the sample data determination module 44 may be configured to: obtaining a template corresponding to the target object according to the category of the target object; carrying out secondary labeling on the available samples to obtain a secondary labeling result; and calculating a similarity evaluation index of the template and the secondary labeling result, and expanding the training data set by using available samples of which the similarity evaluation index is greater than a fourth threshold value.
In an exemplary embodiment of the present disclosure, the primary labeling includes labeling a category of a target object in the sample data to be processed, and the data filtering module is configured to: and when the category of the target object in the processing result is inconsistent with the category of the primary labeling target object, taking the corresponding sample to be processed as the available sample.
For details not disclosed in the apparatus embodiments of the present disclosure, please refer to the embodiments of the sample data processing method disclosed above.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing the electronic devices of embodiments of the present disclosure. The computer system 500 of the electronic device shown in fig. 5 is only an example, and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. Various programs and data necessary for system operation are also stored in the RAM 503. The CPU 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to implement the sample data processing method described in the above embodiments.
For example, the electronic device may implement the following as shown in fig. 2: step S1, training by using a training data set containing labels to obtain a screening model; step S2, obtaining sample data to be processed, and carrying out primary labeling on the sample data to be processed; step S3, outputting the processing result of the sample data to be processed through the screening model, wherein the processing result includes a confidence level, extracting an available sample in the sample to be processed according to the processing result, and the confidence level of the screening model for the output of the available sample is smaller than a first threshold value; and step S4, carrying out secondary labeling on the available samples by utilizing the primary labeling, and expanding the available samples containing the secondary labeling into the training data set.
As another example, the electronic device may implement the steps shown in fig. 3.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in a single module or unit; conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A sample data processing method is characterized by comprising the following steps:
step S1: training by using a training data set containing labels to obtain a screening model;
step S2: acquiring sample data to be processed, and carrying out primary labeling on the sample data to be processed;
step S3: outputting a processing result of the sample data to be processed through the screening model, wherein the processing result comprises a confidence level, extracting an available sample from the sample to be processed according to the processing result, and the confidence level of the screening model for the output of the available sample is smaller than a first threshold value;
step S4: and carrying out secondary labeling on the available samples by utilizing the primary labeling, and expanding the available samples containing the secondary labeling into the training data set.
2. The method of claim 1, further comprising:
step S5: updating the screening model through the expanded training data set, and updating the sample data to be processed;
step S6: and circularly executing the steps S3-S5 through the updated screening model and the updated sample data to be processed.
3. The method according to claim 1, wherein in the step S6, when the evaluation index of the screening model satisfies a preset condition, the execution of the loop is stopped.
4. The method according to claim 1, wherein the processing result in step S3 further includes a position or an outline of a target object in the sample data to be processed.
5. The method according to claim 4, wherein the primary labeling comprises labeling a position of a target object in the sample data to be processed, and when the processing result comprises the position of the target object in the sample data to be processed, the extracting available samples in the sample data to be processed according to the processing result comprises:
and extracting sample data to be processed, of which the deviation between the position of the processing result output by the screening model and the position marked for the first time is greater than a second threshold value, as the available sample.
6. The method according to claim 4, wherein the primary labeling comprises labeling a contour of a target object in the sample data to be processed, and when the processing result comprises a contour of a target object in the sample data to be processed, the extracting available samples in the sample data to be processed according to the processing result comprises:
and extracting sample data to be processed, of which the deviation between the contour of the processing result output by the screening model and the contour labeled for the first time is greater than a third threshold value, as the available sample.
7. The method according to claim 6, wherein the processing result comprises a category of a target object in the sample data to be processed;
the secondarily labeling the available samples, and expanding the available samples containing the secondary labels to the training data set, comprises:
obtaining a template corresponding to the target object according to the category of the target object;
carrying out secondary labeling on the available samples to obtain a secondary labeling result;
and calculating a similarity evaluation index of the template and the secondary labeling result, and expanding the training data set by using available samples of which the similarity evaluation index is greater than a fourth threshold value.
8. The method according to claim 7, wherein the primary labeling comprises labeling a category of a target object in the sample data to be processed, and the extracting available samples from the sample data to be processed according to the processing result comprises:
when the category of the target object in the processing result is inconsistent with the category given by the primary labeling, taking the corresponding sample data to be processed as an available sample.
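Claim 8 reduces to a category-disagreement filter; a one-function sketch (all names illustrative):

    def mismatched_samples(samples, predictions, primary_labels):
        """Keep samples whose predicted category disagrees with the primary-labeled category."""
        return [s for s, pred, label in zip(samples, predictions, primary_labels)
                if pred != label]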
9. A sample data processing apparatus, comprising:
a model training module configured to train with a training data set containing labels to obtain a screening model;
a sample labeling module configured to acquire sample data to be processed and perform primary labeling on the sample data to be processed;
a data screening module configured to output, through the screening model, a processing result for the sample data to be processed, the processing result comprising a confidence level, and to extract available samples from the sample data to be processed according to the processing result, wherein the confidence level output by the screening model for an available sample is smaller than a first threshold; and
a sample data determining module configured to perform secondary labeling on the available samples on the basis of the primary labeling, and to expand the training data set with the available samples containing the secondary labeling.
10. An electronic device, comprising:
one or more processors; and
a storage device configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the sample data processing method of any one of claims 1 to 8.
11. A computer readable medium having stored thereon a computer program which, when executed by a processor, implements the sample data processing method of any one of claims 1 to 8.
CN202011324242.8A 2020-11-23 2020-11-23 Sample data processing method and device, electronic equipment and medium Pending CN114548192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011324242.8A CN114548192A (en) 2020-11-23 2020-11-23 Sample data processing method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011324242.8A CN114548192A (en) 2020-11-23 2020-11-23 Sample data processing method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114548192A (en) 2022-05-27

Family

ID=81660245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011324242.8A Pending CN114548192A (en) 2020-11-23 2020-11-23 Sample data processing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114548192A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596338A (en) * 2018-05-09 2018-09-28 四川斐讯信息技术有限公司 A kind of acquisition methods and its system of neural metwork training collection
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised text classification method and device based on active learning
CN111104479A (en) * 2019-11-13 2020-05-05 中国建设银行股份有限公司 Data labeling method and device
CN111783518A (en) * 2020-05-14 2020-10-16 北京三快在线科技有限公司 Training sample generation method and device, electronic equipment and readable storage medium
CN111859872A (en) * 2020-07-07 2020-10-30 中国建设银行股份有限公司 Text labeling method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151491A (en) * 2023-04-20 2023-05-23 天津港电力有限公司 Intelligent power failure prediction platform based on power data
CN116151491B (en) * 2023-04-20 2023-07-18 天津港电力有限公司 Intelligent power failure prediction platform based on power data
CN116756576A (en) * 2023-08-17 2023-09-15 阿里巴巴(中国)有限公司 Data processing method, model training method, electronic device and storage medium
CN116756576B (en) * 2023-08-17 2023-12-12 阿里巴巴(中国)有限公司 Data processing method, model training method, electronic device and storage medium
CN118035444A (en) * 2024-02-20 2024-05-14 安徽彼亿网络科技有限公司 Information extraction method and device based on big data

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN108280477B (en) Method and apparatus for clustering images
CN114548192A (en) Sample data processing method and device, electronic equipment and medium
CN108628830B (en) Semantic recognition method and device
CN111709240A (en) Entity relationship extraction method, device, equipment and storage medium thereof
CN111259112B (en) Medical fact verification method and device
CN111104482A (en) Data processing method and device
CN109918513B (en) Image processing method, device, server and storage medium
CN111209478A (en) Task pushing method and device, storage medium and electronic equipment
CN110209782B (en) Question-answering model and answer sentence generation method and device, medium and electronic equipment
CN114494709A (en) Feature extraction model generation method, image feature extraction method and device
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN110674208A (en) Method and device for determining position information of user
CN111291715A (en) Vehicle type identification method based on multi-scale convolutional neural network, electronic device and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN112749293A (en) Image classification method and device and storage medium
CN113806485B (en) Intention recognition method and device based on small sample cold start and readable medium
CN116166858A (en) Information recommendation method, device, equipment and storage medium based on artificial intelligence
CN112308090B (en) Image classification method and device
CN113515591B (en) Text defect information identification method and device, electronic equipment and storage medium
CN114882283A (en) Sample image generation method, deep learning model training method and device
CN112417260B (en) Localized recommendation method, device and storage medium
CN113569929A (en) Internet service providing method and device based on small sample expansion and electronic equipment
CN111797183A (en) Method and device for mining road attribute of information point and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination