CN110162649B - Sample data acquisition method, acquisition system, server and computer readable medium

Info

Publication number: CN110162649B
Application number: CN201910441621.6A
Authority: CN
Inventors: 杨大陆; 孙旭; 杨叶辉; 王磊; 许言午; 黄艳
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-05-24
Filing date: 2019-05-24
Publication date: 2021-06-18
Anticipated expiration: 2039-05-24
Also published as: CN110162649A

Abstract

The present disclosure provides a sample data acquisition method, including: constructing a mother sample picture database; sampling the mother sample picture database multiple times to obtain a plurality of corresponding mother sample picture sets; for each mother sample picture set, extracting a plurality of sub-sample pictures from each mother sample picture in the set by using a selecting frame of a preset size, and assigning each sub-sample picture a preliminary class mark, to obtain a sub-sample picture set corresponding to each mother sample picture set; for each sub-sample picture set, training a corresponding sample classification model by using all sub-sample pictures in the set as training sample data; and, for each sub-sample picture, inputting the sub-sample picture into each sample classification model and selecting the classification result that occurs most frequently as the calibration class mark of the sub-sample picture.

Description

Sample data acquisition method, acquisition system, server and computer readable medium

Technical Field

The present disclosure relates to the field of deep learning, and in particular, to a sample data acquisition method, an acquisition system, a server, and a computer-readable medium.

Background

When training a detection model for a specific task based on deep learning techniques, a large amount of training sample data with calibration class marks needs to be collected in advance.

However, in practice it has been found that for some special tasks it is difficult to obtain a large number of labeled small-size samples (that is, samples with calibration class marks). For example, in a focus (lesion) detection task for fundus pictures, a trained detection model is used to detect whether a focus exists in a fundus picture and, when one exists, to locate it; this requires a large amount of focus-labeled sample data at the small-size (patch) level or the pixel level. At present, such samples can only be collected by manually selecting and labeling the focus in the fundus picture. Manual sampling has the following problems: 1) labeling is difficult: because lesion morphology is complex, doctors' labels are highly subjective, boundaries are drawn somewhat arbitrarily, and even a professional ophthalmologist finds it hard to decide which class the pixels on a lesion boundary belong to; 2) a large number of samples is hard to obtain: manual labeling by doctors is time-consuming, so the cost of obtaining calibration classes for samples is high.

Disclosure of Invention

The present disclosure aims to solve at least one of the technical problems in the prior art, and provides a sample data acquisition method, a sample data acquisition system, a server, and a computer readable medium.

In a first aspect, an embodiment of the present disclosure provides a sample data obtaining method, including:

constructing a mother sample picture database, wherein the mother sample picture database comprises: a plurality of mother sample pictures with calibration class marks;

sampling the mother sample picture database for multiple times to obtain a plurality of corresponding mother sample picture sets, wherein each mother sample picture set comprises a plurality of mother sample pictures;

for each mother sample picture set, extracting a plurality of sub sample pictures from each mother sample picture in the mother sample picture set by using a selecting frame with a preset size, and endowing each sub sample picture with a preliminary class mark to obtain a sub sample picture set corresponding to each mother sample picture set, wherein the preliminary class mark of the sub sample picture is a calibration class mark of the mother sample picture to which the sub sample picture belongs, and the size of the selecting frame is smaller than that of the mother sample picture;

for each sub-sample picture set, training a sample classification model corresponding to each sub-sample picture set by taking all the sub-sample pictures contained in the sub-sample picture set and the preliminary class marks corresponding to each sub-sample picture as training sample data;

and respectively inputting the sub-sample picture into each sample classification model aiming at each sub-sample picture in each sub-sample picture set so that each sample classification model respectively outputs a corresponding classification result, and selecting the classification result with the largest frequency as a calibration class mark of the sub-sample picture.

In some embodiments, the mother sample picture is square in shape;

the selecting frame is square in shape;

and the ratio of the side length of the selecting frame to the side length of the mother sample picture is equal to a first predetermined coefficient q, wherein 0 < q < 1.

In some embodiments, after the step of inputting, for each sub-sample picture in each sub-sample picture set, the sub-sample picture into each sample classification model respectively, so that each sample classification model outputs a corresponding classification result respectively, and selecting the classification result with the largest frequency as the calibration class label of the sub-sample picture, the method further includes:

judging whether the side length of the sub-sample picture is less than or equal to a preset length threshold value or not;

when the side length of the sub-sample picture is judged to be less than or equal to the preset length threshold, the process is ended;

and when the side length of the sub-sample picture is judged to be larger than the preset length threshold, taking the sub-sample picture with the calibration class mark as a new mother sample picture, constructing a new mother sample picture database, and continuously performing the step of sampling the mother sample picture database for multiple times based on the new mother sample picture database to obtain a plurality of corresponding mother sample picture sets.

In some embodiments, after the step of inputting, for each sub-sample picture in each sub-sample picture set, the sub-sample picture into each sample classification model respectively, so that each sample classification model outputs a corresponding classification result respectively, and selecting the classification result with the largest frequency as the calibration class mark of the sub-sample picture, the method further includes:

monitoring whether the cumulative number of times that step has been executed in the loop reaches a preset count threshold;

when it is monitored that the cumulative number of executions has not reached the preset count threshold, taking the sub-sample picture with the calibration class mark as a new mother sample picture, constructing a new mother sample picture database, and continuing to perform, based on the new mother sample picture database, the step of sampling the mother sample picture database multiple times to obtain a plurality of corresponding mother sample picture sets;

and when it is monitored that the cumulative number of executions reaches the preset count threshold, ending the process.
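The two stopping rules described above (side length versus a length threshold, and cumulative execution count versus a count threshold) can be illustrated with a minimal Python sketch. The function names and the concrete threshold values are assumptions made for illustration only; the body of one round (sampling, cropping, training, voting) is deliberately left out.

```python
def side_lengths_until_threshold(side, q, length_threshold):
    """Return the sub-sample side length after each round.

    Each round crops with a square frame whose side is q times the current
    side, and the loop ends once the side is <= length_threshold.
    """
    if not 0 < q < 1:
        raise ValueError("q must satisfy 0 < q < 1")
    if length_threshold <= 0:
        raise ValueError("length_threshold must be positive")
    history = []
    while True:
        side = side * q              # sub-samples become the new mother pictures
        history.append(side)
        if side <= length_threshold:
            return history


def rounds_with_count_limit(count_threshold):
    """Count-based rule: repeat the round until the count reaches the threshold."""
    if count_threshold < 1:
        raise ValueError("count_threshold must be at least 1")
    executed = 0
    while True:
        executed += 1                # one pass of sampling, cropping, training, voting
        if executed >= count_threshold:
            return executed
```

For instance, with a side length of 1600 pixels, q = 0.6 and an assumed length threshold of 400 pixels, the side shrinks to about 960, 576 and 345.6 pixels, so the first rule would stop after three rounds.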

In some embodiments, the first predetermined coefficient q satisfies: 0.5 ≤ q ≤ 0.7.

In some embodiments, in the step of sampling the mother sample picture database for multiple times to obtain a plurality of corresponding mother sample picture sets, the number of mother sample pictures included in each mother sample picture set is equal;

the ratio of the number of the mother sample pictures contained in one mother sample picture set to the number of the mother sample pictures contained in the mother sample picture database is equal to a second predetermined coefficient p, wherein 0 < p < 1.

In some embodiments, the second predetermined coefficient p satisfies: 0.4 ≤ p ≤ 0.6.

In some embodiments, in the step of extracting a plurality of sub-sample pictures from each of the mother sample pictures in the mother sample picture set by using a selecting frame with a preset size, the number of sub-sample pictures extracted from one mother sample picture is a predetermined number N;

wherein the predetermined number N is a positive integer, and 3 ≤ N ≤ 10.

In some embodiments, the step of constructing the mother sample picture database includes:

collecting a plurality of original sample pictures with calibration class marks;

carrying out size adjustment processing on the original sample picture so as to unify the size of the original sample picture;

and taking the original sample picture subjected to the size adjustment as a mother sample picture to construct a mother sample picture database.

In a second aspect, an embodiment of the present disclosure further provides a sample data acquiring system, including:

the first construction module is used for constructing a mother sample picture database, and the mother sample picture database comprises: a plurality of mother sample pictures with calibration class marks;

the sampling module is used for sampling the mother sample picture database for multiple times to obtain a plurality of corresponding mother sample picture sets, and each mother sample picture set comprises a plurality of mother sample pictures;

the extraction module is used for extracting a plurality of sub-sample pictures from each mother sample picture in the mother sample picture set by adopting a selecting frame with a preset size aiming at each mother sample picture set, and endowing each sub-sample picture with a preliminary class mark to obtain the sub-sample picture set corresponding to each mother sample picture set, wherein the preliminary class mark of the sub-sample picture is a calibration class mark of the mother sample picture to which the sub-sample picture belongs, and the size of the selecting frame is smaller than that of the mother sample picture;

a training module, configured to train, for each sub-sample picture set, a sample classification model corresponding to each sub-sample picture set by using, as training sample data, all the sub-sample pictures included in the sub-sample picture set and a preliminary class label corresponding to each sub-sample picture;

and the processing module is used for respectively inputting each sub-sample picture in each sub-sample picture set into each sample classification model so that each sample classification model respectively outputs a corresponding classification result, and selecting the classification result with the largest frequency as the calibration class mark of the sub-sample picture.

In some embodiments, the mother sample picture is square in shape;

the selecting frame is square in shape.

In some embodiments, the system further includes:

the judging module is used for judging whether the side length of each sub sample picture is less than or equal to a preset length threshold value or not after the processing module determines the calibration class mark of each sub sample picture in each sub sample picture set;

the second construction module is used for constructing a new mother sample picture database by taking the sub sample picture with the calibration class mark as a new mother sample picture when the judgment module judges that the side length of the sub sample picture is greater than the preset length threshold, and controlling the sampling module to continuously execute corresponding processing based on the new mother sample picture database;

and the first control module is used for controlling the sample data acquisition system to stop working when the judging module judges that the side length of the sub-sample picture is less than or equal to the preset length threshold.

In some embodiments, the system further includes:

the monitoring module is used for monitoring whether the accumulated circulating execution times of the processing module reaches a preset time threshold value or not after the processing module determines the calibration type mark of each sub sample picture in each sub sample picture set;

the third construction module is used for constructing a new mother sample picture database by taking the sub-sample picture with the calibration class mark as a new mother sample picture when the monitoring module monitors that the accumulated number of times of the circulating execution does not reach the preset number threshold, and controlling the sampling module to continuously execute corresponding processing based on the new mother sample picture database;

and the second control module is used for controlling the sample data acquisition system to stop working when the monitoring module monitors that the accumulated times of the cyclic execution reaches the preset time threshold.

In some embodiments, in the process that the sampling module samples the mother sample picture database multiple times to obtain a plurality of corresponding mother sample picture sets, the number of mother sample pictures included in each mother sample picture set is equal.

In some embodiments, in the process that the extraction module extracts a plurality of sub-sample pictures from each of the mother sample pictures in the mother sample picture set by using a selecting frame with a preset size, the number of sub-sample pictures extracted from one mother sample picture is a predetermined number N.

In some embodiments, the first construction module includes:

the acquisition unit is used for acquiring a plurality of original sample pictures with calibration type marks;

the size adjusting unit is used for carrying out size adjustment processing on the original sample picture so as to unify the size of the original sample picture;

and the construction unit is used for taking the original sample picture subjected to the size adjustment processing as a mother sample picture so as to construct a mother sample picture database.

In a third aspect, an embodiment of the present disclosure further provides a server, including:

one or more processors;

a storage device having one or more programs stored thereon;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method as provided by any of the preceding embodiments.

In a fourth aspect, the disclosed embodiments also provide a computer readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the method as provided in any of the foregoing embodiments.

The present disclosure has the following beneficial effects:

the embodiment of the disclosure provides a sample data acquisition method, which can extract a large number of small-sized sample pictures from large-sized sample pictures and automatically label the small-sized sample pictures.

Drawings

Fig. 1 is a flowchart of a sample data acquisition method according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a specific implementation of step S1 in the present disclosure;

Fig. 3 is a flowchart of another sample data acquisition method according to an embodiment of the present disclosure;

Fig. 4 is a flowchart of yet another sample data acquisition method according to an embodiment of the present disclosure;

Fig. 5 is a block diagram of a sample data acquisition system according to an embodiment of the present disclosure;

Fig. 6 is a block diagram of the first construction module in the present disclosure;

Fig. 7 is a block diagram of another sample data acquisition system according to an embodiment of the present disclosure;

Fig. 8 is a block diagram of yet another sample data acquisition system according to an embodiment of the present disclosure.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present disclosure, a sample data acquiring method, an acquiring system, a server and a computer readable medium provided by the present disclosure are described in detail below with reference to the accompanying drawings.

Example embodiments will be described more fully hereinafter with reference to the accompanying drawings; they may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. Thus, a first element, component, or section discussed below could be termed a second element, component, or section without departing from the teachings of the present disclosure.

Embodiments described herein may be described with reference to plan and/or cross-sectional views in light of idealized schematic illustrations of the disclosure. Accordingly, the example illustrations can be modified in accordance with manufacturing techniques and/or tolerances. Accordingly, the embodiments are not limited to the embodiments shown in the drawings, but include modifications of configurations formed based on a manufacturing process. Thus, the regions illustrated in the figures have schematic properties, and the shapes of the regions shown in the figures illustrate specific shapes of regions of elements, but are not intended to be limiting.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The sample data acquisition method is used for acquiring labeled sample data for a predetermined task, where the labeled sample data may be positive sample data or negative sample data for the predetermined task. The predetermined task may be any task to which deep learning technology is applicable, such as a segmentation task, a classification task, a positioning task, or an identification task; the technical solution of the present disclosure does not limit the specific type of the predetermined task.

In addition, "labeled sample data" in the present disclosure refers to picture samples with calibration class marks; the types and number of the calibration class marks are set manually in advance according to the specific predetermined task. For example, if the predetermined task is a focus detection task for fundus pictures, the calibration class marks may be set as two classes, "focus" samples and "non-focus" samples; alternatively, the calibration class marks may be refined as needed so that a subsequently trained detection model can identify the specific type of a focus, for example by setting multiple classes such as "hemorrhage-type focus" samples, "exudate-type focus" samples, "cotton-wool-spot-type focus" samples, …, and "non-focus" samples. It should be noted that the technical solution of the present disclosure limits neither the types nor the number of the calibration class marks.

Fig. 1 is a flowchart of a sample data acquisition method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes:

Step S1, constructing a mother sample picture database, the mother sample picture database including a plurality of mother sample pictures with calibration class marks.

Fig. 2 is a flowchart illustrating a specific implementation of step S1 in the present disclosure, and as shown in fig. 2, step S1 includes:

step S101, collecting a plurality of original sample pictures with calibration type marks.

In the present disclosure, original sample pictures refer to large-size sample pictures that have been labeled (with a calibration class) for the predetermined task and have not undergone any processing. In practical applications, positive sample data for the predetermined task are far harder to acquire and far more important than negative sample data, so as much positive sample data as possible should be acquired. For this purpose, the original sample pictures should, as far as possible, be chosen from pictures whose calibration class corresponds to a positive sample.

In order to facilitate a better understanding of the technical solutions of the present disclosure, the following description will be exemplarily made by taking a case where a predetermined task is a focus detection task for a fundus image and a preset calibration class mark includes two classes of "focus" and "non-focus". Wherein, the picture marked with the calibration class as 'focus' can be used as a positive sample, and the picture marked with the calibration class as 'non-focus' can be used as a negative sample. It should be understood by those skilled in the art that the above-described setting is only exemplary, and does not limit the technical solution of the present disclosure.

In step S101, large-size fundus pictures with calibration class marks (which may be labeled manually in advance) may be taken as original sample pictures; in order to obtain as many positive samples as possible when the sample data acquisition method provided by the present disclosure finishes, large-size fundus pictures whose calibration class is "focus" are preferably selected as the original sample pictures.

Step S102, performing a size adjustment process on the original sample picture to unify the size of the original sample picture.

In step S102, considering that the sizes of different original sample pictures may be different, in order to facilitate subsequent uniform processing of different original sample pictures, a size adjustment (Resize) process needs to be performed on the original sample pictures to unify the sizes of the original sample pictures.

Taking the processing of fundus pictures as an example, if the width of a fundus picture is larger than its height, the left and right side portions of the picture can first be cut off so that the picture becomes square; the cropped fundus pictures are then uniformly resized to a set size, which can be designed and adjusted according to actual conditions. As an alternative embodiment, the fundus picture after Resize processing is square with a size of H × H, where H is 1600 pixels.

The implementation of Resize processing on pictures to achieve size uniformity is conventional in the art and will not be described in detail here.
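As a rough illustration of step S102, the following Python sketch crops a picture to a square and then resizes it. It assumes the picture is a plain list of pixel rows, that the crop is centered (the text only says the left and right portions are cut off), and that nearest-neighbour interpolation is acceptable; none of these choices is prescribed by the disclosure, and the function names are hypothetical.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def crop(pixels, box):
    """Crop a picture given as a list of rows to the box (left, top, right, bottom)."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


def resize_nearest(pixels, target):
    """Nearest-neighbour resize of a square picture to target x target pixels."""
    src = len(pixels)
    return [
        [pixels[(y * src) // target][(x * src) // target] for x in range(target)]
        for y in range(target)
    ]


def to_mother_picture(pixels, target):
    """Crop to a centered square, then resize to the unified size."""
    height, width = len(pixels), len(pixels[0])
    square = crop(pixels, center_square_box(width, height))
    return resize_nearest(square, target)
```

In practice an image library would normally perform the resize; the pure-Python version above is only meant to make the crop-then-resize order explicit.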

Step S103, taking the original sample pictures after the size adjustment as mother sample pictures to construct a mother sample picture database.

In step S103, the original sample pictures whose resizing is completed are used as mother sample pictures to construct the mother sample picture database, which includes a plurality of mother sample pictures with calibration class marks.

It should be noted that performing Resize processing on the original sample pictures to unify their sizes is a preferred embodiment of the present disclosure: it facilitates subsequent uniform processing of the original sample pictures and improves processing efficiency, but it does not limit the technical solution of the present disclosure.

Step S2, performing multiple sampling on the mother sample picture database to obtain a plurality of corresponding mother sample picture sets, where each mother sample picture set includes a plurality of mother sample pictures.

In step S2, the mother sample picture database may be sampled multiple times by random sampling or by sampling based on a certain rule, and the multiple samplings may be performed with or without replacement. Each sampling collects a plurality of mother sample pictures, and the mother sample pictures collected in one sampling constitute one mother sample picture set.

As a specific alternative, the mother sample picture database is randomly sampled multiple times with replacement, and the number of mother sample pictures collected in each sampling is equal. It should be noted that, with this sampling method, different mother sample picture sets may intersect.

Further, assuming that the number of the mother sample pictures included in the mother sample picture database is denoted as C, the number of the mother sample pictures included in each mother sample picture set may be p × C, that is, the ratio of the number of the mother sample pictures included in one mother sample picture set to the number of the mother sample pictures included in the mother sample picture database is p, where p is greater than 0 and less than 1, and the specific value of p may be designed and adjusted according to actual situations.
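A minimal Python sketch of this sampling step follows. It assumes that pictures within one set are distinct while the database is left unchanged between samplings (so different sets may overlap), and that p × C is rounded to an integer; the function name, the `seed` parameter and the rounding rule are illustrative assumptions.

```python
import random


def sample_mother_sets(database, p, m, seed=None):
    """Draw m mother sample picture sets, each holding about p * C pictures.

    database: list of mother sample pictures (any objects), C = len(database).
    """
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    if m < 2:
        raise ValueError("m must be at least 2")
    rng = random.Random(seed)
    set_size = round(p * len(database))   # rounding is an assumption
    if set_size < 1:
        raise ValueError("p * C is too small to form a set")
    # Each call samples from the full database again, so sets may intersect.
    return [rng.sample(database, set_size) for _ in range(m)]
```

For example, `sample_mother_sets(list(range(10)), p=0.5, m=4)` yields four sets of five picture identifiers each.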

It should be noted that the larger the value of p, the more mother sample pictures two different mother sample picture sets have in common and the smaller the difference between them, which is unfavorable for the training and labeling in the subsequent steps S4 and S5; conversely, the smaller the value of p, the fewer mother sample pictures each set contains, and the fewer samples are finally obtained when the sample data acquisition method ends. Considering these factors together, it is preferred in this disclosure that 0.4 ≤ p ≤ 0.6; further preferably, p = 0.5.

In addition, the number of the mother sample picture sets obtained in step S2 is denoted as M, M is a preset positive integer greater than or equal to 2, and a specific value of M can be designed and adjusted according to actual conditions.

In this disclosure, so that each mother sample picture in the mother sample picture database is, as far as possible, sampled into at least one mother sample picture set, the value of M × p should be greater than 1. The larger the value of M × p, the higher the probability that a mother sample picture in the database is sampled into some mother sample picture set; however, a larger M × p also increases the subsequent processing load of the system. Taking these factors into consideration, it is preferable in the present disclosure that 1 < M × p < 10.
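The role of M × p can be made concrete with a short calculation. The sketch below assumes that every set is an independent, uniformly random selection of exactly p × C distinct pictures, so that a given picture lands in any one set with probability p; under that assumption M × p is the expected number of sets containing a given picture, and 1 − (1 − p)^M is the probability that it lands in at least one set. The function names are hypothetical.

```python
def coverage_probability(p, m):
    """Probability that a given mother picture appears in at least one of m sets."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 - (1.0 - p) ** m


def expected_memberships(p, m):
    """Expected number of sets that contain a given mother picture (M x p)."""
    return m * p


print(coverage_probability(0.5, 4))   # 0.9375
print(expected_memberships(0.5, 4))   # 2.0
```

With p = 0.5 and an assumed M = 4, M × p = 2 and roughly 94% of the mother sample pictures are expected to appear in at least one set.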

Step S3, aiming at each mother sample picture set, extracting a plurality of sub sample pictures from each mother sample picture in the mother sample picture set by adopting a selecting frame with a preset size, and endowing each sub sample picture with a preliminary class mark to obtain the sub sample picture set corresponding to each mother sample picture set, wherein the preliminary class mark of the sub sample picture is the calibration class mark of the mother sample picture to which the sub sample picture belongs, and the size of the selecting frame is smaller than that of the mother sample picture.

In step S3, when multiple sub-sample pictures are extracted from one mother sample picture using the selecting frame, they may be extracted randomly or according to a certain rule, both of which fall within the scope of the present disclosure. In addition, among the sub-sample pictures extracted from one mother sample picture, some may partially overlap; this does not affect the technical solution of the present disclosure.

As an alternative, the mother sample picture is square and the selecting frame is square. Assuming that the side length of the mother sample picture is H, the preset side length of the selecting frame can be q × H, that is, the ratio of the side length of the selecting frame to the side length of the mother sample picture is equal to a first predetermined coefficient q, where 0 < q < 1, and the specific value of q can be designed and adjusted according to the actual situation.

It should be noted that, for a fixed side length of the mother sample picture, the larger the value of q, the larger the selecting frame and the larger the resulting sub-sample pictures, making it hard to meet the user's requirement for "small size"; the smaller the value of q, the smaller the selecting frame and the lower the probability that the frame captures a positive sample. Considering these factors together, in the present disclosure it is preferable that the first predetermined coefficient q satisfies 0.5 ≤ q ≤ 0.7; further preferably, q = 0.6.
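The random placement of a square selecting frame of side q × H inside a square mother sample picture of side H might be sketched as follows. Uniformly random top-left corners are only one of the placements the text allows, and the function name, the `seed` parameter and the rounding of q × H to whole pixels are assumptions.

```python
import random


def random_frame_boxes(side, q, n, seed=None):
    """Return n boxes (left, top, right, bottom) of side round(q * side).

    Boxes are placed uniformly at random inside a side x side picture and
    are allowed to overlap one another.
    """
    if not 0 < q < 1:
        raise ValueError("q must satisfy 0 < q < 1")
    rng = random.Random(seed)
    frame = round(q * side)          # frame side length in whole pixels
    max_offset = side - frame        # largest valid top-left coordinate
    boxes = []
    for _ in range(n):
        left = rng.randint(0, max_offset)   # randint includes both endpoints
        top = rng.randint(0, max_offset)
        boxes.append((left, top, left + frame, top + frame))
    return boxes
```

With H = 1600 and q = 0.6, every box is 960 × 960 pixels and lies entirely inside the mother sample picture.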

For convenience of description, assume that N sub-sample pictures (N is a preset positive integer greater than 1) are extracted from each mother sample picture. For the mother sample pictures in one mother sample picture set, N × p × C sub-sample pictures can then be extracted in total, and these N × p × C sub-sample pictures form one sub-sample picture set. Therefore, in step S3, M sub-sample picture sets can be obtained, and each sub-sample picture set includes N × p × C sub-sample pictures.

As an alternative, the predetermined number N satisfies: 3 ≤ N ≤ 10.

A corresponding preliminary class mark is assigned to each extracted sub-sample picture, where the preliminary class mark of a sub-sample picture is the calibration class mark of the mother sample picture to which the sub-sample picture belongs.
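The inheritance of the preliminary class mark can be sketched in a few lines of Python. The record layout (a dictionary with `parent`, `crop_index` and `preliminary_label` keys) and the function name are assumptions chosen for illustration; the actual cropping is omitted so that only the labeling rule is shown.

```python
def build_sub_sample_set(mother_set, n):
    """Create n sub-sample records per mother picture, inheriting its class mark.

    mother_set: list of (picture_id, calibration_label) pairs.
    """
    sub_samples = []
    for picture_id, calibration_label in mother_set:
        for k in range(n):
            sub_samples.append({
                "parent": picture_id,
                "crop_index": k,
                # Preliminary class mark = calibration class mark of the parent.
                "preliminary_label": calibration_label,
            })
    return sub_samples
```

A mother sample picture set with p × C pictures therefore produces N × p × C records, matching the count given above.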

Step S4, for each sub-sample picture set, training a sample classification model corresponding to each sub-sample picture set by using all sub-sample pictures included in the sub-sample picture set and the preliminary class labels corresponding to the sub-sample pictures as training sample data.

In step S4, based on the deep learning technique, a sample classification model corresponding to the sub-sample picture set can be trained according to all sub-sample pictures in the sub-sample picture set and the preliminary class labels corresponding to the sub-sample pictures, and the sample classification model can be used to classify the input samples. It should be noted that the process of training a corresponding model according to a sample based on a deep learning technique is a conventional technique in the art and will not be described in detail here.

Through step S4, M sample classification models corresponding to the M sub-sample picture sets one-to-one can be trained.

Step S5, for each sub-sample picture in each sub-sample picture set, inputting the sub-sample picture into each sample classification model respectively, so that each sample classification model outputs a corresponding classification result, and selecting the classification result with the largest frequency as the calibration label of the sub-sample picture.

Based on the foregoing step S3, the M sub-sample picture sets include M × N × p × C sub-sample pictures in total, each with a size of qH × qH. In step S5, each of the M × N × p × C sub-sample pictures is input into the M sample classification models respectively to obtain M classification results; the M classification results are counted by class, and the single classification result with the largest frequency is selected as the calibration class label of the sub-sample picture. In this way, step S5 configures a corresponding calibration class label for each of the M × N × p × C sub-sample pictures (that is, the sub-sample pictures are labeled automatically).
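The voting in step S5 can be sketched as follows. This is a minimal illustration, assuming that each trained sample classification model can be called as a function that returns a class label; the name calibrate_label and the stand-in models are hypothetical. The disclosure does not state how ties are resolved; in this sketch a tie goes to the label that was voted for first.

```python
from collections import Counter


def calibrate_label(sub_sample, models):
    """Feed one sub-sample picture to every model and return the most frequent label."""
    votes = [model(sub_sample) for model in models]
    # most_common(1) returns [(label, count)]; ties keep first-seen order
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Stand-in "models" (M = 3) that ignore the input; real models would be
# the classifiers trained in step S4.
models = [lambda x: 1, lambda x: 0, lambda x: 1]
calibrated = calibrate_label(sub_sample=None, models=models)
print(calibrated)
```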

Based on the above, by performing the above steps S1 to S5 once, M × N × p × C sub-sample pictures with a size of qH × qH can be obtained from C large-size sample pictures with a size of H × H, and automatic labeling of the M × N × p × C sub-sample pictures is realized. It should be noted that, among the M × N × p × C sub-sample pictures, some may be used as positive samples and some as negative samples.

In the present disclosure, by performing the above-mentioned steps S2 to S5 in a loop, more sub-sample pictures with smaller size and automatically labeled can be obtained. The following description will be made in conjunction with specific embodiments.

Fig. 3 is a flowchart of another sample data obtaining method provided in the embodiment of the present disclosure, and as shown in fig. 3, the sample data obtaining method includes:

In step S1, the number of the mother sample pictures included in the mother sample picture database is C; the shape of the mother sample picture is square with side length H.

In step S2, the number of the parent sample picture sets is M, and the ratio of the number of the parent sample pictures included in each parent sample picture set to the number of the parent sample pictures included in the parent sample picture database is equal to a second predetermined coefficient p.

In step S3, the shape of the cull box is a square, and the ratio of the side length of the cull box to the side length of the parent sample picture is equal to a first predetermined coefficient q; and extracting N sub-sample pictures from each mother sample picture.

Step S6a, judging whether the side length of the sub-sample picture is less than or equal to a predetermined length threshold.

In step S6a, the specific value of the predetermined length threshold is manually preset according to the size of the training sample picture required by the predetermined task. For example, when the predetermined task is a lesion detection task for a fundus picture, the predetermined length threshold may be designed to be 16 pixels in consideration that the ideal size of a required training sample picture should be less than or equal to 16 × 16 (unit: pixel).

When step S6a determines that the side length of the sub-sample picture is less than or equal to the predetermined length threshold, the size of the sub-sample pictures obtained by the most recent execution of step S5 meets the predetermined requirement; each of these sub-sample pictures can be used as a required training sample picture, and the process ends. When step S6a determines that the side length of the sub-sample picture is greater than the predetermined length threshold, the sub-sample pictures obtained by the most recent execution of step S5 are still too large, the extraction of smaller sub-sample pictures needs to continue, and step S7a is executed next.

Step S7a, taking the sub-sample pictures with calibration class labels as new mother sample pictures, and constructing a new mother sample picture database.

After the step S7a is finished, the step S2 is executed again based on the new mother sample picture database to execute the steps S2 to S7a in a loop until the step S6a in a certain loop determines that the side length of the sub sample picture is less than or equal to the predetermined length threshold. It should be noted that, for specific descriptions of the steps S1 through S5 in this embodiment, reference may be made to corresponding contents in the foregoing embodiments, and details are not described here again.
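The control flow of fig. 3 can be sketched as below. This is only an outline under stated assumptions: run_round is a hypothetical callable standing for one pass of steps S2 to S5, and for brevity the sketch tracks only the side length of the pictures rather than the pictures themselves. The loop terminates only if each round actually shrinks the side, which holds whenever 0 < q < 1.

```python
def loop_until_small(side, length_threshold, run_round):
    """Repeat steps S2-S5 until the sub-sample side is <= length_threshold."""
    rounds = 0
    while True:
        side = run_round(side)        # steps S2-S5: returns the sub-sample side
        rounds += 1
        if side <= length_threshold:  # step S6a: size requirement met, stop
            return rounds, side
        # step S7a: the labeled sub-samples become the new mother sample
        # picture database, so the next round starts from the current side


# Example with H = 1000, q = 0.5 and a 16-pixel threshold.
rounds, final_side = loop_until_small(1000, 16, lambda s: 0.5 * s)
print(rounds, final_side)
```

In this example the side lengths run 500, 250, 125, 62.5, 31.25, 15.625, so the loop stops after the sixth round.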

In the above-mentioned process of executing steps S2 to S7a in a loop, when step S5 has been completed i times, the number of obtained sub-sample pictures with calibration class labels is (M × N × p)ⁱ × C, and the side length of each sub-sample picture is qⁱ × H, where i is a positive integer.
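The two expressions above can be evaluated directly. The helper below is a small sketch (the name after_rounds and the example values are illustrative, not taken from the disclosure) that returns the number of labeled sub-sample pictures, (M × N × p)ⁱ × C, and their side length, qⁱ × H, after step S5 has completed i times.

```python
def after_rounds(c, h, m, n, p, q, i):
    """Return (count, side) after step S5 has been completed i times."""
    if i < 1:
        raise ValueError("i must be a positive integer")
    count = (m * n * p) ** i * c  # (M x N x p)^i x C
    side = (q ** i) * h           # q^i x H
    return count, side


# Illustrative values: C = 100 pictures of side H = 1000, M = 5, N = 4, p = 0.5, q = 0.5.
count, side = after_rounds(c=100, h=1000, m=5, n=4, p=0.5, q=0.5, i=2)
print(count, side)
```

Here M × N × p = 10, so two rounds multiply the number of pictures by 100 while the side length shrinks to a quarter of H.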

By the sample data acquisition method shown in fig. 3, small-sized sample pictures with side lengths less than or equal to the predetermined length threshold can be extracted from the large-sized sample pictures, and the small-sized sample pictures are automatically labeled. Meanwhile, the size of the finally obtained sub-sample picture can be controlled based on the 'predetermined length threshold'.

Fig. 4 is a flowchart of another sample data obtaining method provided by the embodiment of the present disclosure, and as shown in fig. 4, unlike the scheme in fig. 3 that the size of the finally obtained sub-sample picture is controlled based on the "predetermined length threshold", the embodiment shown in fig. 4 controls the size of the finally obtained sub-sample picture based on the accumulated number of times of loop execution of step S5. The sample data acquisition method comprises the following steps:

To monitor the cumulative number of loop executions of step S5, a counter variable i may be configured, where i represents the cumulative number of loop executions of step S5. Before step S2 is executed, the counter may be initialized, that is, i is set to 0. It should be noted that the operation of setting i to 0 may be performed before step S1 (no corresponding figure is given) or between step S1 and step S2 (see fig. 4), both of which fall within the protection scope of the present disclosure.

Note that each time step S5 is executed, i is incremented by 1 so as to count the cumulative number of loop executions of step S5.

Step S6b, monitoring whether the cumulative number i of loop executions of step S5 reaches a predetermined number threshold I.

In step S6b, when it is monitored that the cumulative number i of loop executions of step S5 reaches the predetermined number threshold I, the flow ends; when it is monitored that the cumulative number i of loop executions of step S5 has not reached the predetermined number threshold I, step S7b is executed.

Step S7b, taking the sub-sample pictures with calibration class labels as new mother sample pictures, and constructing a new mother sample picture database.

After step S7b is finished, step S2 is executed again based on the new mother sample picture database, so that steps S2 to S7b are executed in a loop until step S6b in a certain loop determines that the cumulative number i of loop executions of step S5 reaches the predetermined number threshold I. It should be noted that, for specific descriptions of steps S1 through S5 in this embodiment, reference may be made to the corresponding contents in the foregoing embodiments, and details are not described here again.
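The counter-based control flow of fig. 4 can be sketched as below. As before, run_round is a hypothetical callable standing for one pass of steps S2 to S5; it takes the current mother sample picture database and returns the labeled sub-sample pictures. In this toy example the database is represented only by a list of side lengths, each picture yields two sub-samples, and q = 0.5; these choices are assumptions for illustration.

```python
def loop_fixed_rounds(database, number_threshold, run_round):
    """Repeat steps S2-S5 until they have been executed number_threshold times."""
    i = 0                               # initialize the counter before step S2
    while True:
        database = run_round(database)  # steps S2-S5
        i += 1                          # count this execution of step S5
        if i >= number_threshold:       # step S6b: threshold reached, stop
            return database, i
        # step S7b: the labeled sub-samples serve as the new mother
        # sample picture database for the next round


def toy_round(sides):
    # Stand-in for S2-S5: two sub-samples per picture, each half the side.
    return [0.5 * s for s in sides for _ in range(2)]


final_db, executed = loop_fixed_rounds([1024], number_threshold=3, run_round=toy_round)
print(executed, final_db)
```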

In the above-mentioned process of executing steps S2 to S7b in a loop, when step S5 has been completed i times, the number of obtained sub-sample pictures with calibration class labels is (M × N × p)ⁱ × C, and the side length of each sub-sample picture is qⁱ × H, where i is a positive integer.

It should be noted that the specific value of the predetermined number threshold is preset manually according to the size of the training sample picture required by the predetermined task. For example, when the predetermined task is a lesion detection task for a fundus picture, assuming that the side length of the mother sample picture in step S1 is 1600 pixels and the first predetermined coefficient q in step S3 is 0.6, it can be calculated in advance that the size of the sub-sample picture obtained after the 9th execution of step S5 is approximately 16 × 16 (unit: pixel), since 0.6⁹ × 1600 ≈ 16.1. In this case, the predetermined number threshold may be set to 9.
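The advance calculation in this example can be reproduced with a few lines of code. The sketch below picks the number of rounds whose resulting side length qⁱ × H is closest to a desired side; this "closest side" criterion and the name choose_number_threshold are assumptions for illustration, since the disclosure only says that the threshold is preset manually.

```python
def choose_number_threshold(h, q, target_side, max_rounds=100):
    """Return (rounds, side) where q**rounds * h is closest to target_side."""
    best_i = min(
        range(1, max_rounds + 1),
        key=lambda i: abs((q ** i) * h - target_side),
    )
    return best_i, (q ** best_i) * h


# Values from the example: H = 1600 pixels, q = 0.6, desired side 16 pixels.
threshold, side = choose_number_threshold(h=1600, q=0.6, target_side=16)
print(threshold, round(side, 2))
```

For these values the 8th round gives a side of about 26.9 pixels, the 9th about 16.1 pixels and the 10th about 9.7 pixels, so 9 rounds come closest to 16 pixels.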

By the sample data acquisition method shown in fig. 4, small-sized sample pictures can be extracted from the large-sized sample pictures, and the small-sized sample pictures are automatically labeled. Meanwhile, the size of the finally obtained sub-sample picture can be controlled based on the "predetermined number threshold".

Fig. 5 is a block diagram of a sample data acquiring system according to an embodiment of the present disclosure, and as shown in fig. 5, the sample data acquiring system may be used to implement the sample data acquiring method provided in the foregoing embodiments, and the sample data acquiring system includes: the system comprises a first building module 1, a sampling module 2, an extraction module 3, a training module 4 and a processing module 5.

The first building module 1 is used for constructing a mother sample picture database, and the mother sample picture database includes a plurality of mother sample pictures with calibration class labels.

The sampling module 2 is used for sampling the mother sample picture database for multiple times to obtain a plurality of corresponding mother sample picture sets, and each mother sample picture set comprises a plurality of mother sample pictures;

the extraction module 3 is configured to, for each mother sample picture set, extract a plurality of sub sample pictures from each mother sample picture in the mother sample picture set by using a culling frame with a predetermined size, and assign a preliminary category label to each sub sample picture to obtain a sub sample picture set corresponding to each mother sample picture set, where the preliminary category label of the sub sample picture is a calibration category label of the mother sample picture to which the sub sample picture belongs, and the size of the culling frame is smaller than that of the mother sample picture;

the training module 4 is configured to train a sample classification model corresponding to each sub-sample picture set by using, as training sample data, all sub-sample pictures included in each sub-sample picture set and a preliminary class label corresponding to each sub-sample picture;

the processing module 5 is configured to, for each sub-sample picture in each sub-sample picture set, input the sub-sample picture into each sample classification model respectively, so that each sample classification model outputs a corresponding classification result, and select a classification result with the largest frequency as a calibration label of the sub-sample picture.

Fig. 6 is a block diagram of the first building module in the present disclosure. As shown in fig. 6, as an alternative, the first building module 1 includes: a collecting unit 101, a resizing unit 102 and a constructing unit 103.

The collecting unit 101 is configured to collect a plurality of original sample pictures with calibration type marks.

The resizing unit 102 is configured to perform resizing processing on the original sample picture to unify the size of the original sample picture.

The constructing unit 103 is configured to use the original sample picture subjected to the resizing processing as a parent sample picture to construct a parent sample picture database.

In some embodiments, the mother sample picture is square and the selection frame is square; the ratio of the side length of the selection frame to the side length of the mother sample picture is equal to a first predetermined coefficient q, where 0 < q < 1. Further preferably, the first predetermined coefficient q satisfies: 0.5 ≤ q ≤ 0.7.

In some embodiments, in the process that the sampling module 2 samples the mother sample picture database multiple times to obtain a plurality of corresponding mother sample picture sets, the number of mother sample pictures included in each mother sample picture set is equal; the ratio of the number of mother sample pictures contained in one mother sample picture set to the number of mother sample pictures contained in the mother sample picture database is equal to a second predetermined coefficient p, where 0 < p < 1. Further preferably, the second predetermined coefficient p satisfies: 0.4 ≤ p ≤ 0.6.

In some embodiments, in the process that the extraction module 3 extracts a plurality of sub-sample pictures from each mother sample picture in the mother sample picture set by using the selection frame with a predetermined size, the number of sub-sample pictures extracted from one mother sample picture is a predetermined number N, where N is a positive integer and 3 ≤ N ≤ 10.

For the specific description of each module and unit in this embodiment, reference may be made to the corresponding contents in the foregoing method embodiments, and details are not repeated here.

Fig. 7 is a block diagram of another sample data acquiring system provided in the embodiment of the present disclosure. As shown in fig. 7, this sample data acquiring system may be used to implement the sample data acquiring method shown in fig. 3; it includes the first building module 1, the sampling module 2, the extraction module 3, the training module 4, and the processing module 5 shown in fig. 5, and further includes: a judging module 6a, a second constructing module 7a and a first control module 8a.

The judging module 6a is configured to, after the processing module determines the calibration class label of each sub-sample picture in each sub-sample picture set, judge whether the side length of the sub-sample picture is less than or equal to a predetermined length threshold.

The second construction module 7a is configured to, when the judgment module 6a judges that the side length of the sub-sample picture is greater than the predetermined length threshold, construct a new mother sample picture database by using the sub-sample picture with the calibration class mark as a new mother sample picture, and control the sampling module 2 to continue to execute corresponding processing based on the new mother sample picture database;

the first control module 8a is used for controlling the sample data acquisition system to stop working when the judging module 6a judges that the side length of the sub-sample picture is less than or equal to the preset length threshold.

For the specific description of each module in this embodiment, reference may be made to the corresponding content in the foregoing method embodiment, which is not described herein again.

Fig. 8 is a block diagram of another sample data acquiring system provided in the embodiment of the present disclosure. As shown in fig. 8, this sample data acquiring system may be used to implement the sample data acquiring method shown in fig. 4; it includes the first building module 1, the sampling module 2, the extraction module 3, the training module 4, and the processing module 5 shown in fig. 5, and further includes: a monitoring module 6b, a third constructing module 7b and a second control module 8b.

The monitoring module 6b is configured to monitor whether the cumulative number of loop executions of the processing module reaches a predetermined number threshold after the processing module determines the calibration class label of each sub-sample picture in each sub-sample picture set;

the third constructing module 7b is configured to, when the monitoring module 6b monitors that the cumulative number of loop executions has not reached the predetermined number threshold, construct a new mother sample picture database by using the sub-sample pictures with calibration class labels as new mother sample pictures, and control the sampling module 2 to continue to execute corresponding processing based on the new mother sample picture database;

the second control module 8b is configured to control the sample data obtaining system to stop working when the monitoring module 6b monitors that the accumulated number of times of loop execution reaches the threshold of the predetermined number of times.

As a specific application scenario, the predetermined task is, for example, a lesion detection task for fundus pictures.

First, labeled fundus pictures are used as original samples and processed with the sample data acquisition method or the sample data acquisition system provided by any one of the foregoing embodiments, so as to obtain a large number of small-size sub-sample pictures, each of which is labeled. The size of the finally obtained small-size sub-sample picture is assumed to be w × d.

Then, with the obtained large number of small-size sub-sample pictures as training sample data, a lesion detection model for the lesion detection task is generated based on a deep learning technique. The trained lesion detection model is assumed to be a binary classification model, which can be used to detect whether a lesion exists in an input picture.

Next, an unlabeled fundus picture to be processed is divided into a plurality of detection areas with a size of w × d, and the image corresponding to each detection area is input, as input data, into the previously trained lesion detection model to detect whether a lesion exists in each detection area.

When a lesion is detected in at least one detection area, it is recognized that a lesion exists in the fundus picture to be processed, and the lesion area is located according to the detection area(s) in which a lesion was detected; when no lesion is detected in any detection area, it is recognized that no lesion exists in the fundus picture to be processed.
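The detection procedure described in this application scenario can be sketched as follows. The sketch is illustrative only: has_lesion is a hypothetical callable standing for the trained binary classification model applied to one detection area, w and d are taken to be the width and height of a detection area, and the tiling is assumed to be non-overlapping with any partial areas at the right and bottom borders ignored. None of these details are specified in the disclosure.

```python
def detect_lesions(width, height, w, d, has_lesion):
    """Tile a width x height picture into w x d areas and collect positive areas.

    Returns (picture_has_lesion, hits), where hits lists the
    (left, top, w, d) rectangles in which the model reported a lesion.
    """
    hits = []
    for top in range(0, height - d + 1, d):
        for left in range(0, width - w + 1, w):
            if has_lesion(left, top, w, d):  # binary classification per area
                hits.append((left, top, w, d))
    return bool(hits), hits


def fake_model(left, top, w, d):
    # Stand-in for the trained model: reports a lesion in the area
    # that contains the pixel at (x=40, y=20).
    return left <= 40 < left + w and top <= 20 < top + d


found, areas = detect_lesions(width=64, height=64, w=16, d=16, has_lesion=fake_model)
print(found, areas)
```

The returned rectangles give the located lesion areas; an empty list corresponds to the case in which no lesion is recognized in the picture.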

The embodiment of the present disclosure further provides a server, where the server includes the sample data acquisition system provided in the foregoing embodiment.

An embodiment of the present disclosure further provides a server, where the server includes: one or more processors and storage; the storage device stores one or more programs thereon, and when the one or more programs are executed by the one or more processors, the one or more processors implement the sample data acquisition method provided in the foregoing embodiment.

The embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed, implements the sample data obtaining method provided in the foregoing embodiment.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods disclosed above, and the functional modules/units in the apparatus, may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims

1. A sample data acquisition method is characterized by comprising the following steps:

for each mother sample picture set, extracting a plurality of sub sample pictures from each mother sample picture in the mother sample picture set by using a selecting frame with a preset size, and endowing each sub sample picture with a preliminary class mark to obtain a sub sample picture set corresponding to each mother sample picture set, wherein the preliminary class mark of the sub sample picture is a calibration class mark of the mother sample picture to which the sub sample picture belongs, the size of the selecting frame is smaller than that of the mother sample picture, and the size of each sub sample picture in the sub sample picture set is smaller than that of the corresponding mother sample picture;

2. The method of claim 1, wherein the mother sample picture is square in shape;

the shape of the selecting frame is square;

3. The method according to claim 2, wherein after the step of inputting the sub-sample picture into each sample classification model for each sub-sample picture in each sub-sample picture set, so that each sample classification model outputs a corresponding classification result, and selecting the classification result with the largest frequency as the calibration class mark of the sub-sample picture, the method further comprises:

4. The method according to claim 2, wherein after the step of inputting the sub-sample picture into each sample classification model for each sub-sample picture in each sub-sample picture set, so that each sample classification model outputs a corresponding classification result, and selecting the classification result with the largest frequency as the calibration class mark of the sub-sample picture, the method further comprises:

5. The method according to claim 2, wherein the first predetermined coefficient q satisfies: 0.5 ≤ q ≤ 0.7.

6. The method according to claim 1, wherein in the step of sampling the mother sample picture database for multiple times to obtain a plurality of corresponding mother sample picture sets, each mother sample picture set contains an equal number of mother sample pictures;

7. The method according to claim 6, wherein the second predetermined coefficient p satisfies: 0.4 ≤ p ≤ 0.6.

8. The method according to claim 1, wherein in the step of extracting a plurality of sub-sample pictures from each of the mother sample pictures in the mother sample picture set by using a cull box with a predetermined size, the number of sub-sample pictures extracted from one mother sample picture is a predetermined number N;

9. The method according to any one of claims 1 to 8, wherein the step of constructing the database of mother sample pictures comprises:

10. A sample data acquisition system, comprising:

the extraction module is used for extracting a plurality of sub-sample pictures from each mother sample picture in the mother sample picture set by adopting a selection frame with a preset size aiming at each mother sample picture set, and endowing each sub-sample picture with a preliminary class mark to obtain the sub-sample picture set corresponding to each mother sample picture set, wherein the preliminary class mark of the sub-sample picture is a calibration class mark of the mother sample picture to which the sub-sample picture belongs, the size of the selection frame is smaller than that of the mother sample picture, and the size of each sub-sample picture in the sub-sample picture set is smaller than that of the corresponding mother sample picture;

11. The system of claim 10, wherein the parent sample picture is square in shape;

the shape of the selection frame is square;

12. The system of claim 11, further comprising:

13. The system of claim 11, further comprising:

14. The system according to claim 11, wherein the first predetermined coefficient q satisfies: 0.5 ≤ q ≤ 0.7.

15. The system according to claim 10, wherein in the process of the sampling module performing multiple sampling on the mother sample picture database to obtain a plurality of corresponding mother sample picture sets, each mother sample picture set contains an equal number of mother sample pictures;

16. The system according to claim 15, wherein the second predetermined coefficient p satisfies: 0.4 ≤ p ≤ 0.6.

17. The system according to claim 10, wherein in the process of extracting a plurality of sub-sample pictures from each of the mother sample pictures in the mother sample picture set by the extracting module using a cull box with a predetermined size, the number of sub-sample pictures extracted from one mother sample picture is a predetermined number N;

18. The system according to any one of claims 10-17, wherein the first building block comprises:

19. A server, comprising:

one or more processors;

a storage device having one or more programs stored thereon;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-9.

20. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.