CN112632179A

CN112632179A - Model construction method and device, storage medium and equipment

Info

Publication number: CN112632179A
Application number: CN201910904555.1A
Authority: CN
Inventors: 韩旭红
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2021-04-09
Anticipated expiration: 2039-09-24
Also published as: CN112632179B

Abstract

The model construction method, device, storage medium and processor provided by the invention acquire online data; perform crowdsourced labeling on the online data to obtain a labeling result, the labeling result comprising labeled training data and labeling status information; acquire the labeling quality of the labeling result; select a corresponding machine learning model and corresponding model training parameters according to the current task type and the labeling quality of the labeling result; and train the corresponding machine learning model with the labeled training data according to the corresponding model training parameters to obtain a trained prediction model. By combining online data with the modeling configuration process, the method configures the model and its training parameters automatically, substantially lowers the modeling threshold for non-professionals, turns the whole modeling process into one in which a model is obtained as soon as data is available, improves the convenience of model construction, and reduces the time cost of modeling personnel.

Description

Model construction method and device, storage medium and equipment

Technical Field

The present invention relates to the field of data processing, and more particularly, to a model construction method, apparatus, storage medium, and device.

Background

Visual modeling is a way of thinking about problems using models organized around real-world ideas; it provides a mechanism for viewing the system under development from different perspectives.

At present, many visual modeling tools have appeared on the market, but most of them remain at the demo level. The main modeling work still has to be handed to modeling personnel and carried out manually, which makes the modeling process cumbersome and inconvenient. After a model goes online, if the online data changes, modeling personnel must retrain or optimize the model to adapt to the changed data. When the modeling personnel are not specialists and are unfamiliar with model-related matters such as feature selection and parameter setting, the modeling work runs into many obstacles.

Therefore, a practical and effective model construction scheme is urgently needed at present to improve the convenience of model construction.

Disclosure of Invention

In view of the above, the present invention has been made to provide a model construction method, apparatus, storage medium, and processor that overcome or at least partially solve the above-mentioned problems.

In order to achieve the purpose, the invention provides the following technical scheme:

a model building method, comprising:

acquiring online data;

carrying out crowdsourcing labeling on the online data to obtain a labeling result; the labeling result comprises: the marked training data and the marked condition information;

acquiring the labeling quality of the labeling result;

selecting a corresponding machine learning model and a corresponding model training parameter according to the current task type and the labeling quality of the labeling result;

and training the corresponding machine learning model by using the marked training data according to the corresponding model training parameters to obtain a trained prediction model.

Preferably, the obtaining of the labeling quality of the labeling result includes:

according to the labeling result, acquiring the individual labeling accuracy of each labeling person and the overall labeling accuracy of all labeling persons;

when the individual marking accuracy of the marking personnel is lower than the overall marking accuracy and the deviation of the individual marking accuracy from the overall marking accuracy is larger than a preset deviation threshold, acquiring the number of the partial marking personnel of which the individual marking accuracy is lower than a first accuracy threshold;

and determining the labeling quality of the labeling result according to the number of the part of the labeling personnel.

Preferably, after the obtaining of the labeling quality of the labeling result, the method further includes:

and adjusting, according to the labeling quality of the labeling result, the ratio between the number of annotators selected to label each item and the number of annotators who must agree for a label to be recovered.

Preferably, the selecting the corresponding machine learning model and the corresponding model training parameter according to the current task type and the labeling quality of the labeling result includes:

determining the labeling difficulty of the on-line data according to the labeling quality of the labeling result;

and selecting a corresponding machine learning model and a corresponding model training parameter according to the current task type and the labeling difficulty of the online data.

Preferably, after the training of the corresponding machine learning model is performed according to the corresponding model training parameters by using the labeled training data to obtain a trained prediction model, the method further includes:

acquiring new online data;

carrying out crowdsourcing labeling on the new online data to obtain a new labeling result;

predicting the new online data by using the trained prediction model to obtain a prediction result;

and determining whether the trained prediction model meets the expected effect or not according to the new labeling result and the prediction result.

Preferably, after determining that the trained predictive model is in accordance with the expected effect, the method further comprises:

acquiring entropy distribution information of the new online data according to the new labeling result and the prediction result;

setting a prediction result alarm condition according to the entropy value distribution information of the new online data;

monitoring subsequent prediction results of the prediction model;

and when the subsequent prediction result of the prediction model meets the prediction result alarm condition, selecting active learning data to carry out incremental training on the prediction model to obtain the optimized prediction model.

Preferably, the monitoring the later prediction result of the prediction model includes:

acquiring a subsequent prediction result of the prediction model according to a preset time interval;

acquiring entropy information of the subsequent prediction result according to the subsequent prediction result;

and monitoring entropy information of the subsequent prediction result to determine whether a preset entropy alarm condition is reached.

A model building apparatus comprising:

the crowdsourcing processing unit is used for acquiring online data; carrying out crowdsourcing labeling on the online data to obtain a labeling result; the labeling result comprises: the marked training data and the marked condition information; acquiring the labeling quality of the labeling result;

the algorithm processing unit is used for selecting a corresponding machine learning model and a corresponding model training parameter according to the current task type and the labeling quality of the labeling result; and training the corresponding machine learning model by using the marked training data according to the corresponding model training parameters to obtain a trained prediction model.

A storage medium comprising a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the model construction method as described above when the program is run.

A model building device comprises at least one processor, at least one memory connected with the processor, and a bus; the processor and the memory complete mutual communication through a bus; the processor is used for calling the program instructions in the memory to execute the model building method.

By means of the above technical scheme, the model construction method, device, storage medium and processor provided by the invention acquire the labeling quality of the labeling result after crowdsourced labeling of the online data, and select a corresponding machine learning model and corresponding model training parameters according to the current task type and the labeling quality of the labeling result. Online data and the modeling configuration process are thereby combined, the model and its training parameters are configured automatically, the modeling threshold for non-professionals is substantially lowered, the whole modeling process becomes one in which a model is obtained as soon as data is available, the convenience of model construction is improved, and the time cost of modeling personnel is reduced.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flow chart of a model construction method provided in an embodiment of the present application;

FIG. 2 is a flowchart of an annotation quality obtaining scheme provided in an embodiment of the present application;

FIG. 3 is a flowchart of a model effect testing scheme provided by an embodiment of the present application;

FIG. 4 is a flow chart of a model monitoring optimization scheme provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a model building apparatus provided in an embodiment of the present application;

FIG. 6 is another schematic structural diagram of a model building apparatus according to an embodiment of the present application;

FIG. 7 is a business process flow diagram of a knowledge platform provided by an embodiment of the present application;

FIG. 8 is a flow chart of a service process of a crowdsourcing processing unit according to an embodiment of the present application;

FIG. 9 is a flowchart of a service processing of an algorithm processing unit according to an embodiment of the present application;

FIG. 10 is a flowchart of a service process of a monitoring optimization unit according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Referring to fig. 1, fig. 1 is a flowchart of a model building method according to an embodiment of the present disclosure.

As shown in fig. 1, the model building method of the present embodiment may include:

S101: Online data is acquired.

S102: Crowdsourced labeling is performed on the online data to obtain a labeling result.

The labeling result includes at least the labeled training data and labeling status information.

The labeling status information mainly covers the number of annotators who labeled each item to be labeled and the label each annotator assigned to it. In other words, it records, for each data item, the list of annotators who labeled it and, for each annotator, which items that annotator labeled.

S103: The labeling quality of the labeling result is acquired.

The labeling quality is analyzed from the labeled training data and the labeling status information. For example, an auditor can sample the labeling results for quality inspection and feed back the labeling quality.

S104: A corresponding machine learning model and corresponding model training parameters are selected according to the current task type and the labeling quality of the labeling result.

The current task type may be set by a user in advance, or may be determined according to the type of online data.

In an example, step S104 may include:

A1, determining the labeling difficulty of the online data according to the labeling quality of the labeling result;

The labeling quality of the labeling result reflects, to a certain extent, how difficult the online data is to label, so the labeling difficulty of the online data can be determined from the labeling quality. Furthermore, the correspondence between labeling quality and labeling difficulty can be preset.

A2, selecting a corresponding machine learning model and corresponding model training parameters according to the current task type and the labeling difficulty of the online data.

In this example, a plurality of machine learning models available for each task type may be preconfigured, each with a corresponding labeling difficulty. Once the machine learning models available for the current task type are determined, the corresponding model can be selected according to the labeling difficulty of the online data.

Specifically, different task types may correspond to different types of machine learning models. After the type of machine learning model suitable for the current task type is determined, the corresponding machine learning model and the corresponding model training parameters can be selected according to the labeling difficulty of the online data. There may be multiple machine learning models of the same type.

The labeling difficulty and the model training parameters also have a preset correspondence.

For example, in the case of deep learning, the more difficult the task, the more complex the network structure that is configured.
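The preset correspondences described in A1 and A2 can be pictured as two lookup steps. The following is only a minimal sketch: the difficulty levels, the quality cut-offs, the task type name, the model names and the training parameters are all hypothetical placeholders, since the text does not specify any of them.

```python
from typing import Dict, Tuple


def difficulty_from_quality(quality: float) -> str:
    """Map a labeling-quality score in [0, 1] to a coarse difficulty level.

    The cut-offs 0.9 and 0.7 are illustrative assumptions, not values from the text.
    """
    if quality >= 0.9:
        return "easy"
    if quality >= 0.7:
        return "medium"
    return "hard"


# Hypothetical preset table: (task type, difficulty) -> (model name, training parameters).
MODEL_TABLE: Dict[Tuple[str, str], Tuple[str, dict]] = {
    ("text_classification", "easy"): ("logistic_regression", {"max_iter": 100}),
    ("text_classification", "medium"): ("small_cnn", {"layers": 2, "epochs": 5}),
    ("text_classification", "hard"): ("deep_cnn", {"layers": 6, "epochs": 20}),
}


def select_model(task_type: str, quality: float) -> Tuple[str, dict]:
    """Pick a model and its training parameters from task type and labeling quality."""
    difficulty = difficulty_from_quality(quality)
    return MODEL_TABLE[(task_type, difficulty)]
```

In this sketch a harder task maps to a deeper network, mirroring the deep learning example above; a real table would be filled in per task type by whoever configures the platform.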

S105: The corresponding machine learning model is trained with the labeled training data according to the corresponding model training parameters to obtain a trained prediction model.

According to the model construction method provided by this embodiment, the labeling quality of the labeling result is acquired after crowdsourced labeling of the online data, and a corresponding machine learning model and corresponding model training parameters are selected according to the current task type and the labeling quality of the labeling result. Online data and the modeling configuration process are thereby combined, the model and its training parameters are configured automatically, the modeling threshold for non-professionals is substantially lowered, the whole modeling process becomes one in which a model is obtained as soon as data is available, the convenience of model construction is improved, and the time cost of modeling personnel is reduced.

In specific application, the labeling quality of the labeling result can be obtained by analyzing the labeling accuracy of the labeling personnel. Correspondingly, the invention also provides a labeling quality obtaining scheme, so as to obtain the labeling quality of the labeling result according to the labeling result.

Referring to fig. 2, fig. 2 is a flowchart of a scheme for obtaining annotation quality according to an embodiment of the present application.

As shown in fig. 2, the annotation quality obtaining scheme of this embodiment may include:

S201: According to the labeling result, the individual labeling accuracy of each annotator and the overall labeling accuracy of all annotators are acquired.

The individual labeling accuracy of each annotator is computed from the labeling result, and the overall labeling accuracy of all annotators is obtained. An annotator's individual labeling accuracy can be obtained by a labeling auditor who reviews and evaluates the current labeling result, or by data analysis software that compares the labeling result against a preset standard labeling result.

S202: When an annotator's individual labeling accuracy is lower than the overall labeling accuracy and deviates from it by more than a preset deviation threshold, the number of annotators whose individual labeling accuracy is lower than a first accuracy threshold is acquired.

That is, if the overall labeling accuracy is high while the individual labeling accuracy of particular annotators is low, the number t of annotators whose individual labeling accuracy is lower than the first accuracy threshold is acquired.

S203: The labeling quality of the labeling result is determined according to the number of these annotators.

Wherein, step S203 may include:

B1, acquiring the total number N of annotators in the crowdsourced labeling, the number n of annotators selected per item, and the number m of annotators who must agree for recovery.

The number of selected annotators is the number of annotators chosen to label each item of online data; the number of agreeing annotators is the number of annotators whose labels must be consistent for a labeling result to be recovered. Here, "recovery" means that the labeled result is adopted.

B2, calculating the labeling quality estimate p^o of the labeling result using the following formula:

Where N is the total number of annotators, n is the number of selected annotators, m is the number of annotators who must agree for recovery, t is the number of annotators below the first accuracy threshold, and threshold1 is the first accuracy threshold.

The values of m and n can be chosen according to task difficulty, labeling speed and the degree of inconsistency among annotators. If the task is simple and the labeling quality is high, a lower recovery ratio m:n can be set, for example three annotators per item with two in agreement required for recovery, which is enough to ensure that good-quality labeled data is recovered. If the task is difficult and the labeling quality is low, a higher recovery ratio m:n is needed, for example seven annotators per item with five in agreement required for recovery; if three annotators with two in agreement were still used, the quality of the recovered labeled data might be low.

The term "seven-person labeling and five-person consistency recovery" specifically means that seven persons label the same data, and when the labeled contents of the five persons of the data are consistent, the labeling results of the five persons who reach the consistency are adopted.
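The "n annotators, m in agreement" recovery rule can be sketched as a simple vote count. This is an illustrative sketch only; the function name `recover_label` and the choice to return `None` when no label reaches m votes are assumptions, not details taken from the text.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def recover_label(labels: Sequence[Hashable], m: int) -> Optional[Hashable]:
    """Return the label given by at least m of the annotators, or None.

    `labels` holds the n labels assigned to one data item, one per annotator.
    If m is at most half of n, two labels could both reach m votes; in that
    case the most frequent label wins.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= m else None


# Three annotators, two must agree.
easy_item = recover_label(["spam", "spam", "ham"], m=2)

# Seven annotators, five must agree.
hard_item = recover_label(["a", "b", "a", "a", "b", "a", "a"], m=5)
```

An item for which `recover_label` returns `None` is simply not adopted into the training data under this sketch.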

The first accuracy threshold, threshold1, is a threshold on an annotator's individual labeling accuracy: it is the lowest individual accuracy that is acceptable. For example, an annotator whose individual labeling accuracy is 50% on a binary classification task is doing no better than pure guessing, and such an annotator cannot be accepted.
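Steps S201 and S202 can be sketched as follows. This is a hedged reading of the text: the function name, the dictionary input, and the decision to return 0 when no annotator deviates from the overall accuracy by more than the deviation threshold are assumptions. The formula that turns t into the quality estimate is not reproduced in the text and is therefore not implemented here.

```python
from typing import Dict


def count_low_accuracy_annotators(
    individual_accuracy: Dict[str, float],
    overall_accuracy: float,
    deviation_threshold: float,
    threshold1: float,
) -> int:
    """Return t, the number of annotators whose accuracy is below threshold1.

    t is only counted when at least one annotator falls below the overall
    accuracy by more than deviation_threshold; otherwise 0 is returned
    (an assumption made for this sketch).
    """
    has_outlier = any(
        overall_accuracy - acc > deviation_threshold
        for acc in individual_accuracy.values()
    )
    if not has_outlier:
        return 0
    return sum(1 for acc in individual_accuracy.values() if acc < threshold1)


accuracies = {"ann_a": 0.95, "ann_b": 0.90, "ann_c": 0.50, "ann_d": 0.55}
t = count_low_accuracy_annotators(
    accuracies, overall_accuracy=0.725, deviation_threshold=0.1, threshold1=0.6
)
```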

In addition, in order to avoid the occurrence of cheating of a large number of marking personnel, quality inspection can be performed by combining a manual checking mode while the marking quality is determined, the marking quality of a marking result is fed back, and the marking quality is obtained by combining two modes, so that the reliability of the marking quality is improved.

Whether an annotator is cheating can be judged on data whose class distribution is skewed, by checking whether the annotator's accuracy on the more frequent categories is higher than on the less frequent categories. The larger the gap, the stronger the suspicion of cheating. Of course, the gap may also arise because the annotator understands the annotation specification differently, in which case it is necessary to communicate with the annotator to improve labeling quality.

An annotator's labeling accuracy in a given category equals the number of items in that category the annotator labeled correctly divided by the total number of items in that category, which is similar to the concept of recall.
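The per-category check can be sketched as below, reading the per-category accuracy as a recall-like ratio (correctly labeled items of a category over all items of that category). The reference labels `gold` are assumed to come from a standard labeling result or an auditor; that input and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def per_category_accuracy(gold: Sequence[str], annotated: Sequence[str]) -> Dict[str, float]:
    """For each reference category, the fraction of its items the annotator labeled correctly."""
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for g, a in zip(gold, annotated):
        total[g] += 1
        if g == a:
            correct[g] += 1
    return {category: correct[category] / total[category] for category in total}


# Skewed data: 8 "pos" items and 2 "neg" items.
gold = ["pos"] * 8 + ["neg"] * 2
# An annotator who always picks the majority class.
lazy_annotator = ["pos"] * 10

scores = per_category_accuracy(gold, lazy_annotator)
gap = scores["pos"] - scores["neg"]
```

An annotator who always answers with the majority class scores perfectly on that class and zero on the minority class, which is the kind of gap the text treats as suspicious.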

In an example, after the labeling quality of the labeling result is obtained, the invention can also adjust the proportion of the number n of the label selecting people to the number m of the consistent recycling people according to the labeling quality of the labeling result.

For example, when p^o is greater than threshold2, the number of people in the recovery ratio is increased, so that annotators with poor labeling quality do not degrade the labeling quality. Here, threshold2 is the second accuracy threshold.

The second accuracy threshold, threshold2, is a threshold on the labeling quality itself. It bounds the labeling quality and is set according to the user's labeling quality requirements. For example, threshold2 may be set lower for more complex tasks and higher for simpler tasks.

In general, threshold1 < threshold2. The labeling quality estimate p^o is approximately equal to the average labeling quality of all annotators, and threshold1 corresponds to the lower bound of labeling quality.

According to the annotation quality acquisition scheme provided by the embodiment, the annotation quality of the annotation result is determined through statistics of the annotation accuracy of crowdsourcing annotators, so that the time cost of manual examination can be fully reduced; certainly, the method can also be combined with manual review, the labeling quality of crowdsourcing labeling personnel is autonomously improved, and meanwhile, the data labeling quality is improved while part of manual review work is saved.

In specific application, after the prediction model is preliminarily trained, the prediction effect of the model needs to be tested to verify whether the prediction effect of the trained prediction model can reach an expected level. Correspondingly, the invention also provides a model effect test scheme to test whether the prediction effect of the trained prediction model is in accordance with the expectation.

Referring to fig. 3, fig. 3 is a flowchart of a model effect testing scheme according to an embodiment of the present application.

As shown in fig. 3, the model effect testing scheme of the present embodiment may include:

S301: New online data is acquired.

The new online data is also online data but differs from the online data used for model training; it may be a batch of data randomly drawn from the online data to serve as test data.

S302: Crowdsourced labeling is performed on the new online data to obtain a new labeling result.

S303: The new online data is predicted with the trained prediction model to obtain a prediction result.

S304: Whether the trained prediction model meets the expected effect is determined according to the new labeling result and the prediction result.

Under normal conditions, if the consistency of the new labeling result and the prediction result is higher, the trained prediction model is determined to be in accordance with the expected effect, otherwise, the trained prediction model is determined to be not in accordance with the expected effect.
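The consistency check in S304 can be sketched as an agreement rate between the crowdsourced labels and the model predictions. The acceptance level `min_agreement` is a hypothetical parameter; the text only says that higher consistency indicates the expected effect is met.

```python
from typing import Sequence


def agreement_rate(crowd_labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Fraction of test items on which the model prediction matches the crowd label."""
    if len(crowd_labels) != len(predictions):
        raise ValueError("label and prediction sequences must have the same length")
    if not crowd_labels:
        raise ValueError("at least one item is required")
    matches = sum(1 for y, p in zip(crowd_labels, predictions) if y == p)
    return matches / len(crowd_labels)


def meets_expectation(
    crowd_labels: Sequence[str], predictions: Sequence[str], min_agreement: float = 0.9
) -> bool:
    """True when the agreement rate reaches the (assumed) acceptance level."""
    return agreement_rate(crowd_labels, predictions) >= min_agreement


rate = agreement_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"])
```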

If the trained prediction model meets the expected effect, it proceeds to the go-live stage. If it does not, it is judged whether the model is over-fitted or under-fitted: if over-fitted, training data is supplemented; if under-fitted, the complexity of the model is adjusted or the model is replaced. At the same time, the influence of data labeling quality problems is checked, and the model is optimized until it meets the expected effect.

The model effect testing scheme provided by this embodiment uses new online data as test data, predicts on the test data with the prediction model, performs crowdsourced labeling on the same test data, and compares the prediction result with the labeling result. Whether the trained prediction model meets the expected effect is determined from the consistency between the new labeling result and the prediction result, thereby ensuring the accuracy of the prediction model.

In specific application, once the trained prediction model is determined to meet the expected effect, it is put into online use. During use, the online data may change over time or as data sources change, and when it does, the prediction effect of the model may degrade. The prediction effect therefore needs to be monitored in real time, and the model needs to be optimized promptly once a problem is found, so as to ensure its accuracy. Correspondingly, the invention also provides a model monitoring optimization scheme to monitor and optimize the prediction model.

Referring to fig. 4, fig. 4 is a flowchart of a model monitoring optimization scheme according to an embodiment of the present disclosure.

As shown in fig. 4, the model monitoring optimization scheme of the present embodiment may include:

S401: Entropy distribution information of the new online data is acquired according to the new labeling result and the prediction result.

The entropy value, also simply called entropy, measures uncertainty: the larger the entropy, the greater the uncertainty. Model monitoring and self-optimization are performed using entropy as the indicator.

The formula of the entropy value is:

where p represents probability and entropy represents the entropy value.
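For illustration, the entropy of a single prediction can be computed as below, assuming the standard Shannon form, the negative sum of p·log(p) over the predicted class probabilities. The optional normalization by log(K) for K classes is an additional assumption, made here so that values fall in the 0 to 1 range; the function name and signature are hypothetical.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float], normalize: bool = True) -> float:
    """Shannon entropy of a predicted class distribution.

    With normalize=True the value is divided by log(K) for K classes,
    so it lies in [0, 1]. Zero probabilities contribute nothing.
    """
    h = sum(-p * math.log(p) for p in probs if p > 0)
    if normalize and len(probs) > 1:
        h /= math.log(len(probs))
    return h


confident = entropy([0.9, 0.1])
uncertain = entropy([0.5, 0.5])
```

A confident prediction yields a low entropy and an evenly split prediction yields the maximum, which is why entropy can serve as the monitoring indicator.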

After the prediction model passes through the model test and achieves the expected effect, the entropy distribution of the test data can be counted to obtain the entropy distribution information of the test data.

S402: A prediction result alarm condition is set according to the entropy distribution information of the new online data.

From the entropy distribution information of the new online data, the prediction accuracy of the prediction model in different entropy intervals can be calculated. An entropy threshold e is then determined according to a preset qualification condition, together with the proportion c of data under the relevant category whose entropy is greater than e.

For example, the entropy range can be divided into ten intervals: 0 to 0.1, 0.1 to 0.2, …, 0.9 to 1. Each prediction has an entropy value that falls into exactly one of these intervals, so each interval corresponds to a subset of the data. Every item has a predicted value (from the model) and a labeled value (from manual labeling); a prediction counts as correct when the two agree and as wrong otherwise. This gives the accuracy of the data in each entropy interval, and the entropy threshold e is set according to these accuracies. The data proportion c is then the share of data whose entropy exceeds the chosen threshold e.

In addition, a floating threshold, threshold3, also needs to be set: if the proportion of data under a predicted category whose entropy exceeds the entropy threshold e is greater than c × threshold3, the prediction result alarm condition is deemed met and a monitoring alarm is triggered.

The floating threshold threshold3 is likewise set flexibly according to the labeling quality requirements.
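The alarm setup of S402 can be sketched end to end. Several details here are assumptions: entropies are taken to lie in [0, 1], the threshold e is chosen as the lower edge of the first interval whose accuracy drops below a hypothetical `min_accuracy`, and the per-category split mentioned in the text is omitted for brevity.

```python
from typing import List, Sequence, Tuple


def accuracy_by_entropy_bin(
    entropies: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (lower edge, accuracy, count) for each non-empty entropy interval."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for h, ok in zip(entropies, correct):
        idx = min(int(h * n_bins), n_bins - 1)  # an entropy of exactly 1.0 goes in the last bin
        counts[idx] += 1
        hits[idx] += int(ok)
    return [(i / n_bins, hits[i] / counts[i], counts[i]) for i in range(n_bins) if counts[i]]


def choose_entropy_threshold(bins: List[Tuple[float, float, int]], min_accuracy: float) -> float:
    """Lower edge of the first interval whose accuracy is below min_accuracy (1.0 if none)."""
    for lower, acc, _ in bins:
        if acc < min_accuracy:
            return lower
    return 1.0


def high_entropy_ratio(entropies: Sequence[float], e: float) -> float:
    """Share of items whose entropy exceeds e."""
    return sum(1 for h in entropies if h > e) / len(entropies)


def alarm_triggered(new_entropies: Sequence[float], e: float, c: float, threshold3: float) -> bool:
    """Alarm when the high-entropy share of new predictions exceeds c * threshold3."""
    return high_entropy_ratio(new_entropies, e) > c * threshold3


# Test-time statistics: entropy of each prediction and whether it matched the crowd label.
test_entropies = [0.05, 0.05, 0.15, 0.85, 0.95, 0.95]
test_correct = [True, True, True, False, True, False]

bins = accuracy_by_entropy_bin(test_entropies, test_correct)
e = choose_entropy_threshold(bins, min_accuracy=0.8)
c = high_entropy_ratio(test_entropies, e)
```

With `e` and `c` fixed at test time, later batches of predictions only need their entropies computed and passed to `alarm_triggered`.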

S403: Subsequent prediction results of the prediction model are monitored.

In one example, step S403 may include:

C1, obtaining a subsequent prediction result of the prediction model according to a preset time interval;

C2, acquiring entropy information of the subsequent prediction result according to the subsequent prediction result;

C3, monitoring entropy information of the subsequent prediction result, and judging whether a preset entropy alarm condition is reached.

S404: When the subsequent prediction result of the prediction model meets the prediction result alarm condition, active learning data is selected to carry out incremental training on the prediction model to obtain the optimized prediction model.

After the preset entropy alarm condition is reached, online data under each category predicted by the model can be selected according to the entropy threshold value e, data with preset proportion (needing to be set according to the model) is randomly sampled to serve as active learning data, and incremental training is carried out on the model to optimize the prediction model.
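The selection of active learning data can be sketched as follows. The reading taken here, that candidates are the items whose entropy exceeds e within each predicted category and that a fixed proportion of them is sampled at random, is an interpretation of the text; the tuple layout, the function name and the fixed random seed are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def select_active_learning_data(
    items: Sequence[Tuple[str, str, float]],  # (item id, predicted category, entropy)
    e: float,
    proportion: float,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Randomly sample a proportion of high-entropy items within each predicted category."""
    rng = random.Random(seed)
    by_category: Dict[str, List[str]] = {}
    for item_id, category, h in items:
        if h > e:
            by_category.setdefault(category, []).append(item_id)
    selected: Dict[str, List[str]] = {}
    for category, ids in by_category.items():
        # Take at least one item per category, and never more than are available.
        k = min(len(ids), max(1, round(len(ids) * proportion)))
        selected[category] = rng.sample(ids, k)
    return selected


online_items = [
    ("a1", "a", 0.90), ("a2", "a", 0.90), ("a3", "a", 0.90), ("a4", "a", 0.90),
    ("a5", "a", 0.10),
    ("b1", "b", 0.95), ("b2", "b", 0.95),
]
picked = select_active_learning_data(online_items, e=0.8, proportion=0.5)
```

The sampled items would then be sent for labeling and used for the incremental training step.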

And after the model optimization is completed, selecting a new batch of test data, and evaluating whether the optimized prediction model meets the expectation. If the prediction effect of the optimized prediction model is not improved, judging whether the data distribution has larger variation or not by extracting the model characteristics and comparing and analyzing the data difference, and if so, selecting random data to supplement training data; if not, indicating that the current prediction model is insufficient for the current task, and recommending model replacement, model complexity adjustment and model parameter modification.

The model monitoring optimization scheme provided by the embodiment monitors the prediction model in real time after the prediction model is used online, once the prediction effect of the model is found to be poor, the self-optimization process of the prediction model is started, when the data distribution changes along with time or the user data type is changed, the problem can be found at the first time, the model is automatically optimized in time, the online effect is guaranteed, and the problem that the effect of the model is poor on some specific data is avoided.

The embodiment of the present invention further provides a model building apparatus, where the model building apparatus is configured to implement the model building method provided in the embodiment of the present invention, and the technical content of the model building apparatus described below may be referred to in correspondence with the technical content of the model building method described above, and the same or similar parts are not described again.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a model building apparatus according to an embodiment of the present disclosure.

The model building apparatus of this embodiment is used for implementing the model building method of the foregoing embodiment, and as shown in fig. 5, the apparatus at least includes: a crowdsourcing processing unit 100 and an algorithm processing unit 200.

The crowdsourcing processing unit 100 is configured to: obtain online data; carry out crowdsourcing labeling on the online data to obtain a labeling result, the labeling result comprising the labeled training data and the labeling condition information; and acquire the labeling quality of the labeling result.

The crowdsourcing processing unit 100 may include a crowdsourcing labeling subunit and a quality control subunit.

And the crowdsourcing labeling subunit is used for carrying out crowdsourcing labeling on the online data to obtain a labeling result.

And the quality control subunit is used for acquiring the labeling quality of the labeling result.

In one example, the quality control subunit is specifically configured to: obtain, according to the labeling result, the individual labeling accuracy of each labeling person and the overall labeling accuracy of all labeling personnel; when the individual labeling accuracy of a labeling person is lower than the overall labeling accuracy and its deviation from the overall labeling accuracy is greater than a preset deviation threshold, obtain the number of labeling personnel whose individual labeling accuracy is lower than a first accuracy threshold; and determine the labeling quality of the labeling result according to that number.
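One possible reading of this quality check is sketched below. The text does not state how a label is judged correct or how the count of low-accuracy personnel is mapped to a quality value, so the per-person correct and total counts, the function name `labeling_quality`, and the score `1 - low_count / total_personnel` are assumptions made purely for illustration.

```python
def labeling_quality(correct_counts, total_counts,
                     deviation_threshold, first_accuracy_threshold):
    """Return (individual accuracies, overall accuracy, low-accuracy personnel, quality).

    correct_counts / total_counts: dicts mapping a labeling person to the number
    of correct labels and the number of labels submitted (assumed inputs).
    """
    individual = {p: correct_counts[p] / total_counts[p] for p in total_counts}
    overall = sum(correct_counts.values()) / sum(total_counts.values())

    # Does anyone fall below the overall accuracy by more than the deviation threshold?
    deviating = [p for p, acc in individual.items()
                 if acc < overall and overall - acc > deviation_threshold]

    # If so, count the personnel below the first accuracy threshold.
    low = []
    if deviating:
        low = sorted(p for p, acc in individual.items()
                     if acc < first_accuracy_threshold)

    # Assumed mapping: the fewer low-accuracy personnel, the higher the quality.
    quality = 1.0 - len(low) / len(individual)
    return individual, overall, low, quality
```

With three labeling personnel at 95%, 90% and 50% accuracy, a deviation threshold of 0.2 and a first accuracy threshold of 0.6, only the third person is counted and the assumed quality score is 2/3.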

In an example, the crowdsourcing labeling subunit is further configured to: adjust, according to the labeling quality of the labeling result, the ratio between the number of labeling personnel selected to label each item and the number of personnel whose labels must agree for the result to be recovered (consistent recovery).

The algorithm processing unit 200 is configured to select a corresponding machine learning model and corresponding model training parameters according to the current task type and the labeling quality of the labeling result; and to train the corresponding machine learning model with the labeled training data according to the corresponding model training parameters to obtain a trained prediction model.

In one example, the selecting, according to the current task type and the labeling quality of the labeling result, a corresponding machine learning model and a corresponding model training parameter includes: determining the labeling difficulty of the on-line data according to the labeling quality of the labeling result; and selecting a corresponding machine learning model and a corresponding model training parameter according to the current task type and the labeling difficulty of the online data.

In an example, the crowdsourcing labeling subunit is further configured to: acquire new online data; and carry out crowdsourcing labeling on the new online data to obtain a new labeling result.

Correspondingly, the algorithm processing unit 200 is further configured to: predicting the new online data by using the trained prediction model to obtain a prediction result; and determining whether the trained prediction model meets the expected effect or not according to the new labeling result and the prediction result.

The model building device provided by this embodiment acquires the labeling quality of the labeling result after crowdsourcing labeling of the online data, and selects the corresponding machine learning model and the corresponding model training parameters according to the current task type and the labeling quality of the labeling result. The online data and the modeling configuration process are thereby combined, and the model and its training parameters are configured automatically. This substantially lowers the modeling threshold for non-professionals, makes it possible to obtain a model as soon as data is available, improves the convenience of model building, and reduces the time cost for modeling personnel.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a model building apparatus according to an embodiment of the present application.

As shown in fig. 6, the apparatus includes a monitoring optimization unit 300 in addition to the crowdsourcing processing unit 100 and the algorithm processing unit 200 in the foregoing embodiment.

The monitoring optimization unit 300 is configured to: after the trained prediction model is determined to accord with the expected effect, acquiring entropy value distribution information of the new online data according to the new labeling result and the prediction result; setting a prediction result alarm condition according to the entropy value distribution information of the new online data; monitoring subsequent prediction results of the prediction model; and when the subsequent prediction result of the prediction model meets the prediction result alarm condition, selecting active learning data to carry out incremental training on the prediction model to obtain the optimized prediction model.

The monitoring of the subsequent prediction results of the prediction model may include: acquiring the subsequent prediction results of the prediction model at a preset time interval; acquiring entropy information of the subsequent prediction results; and monitoring the entropy information of the subsequent prediction results to determine whether a preset entropy alarm condition is reached.
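The text does not define how the entropy of a prediction result is computed. A common choice, assumed here for illustration, is the Shannon entropy of the class-probability distribution that the model outputs for each item; the function names are hypothetical.

```python
import math


def prediction_entropy(probabilities):
    """Shannon entropy (natural log) of one predicted class distribution.

    Assumption: each prediction result is a list of class probabilities summing to 1.
    """
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def batch_entropies(batch_probabilities):
    """Entropy of every prediction collected during one monitoring interval."""
    return [prediction_entropy(p) for p in batch_probabilities]
```

A confident prediction such as `[1.0, 0.0]` has entropy 0, while a maximally uncertain two-class prediction `[0.5, 0.5]` has entropy ln 2, so a rising share of high-entropy predictions suggests the model is becoming less certain about the online data.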

The model building device provided by this embodiment monitors the prediction model in real time after it goes online. Once the prediction effect of the model is found to have degraded, the self-optimization process of the prediction model is started. When the data distribution changes over time or the user data type changes, the problem can be detected promptly and the model is automatically optimized in time, which safeguards the online effect and avoids the model performing poorly on certain specific data.

In practical application, the model construction device can be used as an automatic knowledge platform to realize the functions of automatic modeling, on-line data monitoring and model self-optimization. The business processing flow of the automated knowledge platform can be as shown in fig. 7.

First, the acquired online data is sent to the crowdsourcing processing unit, which labels a first batch of training data for training the model.

Then, model training is carried out by the algorithm processing unit to obtain a trained prediction model. When the prediction model meets expectations, the prediction results produced by the model after it goes online are sent to the monitoring optimization unit.

When the monitoring optimization unit raises an alarm, active learning data is selected to optimize the prediction model.

If the monitoring optimization unit does not raise an alarm, the prediction result is output.

The crowdsourcing processing unit mainly comprises a crowdsourcing labeling subunit and a quality control subunit. The crowdsourcing labeling subunit can be a data labeling platform on which crowdsourced labeling personnel label the input data to obtain labeled data that serves as model training data. The quality control subunit mainly audits the crowdsourced labeling quality and, according to the audit result, adjusts the labeling specification, the crowdsourcing personnel and the like, so as to control the labeling quality and ensure the accuracy of the training data and test data.

The crowdsourcing labeling subunit may further include a classification labeling module, a sequence labeling module and the like, and further labeling modules can be added later as required. Taking the classification labeling module as an example, the label set (number of categories) can be user-defined, as can the number of people required to label each item and the number of people whose labels must agree before a result is recovered, so as to ensure the labeling quality. Trap questions can also be added to evaluate the labeling quality of the labeling personnel and prevent cheating.
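The "several people label, several people agree" rule and the trap-question check can be sketched as below. This is only an illustrative reading: the function names, the use of the most frequent label as the recovered result, and the representation of trap questions as a dictionary of known answers are assumptions.

```python
from collections import Counter


def recover_label(labels, required_agreement):
    """Return the label that at least `required_agreement` people agree on, else None.

    labels: the labels given to one item by the people assigned to it.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= required_agreement else None


def trap_accuracy(answers, trap_gold):
    """Share of trap questions one labeling person answered correctly.

    answers: dict item_id -> label submitted by the person.
    trap_gold: dict item_id -> known correct label for the trap questions.
    """
    attempted = [item for item in trap_gold if item in answers]
    if not attempted:
        return None
    hits = sum(1 for item in attempted if answers[item] == trap_gold[item])
    return hits / len(attempted)
```

For instance, with three people per item and an agreement requirement of two, the labels `["pos", "pos", "neg"]` are recovered as `"pos"`, whereas `["pos", "neg", "neu"]` yield no result and the item would need to be relabeled.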

The business processing flow of the crowdsourcing processing unit can be as shown in fig. 8. When data to be labeled come in, managers formulate corresponding labeling specifications and labeling methods according to requirements, train labeling personnel and finish labeling work by the labeling personnel.

Quality control in the quality control subunit has two main aspects. In the first aspect, the labeling results are sampled and quality-inspected by an auditor to obtain the data labeling quality. In the second aspect, the labeling accuracy of each labeling person is obtained from statistics on the labeling results, and the data labeling quality is analyzed by comparing each person's labeling accuracy with the overall accuracy of all crowdsourced labeling personnel; see the labeling quality acquisition scheme of the previous embodiment.

If many people have a low accuracy, the data is considered to be problematic, or the labeling specification needs to be optimized and revised and the labeling personnel retrained.

The algorithm processing unit offers a variety of machine learning models and deep learning models. It selects different models for training according to the task type and, by comparing the effect of each model on the same type of task, selects a better model at the initial stage. For example, for a classification task, the candidates may include traditional machine learning algorithms such as SVM, logistic regression and random forest, as well as deep learning methods such as RNN and LSTM.

The model parameters and the number of network layers can be adjusted according to the labeling quality and labeling difficulty given by the crowdsourcing processing unit. For example, when the labeling difficulty reaches 80%, which falls in the difficult category, the number of network layers of the deep learning model may be set to 5, the number of hidden-layer nodes may be set to 800, and so on.
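A lookup of this kind might be written as follows. Only the 80% tier (5 layers, 800 hidden nodes) comes from the text; the lower tiers, their boundaries and the function name `choose_network_config` are hypothetical placeholders added so the example is complete.

```python
def choose_network_config(labeling_difficulty):
    """Map a labeling difficulty in [0, 1] to deep learning hyperparameters."""
    if not 0.0 <= labeling_difficulty <= 1.0:
        raise ValueError("labeling_difficulty must be in [0, 1]")
    if labeling_difficulty >= 0.8:
        # Difficult category: the values given in the description.
        return {"num_layers": 5, "hidden_units": 800}
    if labeling_difficulty >= 0.5:
        # Hypothetical medium tier, not specified in the description.
        return {"num_layers": 3, "hidden_units": 400}
    # Hypothetical easy tier, not specified in the description.
    return {"num_layers": 2, "hidden_units": 200}
```

Keeping the mapping in one small function makes it easy to replace the placeholder tiers with values tuned for a particular task type.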

The business processing flow of the algorithm processing unit can be as shown in fig. 9.

First, the acquired online data is preprocessed, for example by data cleaning: de-duplicating data, removing irrelevant data, normalizing data formats, and handling special characters, emoticons, garbled text and the like. Training data is then selected and sent to the crowdsourcing processing unit for labeling. The labeled result is split into training set data (train data) and validation set data (valid data), and model training is carried out to obtain an initial model, namely a preliminarily trained prediction model.
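A minimal sketch of the cleaning and splitting steps is given below, assuming the online data is a list of text strings. The specific cleaning rules (stripping control characters and the Unicode replacement character, collapsing whitespace, exact-match de-duplication), the 20% validation fraction and the function names are illustrative assumptions rather than details from the description.

```python
import random
import re


def clean_records(texts):
    """Remove garbled characters, normalize whitespace, drop empties and duplicates."""
    seen = set()
    cleaned = []
    for text in texts:
        # Control characters and U+FFFD often indicate mis-decoded (garbled) text.
        text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]", "", text)
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def split_train_valid(labeled_items, valid_fraction=0.2, seed=0):
    """Shuffle labeled items and split them into (train data, valid data)."""
    items = list(labeled_items)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_fraction)
    return items[n_valid:], items[:n_valid]
```

In the flow described above, `clean_records` would run before the data is sent for crowdsourced labeling, and `split_train_valid` would run on the labeled result.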

The difficulty of data labeling can be derived from the accuracy of the crowdsourced labeling result (after ruling out problems such as cheating), and the corresponding model or model parameters can be set for modeling according to that difficulty. For example, in the case of deep learning, a more complex network structure may be set if the task is difficult.

Next, the initial model is used to predict new online data, and a batch of data is randomly sampled as test set data and sent to the crowdsourcing processing unit for labeling, so that the model prediction effect can be evaluated. If the prediction effect meets expectations, the model enters the online stage. If not, it is judged whether the model is overfitting or underfitting: if it is overfitting, training data is supplemented; if it is underfitting, the model complexity is adjusted or the model is replaced. At the same time, the influence of data labeling quality problems is checked, and the model is optimized until it meets the requirements.
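The description does not say how overfitting and underfitting are told apart. A common heuristic, assumed here for illustration only, compares training accuracy with test accuracy: a large gap suggests overfitting, while low accuracy on both suggests underfitting. The function name and the default gap tolerance are hypothetical.

```python
def diagnose_model(train_accuracy, test_accuracy, expected_accuracy,
                   gap_tolerance=0.1):
    """Return the next action in the evaluation flow as a short string."""
    if test_accuracy >= expected_accuracy:
        return "go_online"
    if train_accuracy - test_accuracy > gap_tolerance:
        # Fits the training data much better than the test data.
        return "overfit: supplement training data"
    # Performs poorly on both training and test data.
    return "underfit: adjust model complexity or replace the model"
```

For example, `diagnose_model(0.99, 0.70, 0.90)` reports overfitting, and `diagnose_model(0.72, 0.70, 0.90)` reports underfitting. In either case the labeling quality of the test data should also be checked before acting on the result.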

After the model prediction effect meets expectations, it must be taken into account that online data may change (because of time, source and the like), so the model needs to be monitored in real time. The monitoring optimization unit monitors the model prediction effect in real time and triggers an alarm if a problem is found. Active learning data is then selected based on the model prediction results and the alarm threshold and sent to the crowdsourcing processing unit for labeling, so that the model is further iteratively optimized.

The business processing flow of the monitoring optimization unit can be as shown in fig. 10.

First, after the initial model reaches the standard, statistics are computed on the test set to obtain its entropy distribution, and the prediction accuracy of the model in different entropy intervals is calculated. An entropy threshold e is obtained according to the standard-reaching condition, together with the proportion c of data whose entropy is greater than the entropy threshold under the relevant categories.
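One way to derive e and c from a labeled test set is sketched below. The description does not spell out the procedure, so this example makes an assumption: e is taken as the largest entropy value for which the accuracy over all test items with entropy at or below it still meets the target accuracy, and c is the share of test items above e. Per-category handling is omitted for brevity; the same function could be applied to each category's items separately.

```python
def entropy_threshold_and_ratio(samples, target_accuracy):
    """Return (e, c) from test samples given as (entropy, is_correct) pairs."""
    ordered = sorted(samples, key=lambda s: s[0])
    threshold = None
    correct = 0
    for count, (entropy, is_correct) in enumerate(ordered, start=1):
        correct += int(is_correct)
        # Only evaluate at the end of a run of equal entropy values.
        last_of_run = count == len(ordered) or ordered[count][0] != entropy
        if last_of_run and correct / count >= target_accuracy:
            threshold = entropy
    if threshold is None:
        raise ValueError("no entropy threshold meets the target accuracy")
    ratio = sum(1 for entropy, _ in ordered if entropy > threshold) / len(ordered)
    return threshold, ratio
```

With six test items whose three lowest-entropy predictions are all correct and a target accuracy of 0.9, the threshold falls at the third item's entropy and c is 0.5.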

Second, the model prediction results are monitored at a fixed time interval Δt and a floating threshold threshold3 is set. An alarm is triggered when the proportion of data whose entropy is greater than the entropy threshold e under a predicted category exceeds c × threshold3.
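The alarm condition can be expressed directly in code. The sketch below checks one predicted category for one monitoring interval; the function and parameter names are illustrative, while the comparison itself follows the rule stated above (share of entropies above e compared with c × threshold3).

```python
def entropy_alarm(entropies, entropy_threshold, baseline_ratio, floating_threshold):
    """Return True when the high-entropy share exceeds c * threshold3.

    entropies: entropy values of the predictions in one category during one interval.
    entropy_threshold: e, obtained from the test set.
    baseline_ratio: c, the share of test data with entropy above e.
    floating_threshold: threshold3.
    """
    if not entropies:
        return False  # nothing was predicted in this interval
    high = sum(1 for h in entropies if h > entropy_threshold)
    return high / len(entropies) > baseline_ratio * floating_threshold
```

For example, with e = 0.5, c = 0.2 and threshold3 = 1.5, an interval in which half of the predictions have entropy above 0.5 triggers the alarm, while an interval with a quarter of them above 0.5 does not.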

The floating threshold is determined according to the online data quality requirement and depends on the specific task. For example, the overall error rate of an entropy interval can be obtained from statistics on the data proportion and data accuracy of that interval. An estimate of the overall data accuracy can be obtained by multiplying the overall error rate of the entropy interval by the proportion of data whose entropy is greater than the entropy threshold e; combined with the requirement P₀ for the overall data accuracy, the corresponding data proportion, and hence the floating threshold threshold3, can be inferred.
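One way to make this inference concrete is the following worked derivation. It is an interpretation, not a formula given in the description: r denotes the error rate within the interval of entropy greater than e, q denotes the proportion of data with entropy greater than e, and errors among data with entropy at or below e are assumed to be negligible.

```latex
\begin{gathered}
\hat{P} \approx 1 - r\,q \\
\hat{P} \ge P_0 \;\Longrightarrow\; q \le \frac{1 - P_0}{r} \\
c \times \text{threshold3} = \frac{1 - P_0}{r}
\;\Longrightarrow\;
\text{threshold3} = \frac{1 - P_0}{r\,c}
\end{gathered}
```

Under these assumptions, with r = 0.4, P₀ = 0.9 and c = 0.2, the largest tolerable high-entropy proportion is 0.25, which corresponds to threshold3 = 1.25.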

The automated knowledge platform provided by this application frees up developers' time by directly combining online data with the development environment, linked together through the crowdsourcing processing unit. It effectively promotes the integration of data labeling, model training, result monitoring and model optimization, and saves the cost of data labeling, model development and model optimization. In addition, the modeling threshold for non-professionals is substantially lowered and their modeling results are improved: a model can be obtained as soon as data is available, and a model in use can be monitored in real time. Moreover, the monitoring optimization unit safeguards the online effect and avoids the model performing poorly on certain specific data. When the data distribution changes over time or the user data type changes, the problem can be detected promptly and the model can be automatically optimized in time.

The model building device provided by the embodiment of the invention comprises a processor and a memory, wherein the crowdsourcing processing unit 100, the algorithm processing unit 200, the monitoring optimization unit 300, the crowdsourcing marking subunit, the quality control subunit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more, and convenience of model construction is improved by adjusting kernel parameters.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the model construction method when executed by a processor.

The embodiment of the invention provides a processor, which is used for running a program, wherein the model construction method is executed when the program runs.

As shown in fig. 11, an embodiment of the present invention further provides an electronic device 10, which includes at least one processor 11, at least one memory 12 connected to the processor 11, and a bus 13; wherein, the processor 11 and the memory 12 complete mutual communication through a bus 13; the processor 11 is arranged to call program instructions in the memory 12 to perform the model building method described above. The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:

acquiring online data;

acquiring the labeling quality of the labeling result;

acquiring new online data;

monitoring subsequent prediction results of the prediction model;

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method of model construction, comprising:

acquiring online data;

acquiring the labeling quality of the labeling result;

2. The method of claim 1, wherein the obtaining the labeling quality of the labeling result comprises:

3. The method of claim 1 or 2, wherein after said obtaining the labeling quality of the labeling result, the method further comprises:

4. The method of claim 1 or 2, wherein selecting the corresponding machine learning model and the corresponding model training parameters according to the current task type and the labeling quality of the labeling result comprises:

5. The method of claim 1, wherein after said training the corresponding machine learning model with the labeled training data according to the corresponding model training parameters to obtain a trained predictive model, the method further comprises:

acquiring new online data;

6. The method of claim 5, wherein after determining that the trained predictive model is satisfactory for expected effectiveness, the method further comprises:

monitoring subsequent prediction results of the prediction model;

7. The method of claim 6, wherein said monitoring the results of the later prediction of the predictive model comprises:

8. A model building apparatus, comprising:

9. A storage medium, characterized in that the storage medium comprises a stored program, wherein a device on which the storage medium is located is controlled to perform the model building method according to any one of claims 1-7 when the program is run.

10. A model building device is characterized by comprising at least one processor, at least one memory connected with the processor, and a bus; the processor and the memory complete mutual communication through a bus; a processor is used to call program instructions in the memory to perform the model building method of any one of claims 1-7.