CN115830419A - Data-driven artificial intelligence technology evaluation system and method

Data-driven artificial intelligence technology evaluation system and method

Info

Publication number
CN115830419A
Authority
CN
China
Prior art keywords
evaluation
data
result
model
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310087039.0A
Other languages
Chinese (zh)
Inventor
丰强泽
齐红威
何鸿凌
肖永红
王大亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Shuyuntang Intelligent Technology Co ltd
Datatang Beijing Technology Co ltd
Original Assignee
Hebei Shuyuntang Intelligent Technology Co ltd
Datatang Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Shuyuntang Intelligent Technology Co ltd, Datatang Beijing Technology Co ltd filed Critical Hebei Shuyuntang Intelligent Technology Co ltd
Priority to CN202310087039.0A
Publication of CN115830419A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data-driven artificial intelligence technology evaluation system and method, relating to the technical field of artificial intelligence evaluation. The system comprises an evaluation object unit, an evaluation process unit, an evaluation result unit and an algorithm optimization unit. Data information of the evaluation object unit is transmitted to the evaluation process unit for evaluation, and the evaluation result of the evaluation process unit is transmitted to the evaluation result unit; the algorithm optimization unit receives the evaluation result data and returns the result data that needs optimization to the evaluation object unit for optimization. The method comprises the steps of generating an evaluation task, executing the evaluation task, and generating an evaluation result. For evaluation objects in different technical fields and different application scenarios, the invention can quickly select a suitable evaluation tool, evaluation data, evaluation standard and benchmark model, execute the evaluation, output an evaluation result, and drive algorithm optimization of the evaluation object, thereby greatly lowering the technical threshold for users to perform artificial intelligence evaluation.

Description

Data-driven artificial intelligence technology evaluation system and method
Technical Field
The invention relates to the technical field of artificial intelligence evaluation, and in particular to a data-driven artificial intelligence technology evaluation system and method.
Background
AI is now ubiquitous, the types of AI technologies are numerous, and the same AI technology may call for different evaluation methods in different application scenarios. For example, accuracy evaluation of speech recognition computes the word error rate (WER), while accuracy evaluation of semantic segmentation computes pixel accuracy, mean pixel accuracy and MIoU. Even for the same speech recognition technology, evaluation in a smart speaker scenario must consider evaluation data covering different speaking distances, different room reverberation and different types of indoor noise, whereas evaluation in an intelligent customer service scenario must consider telephone channel types and the professional terminology of different customer service industries. In addition, AI technology has many evaluation dimensions: whether the function meets the requirements, whether the accuracy reaches the required value, how long inference takes, how much computing resource is consumed, and whether potential safety hazards exist all need to be tested thoroughly. Because of the specialized nature of artificial intelligence, evaluating an AI technology requires developing a dedicated evaluation tool according to the characteristics of that technology, constructing targeted evaluation data, and finding an applicable evaluation standard and an industry benchmark model before the evaluation can be performed. Evaluation is an indispensable link in the artificial intelligence development process. Its purpose is to detect whether a product/system/platform works correctly, how its indexes perform, and whether errors or vulnerabilities exist, which is of significant value. In terms of content, artificial intelligence testing covers not only technical indexes such as function, performance and safety, but also social indexes such as privacy and ethics. The later testing happens and defects are discovered, the higher the cost of remediation. Moreover, because AI technologies are so varied and different AI technologies need different evaluation methods in different application scenarios, the industry typically formulates a specific evaluation standard for the characteristics of a specific AI technology, collects corresponding evaluation cases, and develops a corresponding evaluation tool for targeted evaluation; the resulting evaluation process and experience are difficult to transfer to other AI technologies or application scenarios.
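To make the kind of metric computation involved concrete, the following is a minimal, illustrative sketch (not part of the patent disclosure) of how the word error rate mentioned above is typically computed from a reference transcript and a model hypothesis:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum word edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + substitution)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the living room light", "turn on living room lights"))  # 2/6
```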
Chinese patent publication No. CN113569988A discloses an algorithm model evaluation method and system. The method includes: obtaining corpus data and dividing it into several types of corpora according to application scenario; evaluating the new and old algorithm models on each corpus to obtain corresponding evaluation data, the new and old algorithm models being trained on an algorithm model training platform; and judging whether the new algorithm model passes the evaluation according to the evaluation data and a preset evaluation standard, releasing the new algorithm model if it does, and otherwise improving it according to the evaluation data. However, that patent automates the evaluation of a single AI task based on evaluation steps and evaluation data preset manually for the characteristics of that task, and its evaluation process and method are difficult to apply to AI models other than intent recognition.
Chinese patent publication No. CN112988165A discloses a Kubernetes-based interactive modeling method, apparatus, electronic device and storage medium, relating to the technical field of neural network models. The method comprises: configuring an instance on a front-end page; building a Kubernetes cluster and connecting it with a back-end server; starting an interactive modeling environment based on Jupyter Notebook or an online integrated development environment in the cluster; and calling a computing framework integrated in the Notebook, or creating a model based on the instance using the real-time code completion and error prompts of the online integrated development environment. The method relies on Kubernetes and an interactive modeling environment comprising Notebook and an online IDE. However, that patent is aimed at training and developing AI models; many similar technologies already exist in the industry whose goal is to lower the threshold for AI model developers and rapidly develop AI models through a dynamic interactive environment. Training and evaluating an AI model are two different links with completely different method steps, and the patent does not concern the evaluation of AI technology.
Therefore, the prior art lacks a general artificial intelligence technology evaluation method. What is needed is a method and system that, according to evaluation requirements in different technical fields and different application scenarios, quickly selects a suitable evaluation tool, evaluation data, evaluation standard and benchmark model to generate and execute an evaluation task, outputs an evaluation result, and drives algorithm optimization of the evaluation object, so as to lower the technical threshold for users to perform artificial intelligence evaluation.
Disclosure of Invention
The invention provides a data-driven artificial intelligence technology evaluation system and method to address the above problems in the prior art.
To achieve this purpose, the invention adopts the following technical solution:
a data-driven artificial intelligence technology evaluation system comprises an evaluation object unit, an evaluation process unit, an evaluation result unit and an algorithm optimization unit, wherein data information of the evaluation object unit is transmitted to the evaluation process unit for evaluation, and the evaluation result of the evaluation process unit is transmitted to the evaluation result unit; the algorithm optimization unit receives the evaluation result data and returns the result data that needs optimization to the evaluation object unit for optimization;
the evaluation process unit comprises an evaluation standard library, an evaluation tool library, an evaluation database, a benchmark model library and an evaluation scheme library.
Further, based on the above technical solution, the evaluation object unit comprises at least intelligent software, intelligent models and intelligent hardware.
Further, based on the above technical solution, the evaluation result unit comprises a function evaluation result, a performance evaluation result, an implementation evaluation result and a safety evaluation result.
A data-driven artificial intelligence technology evaluation method comprises the following steps:
Step S1: generating an evaluation task;
Step S2: executing the evaluation task;
Step S3: generating an evaluation result.
Further, based on the above technical solution, the process of generating the evaluation task in step S1 is as follows:
one or more evaluation standards, evaluation tools, evaluation data sets and benchmark models are selected from the evaluation standard library, the evaluation tool library, the evaluation database and the benchmark model library respectively, and the evaluation task is generated from the selected evaluation standards, evaluation tools, evaluation data and benchmark models.
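Purely as an illustration (the patent does not specify a data structure), an evaluation task assembled from the four libraries could be represented as follows; all identifiers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EvaluationTask:
    standards: list[str]         # chosen from the evaluation standard library
    tools: list[str]             # chosen from the evaluation tool library
    datasets: list[str]          # chosen from the evaluation database
    benchmark_models: list[str]  # chosen from the benchmark model library

# Hypothetical selection mirroring the face recognition example given later.
task = EvaluationTask(
    standards=["face_recognition_evaluation_standard"],
    tools=["face_recognition_evaluation_tool"],
    datasets=["multi_pose_faces", "occluded_faces"],
    benchmark_models=["FaceNet", "VGGNet"],
)
```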
Further, based on the above technical solution, the process of executing the evaluation task in step S2 comprises the following steps:
Step S21: the evaluation object reads the raw data in the evaluation data and outputs its result after completing model inference, in the form {raw data; model result};
Step S22: the evaluation tool reads the index set defined in the evaluation standard as its output constraint, reads the {raw data; manual annotation result} pairs in the evaluation data, and then computes each index in the index set from the evaluation object's output {raw data; model result};
Step S23: the evaluation result is output in the form {index 1 = value 1; index 2 = value 2; …; index n = value n}.
Further, based on the above technical solution, the evaluation result in step S3 is generated from the output of step S2, and its generated forms include at least a comparison table of the results of the user model and different benchmark models, a comparison table of the results of the user model on different evaluation data, a ranking list, and an evaluation report.
Further, based on the above technical solution, after the evaluation result is generated, if the evaluation result of the user model does not reach the initially set target, algorithm optimization is started.
Further, based on the above technical solution, the algorithm optimization includes at least one or more of the following optimization modes:
Optimization mode 1: the user presets a lower threshold for one or more indexes; the thresholds are compared with each index value of the evaluation result, and if any index value fails to reach the threshold set by the user, it is judged that the target is not reached and algorithm optimization is started;
Optimization mode 2: if each index value in the evaluation result of the user model is inferior to the corresponding index value in the evaluation result of the benchmark model, it is judged that the target is not reached;
Optimization mode 3: if the evaluation result of the user model ranks low in the ranking list, it is judged that the target is not reached.
Further, based on the above technical solution, when the target is not reached, the user is automatically prompted to perform algorithm optimization; the unmet indexes, their gaps from the target, the related evaluation data and the benchmark models are all returned to the user, prompting targeted algorithm optimization. The optimized user model then continues to execute the evaluation task as the evaluation object, iterating in a loop until the evaluation result of the user model reaches the target.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a data-driven artificial intelligence technology evaluation system and method, which can quickly select a proper evaluation tool, evaluation data, evaluation standard and reference model and then execute evaluation aiming at evaluation objects in different technical fields and different application scenes, output an evaluation result and promote algorithm optimization of the evaluation objects, thereby greatly reducing the technical threshold of artificial intelligence evaluation by a user.
Drawings
FIG. 1 is a general block diagram of an evaluation system of the present invention;
FIG. 2 is a logic diagram of the evaluation system of the present invention.
Detailed Description
To make the purpose and technical solution of the present invention clearer, the technical solution of the present invention will be described clearly and completely below with reference to the embodiments.
Example 1
As shown in fig. 1 and fig. 2, a data-driven artificial intelligence technology evaluation system includes an evaluation object unit, an evaluation process unit, an evaluation result unit and an algorithm optimization unit. Data information of the evaluation object unit is transmitted to the evaluation process unit for evaluation, and the evaluation result of the evaluation process unit is transmitted to the evaluation result unit; the algorithm optimization unit receives the evaluation result data and returns the result data that needs optimization to the evaluation object unit for optimization. The evaluation object is an AI model or algorithm, which can be a user model or a benchmark model; a benchmark model comes from the benchmark model library and is used to compare evaluation results with the user model. The evaluation object unit comprises at least intelligent software, intelligent models and intelligent hardware.
The evaluation process unit comprises an evaluation standard library, an evaluation tool library, an evaluation database, a benchmark model library and an evaluation scheme library, together with evaluation task generation, evaluation task execution and evaluation result generation. The five libraries constitute the evaluation knowledge base management. The evaluation standard library provides the methodological basis for evaluation and is generally derived from international, national and industry standards related to artificial intelligence. Each evaluation standard comprises an evaluation standard document and an evaluation index list. The evaluation standard document is a document in pdf or word format for users to browse; the evaluation index list is a structured index table, which may for example include the evaluation index name and the evaluation index description, and is used to guide the execution and output of the evaluation tool: the output of the evaluation tool must conform to the definitions in the evaluation index list. The evaluation index list can be distilled from the evaluation standard document manually or by open semi-structured data parsing technology. As shown in Table 1, the metadata of each evaluation standard includes the following fields, each of which defines a uniform writing specification. As shown in Table 2, the metadata of each benchmark model includes the following fields, each of which defines a uniform writing specification.
TABLE 1 evaluation criteria metadata field requirements
TABLE 2 metadata field requirements of the reference model
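The images for Tables 1 and 2 are not reproduced in this text. Purely to illustrate the general shape such metadata could take, the sketch below builds one hypothetical evaluation standard record; only the index name and description fields come from the text above, and every other field name is an assumption:

```python
# Hypothetical metadata record for one evaluation standard (the Table 1 image
# is not reproduced; field names beyond "name"/"description" are assumptions).
evaluation_standard_metadata = {
    "standard_name": "face_recognition_evaluation_standard",
    "standard_document": "face_recognition_standard.pdf",  # pdf/word document for browsing
    "evaluation_index_list": [
        {"name": "false_recognition_rate", "description": "share of incorrect acceptances"},
        {"name": "rejection_rate", "description": "share of incorrect rejections"},
    ],
}
```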
The evaluation tool library provides the executable-program basis for evaluation and contains evaluation tools for various AI tasks; each evaluation tool is an executable program. An evaluation tool reads the raw data in the evaluation data and compares the model's post-inference result with the manual annotation result in the evaluation data. The output indexes of an evaluation tool must be consistent with the corresponding evaluation standard. The evaluation database provides the data basis for evaluation and contains various test set data. An evaluation data set comprises two parts, a complete data package and a sample data package. The complete data package contains the raw data and the manual annotation results, and its directory is read for inference and comparison during evaluation. The sample data package is a sample of the data set, shown to the evaluating user to give an intuitive view of the evaluation data. A benchmark model represents the technical benchmark of its technical field and is used for comparative evaluation with the user model, so that users can understand their own technical strengths and weaknesses; benchmark models are packaged as Docker images or online APIs. As shown in Table 3, the metadata of each evaluation tool includes the following fields, each of which defines a uniform writing specification.
TABLE 3 metadata field requirements for evaluation tools
The evaluation database stores the evaluation data, which are the results of manually annotating and checking raw data; the manual annotation result serves as the reference answer. The evaluation object reads the raw data in the evaluation data and outputs a model result after model inference. The evaluation standard library stores evaluation standards, each comprising a standard document and the evaluation index set extracted from it; they provide methodological guidance for evaluation and generally originate from international, national and industry standard documents related to artificial intelligence. The evaluation tool library stores evaluation tools, which are executable programs that compute each evaluation index defined in the evaluation standard from the model output (the AI model's output) and the manual annotation result, and output the value of each evaluation index as the final evaluation result. An evaluation tool is mainly embodied as test software or a test script, i.e., a program that executes certain test functions, and it operates under the guidance of the evaluation standard. The evaluation result unit comprises a function evaluation result, a performance evaluation result, an implementation evaluation result and a safety evaluation result. A benchmark model represents the technical benchmark of its technical field and is used for comparison with the evaluation object, so that users can understand their own technical strengths and weaknesses. As shown in Table 4 below, the metadata of each evaluation scheme includes the following fields, each of which defines a uniform writing specification.
TABLE 4 metadata field requirements of the evaluation scheme
The evaluation schemes in the evaluation scheme library make it convenient for users to create evaluation tasks: schemes can be created in advance for users to select directly. An evaluation scheme is a flow comprising evaluation tools, evaluation data, benchmark models and an evaluation standard. For example, a face recognition precision evaluation scheme may comply with the face recognition precision test requirements in the face recognition evaluation standard, use the industry-standard LFW face evaluation data, face evaluation data under different occlusion states, and face evaluation data of different ethnicities and ages, test with evaluation tools capable of computing accuracy, precision, recall, F1 and AUC, and compare the evaluation results with industry benchmark face recognition models such as FaceNet, VGGNet and GoogLeNet. One evaluation task comprises one or more evaluation tools, one or more evaluation data sets, one or more evaluation standards and one or more benchmark models. The same evaluation tool, evaluation data, evaluation standard or benchmark model can be included in multiple evaluation tasks, and an evaluation task may reference zero, one or multiple evaluation schemes.
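A minimal sketch of how such an evaluation scheme record could look, using only elements named in the example above (the field names themselves are assumptions, not the patent's format):

```python
# Hypothetical record for the face recognition precision evaluation scheme
# described above; all identifiers are illustrative.
face_recognition_precision_scheme = {
    "standard": "face_recognition_evaluation_standard",
    "datasets": ["LFW", "occluded_faces", "multi_ethnicity_multi_age_faces"],
    "tools": ["accuracy_precision_recall_f1_auc_tool"],
    "benchmark_models": ["FaceNet", "VGGNet", "GoogLeNet"],
}
```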
Example 2
Based on the data-driven artificial intelligence technology evaluation system of embodiment 1, a data-driven artificial intelligence technology evaluation method is implemented, comprising the following steps:
Step S1: generating an evaluation task.
Specifically, an evaluation task is a single evaluation initiated by a user. The user selects suitable evaluation tools, evaluation data, benchmark models and an evaluation standard to create a flow, thereby determining which evaluation standard the task follows when executed, which evaluation data it uses, which evaluation tools execute it, and which benchmark models the evaluation result is compared against.
The process of generating the evaluation task is as follows: one or more evaluation standards, evaluation tools, evaluation data sets and benchmark models are selected from the evaluation standard library, the evaluation tool library, the evaluation database and the benchmark model library respectively, and the evaluation task is generated from the selected evaluation standards, evaluation tools, evaluation data and benchmark models.
The specific generation process is as follows. An evaluation tool is associated with an evaluation standard, indicating that the evaluation is performed in compliance with that standard. An evaluation tool is associated with one or more evaluation data sets, indicating that diverse evaluation data are read to perform the evaluation. An evaluation tool is associated with one or more benchmark models, indicating that evaluation results are output and compared between the user model and one or more industry benchmark models. A benchmark model is associated with one or more evaluation tools, indicating that one evaluation task can cover several evaluation dimensions simultaneously. For example, a face recognition precision evaluation task may select the face recognition evaluation standard as its evaluation standard, a face recognition evaluation tool as its evaluation tool, multi-pose face data and occluded face data as its evaluation data, and the FaceNet and VGGNet face recognition models as its benchmark models, computing the values of two indexes, the false recognition rate and the rejection rate. Optionally, an evaluation task can also be generated by selecting evaluation schemes from the evaluation scheme library. One evaluation task can select one or more evaluation schemes; if several schemes are selected, they can be connected in series or in parallel, where series connection means the schemes are executed sequentially in connection order when the task runs, and parallel connection means the schemes are executed concurrently, as the sketch below illustrates. For example, a face recognition evaluation task can select both a face recognition precision evaluation scheme and a face recognition safety evaluation scheme, evaluating the face recognition model in terms of both precision and safety.
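The series/parallel connection of schemes can be pictured with the following sketch, assuming a run_scheme function that executes one scheme; the function and its signature are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_scheme(scheme: dict) -> dict:
    """Placeholder: execute one evaluation scheme and return its result."""
    return {}

def run_in_series(schemes: list) -> list:
    # Series connection: schemes execute one after another, in connection order.
    return [run_scheme(s) for s in schemes]

def run_in_parallel(schemes: list) -> list:
    # Parallel connection: schemes execute concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_scheme, schemes))
```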
Step S2: executing an evaluation task;
as shown in FIG. 2 for logical associations, the evaluation tool is the subject of evaluation execution. The process of specifically executing the evaluation task comprises the following steps:
step S21: the evaluation objects (the user model and each reference model) read the original data in the evaluation data through optional modes such as an evaluation data read-write interface and the like, and output results after model inference is finished, wherein the output form is { original data; model results };
step S22: an evaluation tool reads an index set { index 1; index 2; taking the index n as output constraint of an evaluation tool, and reading { original data in evaluation data; manual marking results, and then evaluating the output result of the object { original data; model result calculating each index in the index set;
step S23: outputting an evaluation result in an output form of { index 1= value 1; index 2= value 2; let us turn on, index n = value n }.
Step S3: generating an evaluation result.
The evaluation result is generated from the output of the evaluation tool. Specifically, after the evaluation task has been executed, the tool's output can be packaged in various forms to produce the final evaluation result, for example:
1. and comparing results of the user model and different reference models: by comparing the evaluation effects of the user model and the industry benchmark model under the same evaluation data, the evaluation user can conveniently and quantitatively master the promotion degree of the self model. For example: the face recognition precision evaluation can compare the precision of a user model and industry standard face recognition models such as faceNet, VGGNet, googleNet and the like.
2. A comparison table of the results of the user model on different evaluation data: suitable for evaluation tasks containing multiple evaluation data sets. Comparing the user model's evaluation results across different evaluation data shows the evaluating user clearly in which situations the model performs well and in which its performance is only average. For example, face recognition precision evaluation can use unoccluded and occluded face evaluation data to compare the accuracy of the face recognition technology in the two situations.
3. A ranking list: when several evaluating users participate in the same type of evaluation task, they can be sorted in descending order of their evaluation results to form a ranking list for that type of task (see the sketch after this list).
4. An evaluation report: the evaluation task information, the evaluation process and the outputs of the evaluation tool are organized into an evaluation report document for the evaluating user.
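A minimal sketch of the ranking list assembly; the data and field names are invented for illustration:

```python
# Participants in the same type of evaluation task, sorted in descending
# order of their evaluation result to form the ranking list.
submissions = [
    {"user": "team_a", "accuracy": 0.91},
    {"user": "team_b", "accuracy": 0.95},
    {"user": "team_c", "accuracy": 0.88},
]
ranking_list = sorted(submissions, key=lambda s: s["accuracy"], reverse=True)
for rank, entry in enumerate(ranking_list, start=1):
    print(rank, entry["user"], entry["accuracy"])
```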
After the evaluation result is generated, if the evaluation result of the user model does not reach the initially set target, algorithm optimization is started. The algorithm optimization includes at least one or more of the following optimization modes:
the optimization method comprises the following steps: the user presets a lower threshold of one or more indexes, such as { the false recognition rate is 5%; the rejection rate is 7% }, the lower limit threshold value is compared with each index value of the evaluation result, if one index value does not reach the lower limit threshold value set by the user, the target is judged not to be reached, and algorithm optimization is started;
and the second optimization mode is as follows: each index value in the evaluation result of the user model is different from each index value in the evaluation result of the reference model, and if the user model is not as good as the industry reference model, the user model is judged to be not up to the target;
the optimization mode is three: and after the evaluation result of the user model is ranked in the ranking list, judging that the evaluation result does not reach the target.
When the target is not reached, the user is automatically prompted that algorithm optimization is needed; the unmet indexes, their gaps from the target, the related evaluation data and the benchmark models are all returned to the user, prompting targeted algorithm optimization. The optimized user model then continues to execute the evaluation task as the evaluation object, iterating in a loop until the evaluation result of the user model reaches the target.
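The evaluate-optimize loop can be sketched as follows, using optimization mode 1 as the trigger; evaluate and optimize are placeholders standing in for the evaluation task and the user's own optimization work, and all values are invented:

```python
def target_reached(results: dict, lower_bounds: dict) -> bool:
    # Optimization mode 1: the target is missed if any index value falls below
    # the lower threshold the user preset for it (for error-rate indexes the
    # comparison direction would be reversed).
    return all(results[k] >= v for k, v in lower_bounds.items())

def evaluate(model: dict) -> dict:
    """Placeholder: execute the evaluation task and return {index: value}."""
    return {"precision": 0.90 + 0.02 * model["rounds"],
            "recall": 0.91 + 0.02 * model["rounds"]}

def optimize(model: dict, results: dict) -> dict:
    """Placeholder: the user's targeted algorithm optimization."""
    return {"rounds": model["rounds"] + 1}

model = {"rounds": 0}
lower_bounds = {"precision": 0.93, "recall": 0.93}
results = evaluate(model)
while not target_reached(results, lower_bounds):
    model = optimize(model, results)   # optimize, then re-evaluate
    results = evaluate(model)          # iterate until the target is met
print(results)
```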
Finally, it should be noted that the above contents are only used to illustrate the technical solution of the present invention and do not limit its protection scope. Those skilled in the art may make simple modifications or equivalent substitutions to the technical solution of the present invention without departing from its spirit and scope.

Claims (10)

1. A data-driven artificial intelligence technology evaluation system, characterized by comprising an evaluation object unit, an evaluation process unit, an evaluation result unit and an algorithm optimization unit, wherein data information of the evaluation object unit is transmitted to the evaluation process unit for evaluation, and the evaluation result of the evaluation process unit is transmitted to the evaluation result unit; the algorithm optimization unit receives the evaluation result data and returns the result data that needs optimization to the evaluation object unit for optimization;
the evaluation process unit comprises an evaluation standard library, an evaluation tool library, an evaluation database, a benchmark model library and an evaluation scheme library.
2. The data-driven artificial intelligence technology evaluation system according to claim 1, wherein the evaluation object unit includes at least intelligent software, an intelligent model, and intelligent hardware.
3. The data-driven artificial intelligence technology evaluation system of claim 1, wherein the evaluation result unit comprises a function evaluation result, a performance evaluation result, an implementation evaluation result and a safety evaluation result.
4. A data-driven artificial intelligence technology evaluation method using the data-driven artificial intelligence technology evaluation system of any one of claims 1 to 3, characterized by comprising the following steps:
Step S1: generating an evaluation task;
Step S2: executing the evaluation task;
Step S3: generating an evaluation result.
5. The data-driven artificial intelligence technology evaluation method according to claim 4, wherein the process of generating the evaluation task in step S1 is as follows:
one or more evaluation standards, evaluation tools, evaluation data sets and benchmark models are selected from the evaluation standard library, the evaluation tool library, the evaluation database and the benchmark model library respectively, and the evaluation task is generated from the selected evaluation standards, evaluation tools, evaluation data and benchmark models.
6. The data-driven artificial intelligence technology evaluation method according to claim 5, wherein the process of executing the evaluation task in step S2 comprises the following steps:
Step S21: the evaluation object reads the raw data in the evaluation data and outputs its result after completing model inference, in the form {raw data; model result};
Step S22: the evaluation tool reads the index set defined in the evaluation standard as its output constraint, reads the {raw data; manual annotation result} pairs in the evaluation data, and then computes each index in the index set from the evaluation object's output {raw data; model result};
Step S23: the evaluation result is output in the form {index 1 = value 1; index 2 = value 2; …; index n = value n}.
7. The data-driven artificial intelligence technology evaluation method of claim 6, wherein the evaluation result in step S3 is generated from the output of step S2, and its generated forms include at least a comparison table of the results of the user model and different benchmark models, a comparison table of the results of the user model on different evaluation data, a ranking list, and an evaluation report.
8. The data-driven artificial intelligence technology evaluation method of claim 7, wherein after the evaluation result is generated, if the evaluation result of the user model does not reach the initially set target, algorithm optimization is started.
9. The data-driven artificial intelligence technology evaluation method of claim 8, wherein the algorithm optimization includes at least one or more of the following optimization modes:
Optimization mode 1: the user presets a lower threshold for one or more indexes; the thresholds are compared with each index value of the evaluation result, and if any index value fails to reach the threshold set by the user, it is judged that the target is not reached and algorithm optimization is started;
Optimization mode 2: if each index value in the evaluation result of the user model is worse than the corresponding index value in the evaluation result of the benchmark model, it is judged that the target is not reached;
Optimization mode 3: if the evaluation result of the user model ranks low in the ranking list, it is judged that the target is not reached.
10. The data-driven artificial intelligence technology evaluation method of claim 9, wherein when the target is not reached, the user is automatically prompted that algorithm optimization is needed, and the unmet indexes, their gaps from the target, the related evaluation data and the benchmark models are returned to the user, prompting targeted algorithm optimization; meanwhile, the optimized user model continues to execute the evaluation task as the evaluation object, iterating in a loop until the evaluation result of the user model reaches the target.
CN202310087039.0A 2023-02-09 2023-02-09 Data-driven artificial intelligence technology evaluation system and method Pending CN115830419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310087039.0A CN115830419A (en) 2023-02-09 2023-02-09 Data-driven artificial intelligence technology evaluation system and method


Publications (1)

Publication Number Publication Date
CN115830419A true CN115830419A (en) 2023-03-21

Family

ID=85520920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310087039.0A Pending CN115830419A (en) 2023-02-09 2023-02-09 Data-driven artificial intelligence technology evaluation system and method

Country Status (1)

Country Link
CN (1) CN115830419A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116483733A (en) * 2023-06-12 2023-07-25 数据堂(北京)科技股份有限公司 Multi-dimensional artificial intelligence product evaluation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807566A (en) * 2019-09-09 2020-02-18 腾讯科技(深圳)有限公司 Artificial intelligence model evaluation method, device, equipment and storage medium
CN112416755A (en) * 2020-11-02 2021-02-26 中关村科学城城市大脑股份有限公司 Artificial intelligence model evaluation method and device, electronic equipment and storage medium
CN113569988A (en) * 2021-08-23 2021-10-29 广州品唯软件有限公司 Algorithm model evaluation method and system
CN113778454A (en) * 2021-09-22 2021-12-10 重庆海云捷迅科技有限公司 Automatic evaluation method and system for artificial intelligence experiment platform
CN114968788A (en) * 2022-05-27 2022-08-30 浙江大学 Method, apparatus, medium, and device for automatically evaluating programming capability of artificial intelligence algorithm
CN115329326A (en) * 2022-07-07 2022-11-11 广州大学 Artificial intelligence safety evaluation method and system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination