WO2019095899A1 - Material annotation method and apparatus, terminal, and computer readable storage medium

Info

Publication number
WO2019095899A1
Authority
WO
WIPO (PCT)
Prior art keywords
algorithm model
labeling
annotation
training set
preset algorithm
Application number
PCT/CN2018/109774
Other languages
French (fr)
Chinese (zh)
Inventor
陆艳
刘勇
高洪
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2019095899A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis

Definitions

  • the present disclosure relates to the field of wireless communication technologies, for example, to a material annotation method and apparatus, a terminal, and a computer readable storage medium.
  • the labeling and proofreading of various materials has always required a lot of time and manpower.
  • the labeling and proofreading of material relies on analyzing a large amount of training material. This training material is labeled in advance according to a certain logic, usually by hand, which requires a lot of manpower and time.
  • labeling is in effect a process of interpreting the features of the material. Different people may interpret the same material differently, so material labeling is highly subjective.
  • labelers also differ in their knowledge structure and grasp of grammatical theory, so their labeling results differ and are difficult to unify.
  • the embodiments of the present application provide a material labeling method and device, a terminal, and a computer readable storage medium, which aim to solve the problems in the related art that material labeling is time-consuming and labor-intensive and that labeling results are difficult to unify.
  • an embodiment of the present application provides a material labeling method, which includes: labeling the materials in a set of materials to be labeled according to a preset algorithm model; generating, based on the labeling result, a training set corresponding to the labeling result; and updating the preset algorithm model by means of the training set, for use in the next material annotation.
  • the embodiment of the present application further provides a material labeling device, including: a material labeling module, a training generating module, and an algorithm training module.
  • the material labeling module is configured to label the materials in the set of materials to be labeled according to the preset algorithm model.
  • the training generation module is configured to generate, based on the labeling result, a training set corresponding to the labeling result.
  • the algorithm training module is configured to update the preset algorithm model by means of the training set, for use in the next material annotation.
  • an embodiment of the present application further provides a terminal, including a processor, a memory, and a communication bus; the communication bus is configured to implement connection and communication between the processor and the memory; and the processor is configured to execute a material labeling program stored in the memory to implement the aforementioned material labeling method.
  • the embodiment of the present application further provides a computer readable storage medium storing at least one computer program executable by at least one processor to implement the foregoing material labeling method.
  • FIG. 1 is a flow chart of a material labeling method according to a first embodiment of the present application
  • FIG. 2 is a schematic diagram of a material labeling according to a first embodiment of the present application.
  • FIG. 3 is a detailed flowchart of a material labeling method according to a second embodiment of the present application.
  • FIG. 4 is a schematic diagram of material labeling according to a third embodiment of the present application.
  • FIG. 5 is a schematic diagram of the composition of a material labeling device according to a fourth embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a terminal according to a fifth embodiment of the present application.
  • FIG. 1 is a flowchart of a material labeling method according to a first embodiment of the present application. The method includes steps S101 to S103.
  • in S101, the materials in the set of materials to be labeled are labeled according to a preset algorithm model.
  • in S102, a training set corresponding to the labeling result is generated based on the labeling result.
  • in S103, the preset algorithm model is updated by means of the training set, for use in the next material annotation.
  • the material may include corpus in an intelligent question answering system, text in text recognition, and multimedia material such as audio, video, and pictures. Such material often contains rich content that a computer may not be able to identify and read directly, so it needs to be labeled. Labeling processes the material in the material library so that its various features are marked in a computer-readable way.
  • for example, the information presented in picture material is labeled in the form of text; in face recognition, facial features are labeled with their pixel coordinates and pixel values in the image; and for corpus in a corpus library, various linguistic features are labeled on the corresponding language components so that the computer can identify and read them.
  • the way of labeling differs according to the application scenario. In principle, the features of the material to be labeled are made computer-identifiable according to a certain logic.
  • the algorithm model is the algorithm referenced when labeling material, and the algorithm model referenced by each subsequent material annotation is the one determined after the previous material annotation.
  • the algorithm model is obtained by analyzing a training set. Depending on when it is generated, an algorithm model is roughly one of two types: the initial algorithm model or a transition algorithm model.
  • the initial algorithm model is the first algorithm model in this material annotation; it roughly determines the algorithm logic of all related material annotation in the future.
  • a transition algorithm model is any algorithm model other than the initial algorithm model. Unlike the initial algorithm model, the transition algorithm model usually changes continuously.
  • determining the algorithm model may include: manually labeling the material in an initial material set to generate an initial training set; training on the initial training set to generate the initial algorithm model; labeling the material in a set of materials to be labeled with reference to the initial algorithm model, and updating the initial algorithm model based on the labeling result to form a transition algorithm model; and labeling the material in the next set of materials to be labeled with reference to the transition algorithm model, and updating the transition algorithm model based on the labeling result. Material annotation and model updating are iterated in this way to determine the algorithm model.
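  • As an illustration only, the iteration described above (manual labeling of an initial set, training an initial model, then alternating labeling and model updates) could be sketched as follows. The class `MajorityLabelModel`, the function `annotate_batch`, and the sample materials are hypothetical and not taken from the disclosure; a frequency table stands in for whatever algorithm model is actually trained.

```python
from collections import Counter, defaultdict


class MajorityLabelModel:
    """Toy stand-in for the algorithm model: remembers the most frequent label per material."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, training_set):
        # training_set is an iterable of (material, label) pairs
        for material, label in training_set:
            self.counts[material][label] += 1

    def annotate(self, material):
        # Returns None when the model has never seen this material
        seen = self.counts.get(material)
        return seen.most_common(1)[0][0] if seen else None


def annotate_batch(model, materials, manual_label):
    """Label one material set, falling back to manual labeling; return the training set."""
    training_set = []
    for material in materials:
        label = model.annotate(material)
        if label is None:
            label = manual_label(material)  # hypothetical human labeler
        training_set.append((material, label))
    return training_set


# Made-up materials and labels, for illustration only
human = {"balance": "account", "transfer": "payment", "loan": "credit"}.__getitem__

model = MajorityLabelModel()
initial = annotate_batch(model, ["balance", "transfer"], human)  # fully manual
model.update(initial)                                            # initial algorithm model
second = annotate_batch(model, ["balance", "loan"], human)       # partly automatic
model.update(second)                                             # transition algorithm model
```

  • In this sketch the first batch is labeled entirely by hand because the model is empty, while in the second batch only the unseen material falls back to the human labeler.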
  • the initial algorithm model is generated by first labeling the material in the initial material set by hand.
  • the manual labeling here does not reference any algorithm model; the labeler's own judgment determines how the features of the material are marked.
  • after labeling, the initial training set corresponding to the labeling result is generated from that result.
  • a training set is the set used to train and generate an algorithm model. It usually contains a large number of objects, and training on these objects generates the desired algorithm model.
  • the initial training set is the first training set used to train the algorithm model.
  • training on the initial training set yields the initial algorithm model.
  • verification can also be performed, possibly by other people, so that the initial algorithm model is determined with reference to multiple checks.
  • once the initial algorithm model is determined, it is used as the algorithm model for the second material annotation, that is, as the reference for the next algorithm model.
  • after material is labeled with reference to the initial algorithm model, the corresponding labeling result and the training set generated from that result are obtained. This is a new training set, different from the initial training set, because the material in the second annotation is often different from the material in the first.
  • the training set obtained by labeling with an algorithm model serves as an update package for that model: it is used to update the initial algorithm model so that the model becomes more detailed.
  • the algorithm model obtained at this time is no longer the initial algorithm model, but a transition algorithm model.
  • multiple transition algorithm models are obtained by updating the algorithm model after each round of material labeling. In other words, each material annotation refers to the algorithm model updated after the previous annotation, and the model updated after this annotation becomes the one referenced by the next. The more iterations, the wider the coverage of the algorithm model, the more types and fields of material it covers, and the higher the accuracy of subsequent material labeling.
  • the material in the set of materials to be labeled is labeled according to the algorithm model.
  • the labeling here is the next iteration after the labeling of the previous material set.
  • labeling the material in the set of materials to be labeled according to the algorithm model may include: determining, in that set, a first material that is in the same domain as the algorithm model and a second material that is in a different domain; labeling the first material directly by the algorithm model; and labeling the second material manually.
  • the material to be labeled can thus be roughly divided into two categories: material that can be labeled directly by the algorithm model, which is the first material in the same domain as the model; and material that cannot, which is the second material in a different domain.
  • the first material can be labeled directly because its domain matches the algorithm model. Some of it may nevertheless fall into a category that differs within the same domain and so cannot be labeled directly; that part of the first material is labeled manually. The second material cannot be labeled directly because its domain differs from the algorithm model, and it is usually labeled manually from the start.
  • which materials in the set are first material and which are second material is generally specified in advance by the material provider, since the field of the material is often known before labeling. If the provider does not specify it, the split can be made by keyword screening or by manual judgment; alternatively, all material can be assumed to be in the same field and labeled directly, and the parts that cannot be labeled directly are then separated out as second material of a different field for manual labeling.
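  • A minimal sketch of the keyword-screening option mentioned above, assuming that each material is a short text and that a domain is characterized by a set of keywords. The function `split_by_domain`, the keyword set, and the sample texts are hypothetical; the disclosure does not specify how the screening is performed.

```python
def split_by_domain(materials, domain_keywords):
    """Partition materials into (first, second) material by keyword screening.

    A text counts as first material (same domain as the model) when it
    shares at least one lowercase word with domain_keywords.
    """
    first, second = [], []
    for text in materials:
        words = set(text.lower().split())
        if words & domain_keywords:
            first.append(text)
        else:
            second.append(text)
    return first, second


# Made-up banking keywords and texts, for illustration only
banking_keywords = {"account", "loan", "transfer"}
first_material, second_material = split_by_domain(
    ["Open a savings account", "Track my parcel", "Loan repayment schedule"],
    banking_keywords,
)
```

  • Here the first material would go to the algorithm model and the second material to manual labeling; a real system would likely need a more robust domain test than exact word overlap.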
  • FIG. 2 shows a schematic diagram of material annotation. Material A is the initial material; it is manually labeled to generate training set A, and the algorithm model trained on training set A is the initial algorithm model. Material B is in the same field as material A, that is, consistent with the domain of the algorithm model, and can be labeled directly by the automatic annotation device in which the algorithm model is integrated.
  • the automatic annotation device integrates not only the algorithm model but also other components required for annotation, such as workflow and permission control.
  • material B also contains material B', which is of a different type within the same domain; it cannot be labeled directly by the algorithm model and is labeled manually. Material C is in a different field from material A, that is, inconsistent with the domain of the algorithm model, and is labeled manually from the start.
  • after labeling, the corresponding training set is generated, and the algorithm model updated by that training set is the one referenced for the next material labeling.
  • whether the algorithm model's ability to label material is up to standard is evaluated according to the proportion of first material in the material to be labeled and/or the accuracy of each labeling.
  • each time a set of materials to be labeled is annotated, the first material and second material in it are determined, and the proportion of first material, which can be labeled directly, indicates the labeling ability of the algorithm model. In addition, checking the labels of each set reveals the labeling accuracy, from which the labeling ability can also be judged. If the labeling ability turns out to be weak or not up to standard, training with further material sets may need to continue so that the labeling ability improves gradually.
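  • The two indicators described above can be written as simple ratios. The sketch below is illustrative: the function names, the threshold values, and the decision to require both indicators at once are assumptions, since the disclosure only says "and/or" and gives no numeric standard.

```python
def labeling_ability(n_first, n_total, n_checked, n_correct):
    """Return (auto_ratio, accuracy) for one labeled material set.

    auto_ratio: share of material the model could label directly (first material).
    accuracy:   share of checked labels that were found correct.
    """
    auto_ratio = n_first / n_total if n_total else 0.0
    accuracy = n_correct / n_checked if n_checked else 0.0
    return auto_ratio, accuracy


def is_up_to_standard(auto_ratio, accuracy, min_ratio=0.8, min_accuracy=0.95):
    # Thresholds are hypothetical defaults, not values from the disclosure
    return auto_ratio >= min_ratio and accuracy >= min_accuracy


# Made-up counts: 80 of 100 items labeled directly, 48 of 50 checked labels correct
ratio, acc = labeling_ability(n_first=80, n_total=100, n_checked=50, n_correct=48)
needs_more_training = not is_up_to_standard(ratio, acc)
```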
  • a corresponding training set is generated based on the labeling result. The training set is what makes it possible to generate and update the algorithm model. Since the initial algorithm model has already been generated from the manual labeling of the initial material, subsequent training sets are used to update the algorithm model.
  • the algorithm model is updated by means of the training set, for use in the next material annotation.
  • each material annotation generally refers to the algorithm model updated after the previous annotation. The more iterations, the wider the coverage, the less manual participation is required, and the higher the labeling accuracy.
  • updating the algorithm model by means of the training set may include: verifying the training set; and, after verification is complete, updating the algorithm model with the verified training set.
  • verifying the training set may include: spot-checking a randomly extracted part of the training set; or verifying the entire content of the training set.
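  • The two verification modes above (random spot check or full check) could be selected as in the following sketch. The function `select_for_verification`, the `fraction` parameter, and its default are hypothetical; the disclosure does not state how large a spot-check sample should be.

```python
import random


def select_for_verification(training_set, mode="sample", fraction=0.1, seed=None):
    """Choose which training-set entries a reviewer should verify.

    mode="full"   returns every entry.
    mode="sample" returns a random subset of roughly `fraction` of the entries
                  (at least one entry when the training set is not empty).
    """
    items = list(training_set)
    if mode == "full":
        return items
    if mode != "sample":
        raise ValueError(f"unknown verification mode: {mode!r}")
    k = max(1, round(len(items) * fraction)) if items else 0
    return random.Random(seed).sample(items, k)


entries = [(f"material-{i}", "label") for i in range(100)]  # made-up training set
spot_check = select_for_verification(entries, mode="sample", fraction=0.1, seed=0)
full_check = select_for_verification(entries, mode="full")
```

  • Only after the selected entries have been reviewed, and corrected where necessary, would the training set be used to update the algorithm model.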
  • this embodiment provides a material labeling method in which the algorithm model used for material labeling is determined, the material in the set of materials to be labeled is labeled according to the algorithm model, a corresponding training set is generated based on the labeling result, and the algorithm model is updated by the training set for use in the next material annotation. Because the algorithm model is updated with each labeling result, the workload of manual labeling is reduced and the consistency and accuracy of labeling are improved.
  • FIG. 3 is a detailed flowchart of a material labeling method according to a second embodiment of the present application.
  • the material can include corpus in the intelligent question answering system, text in text recognition, and multimedia materials such as audio and video and pictures.
  • the material to be labeled is marked based on the algorithm model after the last material labeling.
  • the material often contains rich content that a computer may not be able to identify and read directly, so it needs to be labeled. Labeling processes the material in the material library so that its various features are marked in a computer-readable way.
  • the algorithm model can be roughly divided into two types: the initial algorithm model and the transition algorithm model.
  • the initial algorithm model is the first algorithm model in this material annotation; it roughly determines the algorithm logic of all related material annotation in the future.
  • a transition algorithm model is any algorithm model other than the initial algorithm model. Unlike the initial algorithm model, the transition algorithm model usually changes continuously.
  • determining whether labeling succeeds is the process of separating the first material and the second material in the material. The first and second material in the set of materials to be labeled can be determined by keyword screening or by manual judgment; alternatively, all material can be assumed to be in the same field and labeled directly, and the parts that cannot be labeled directly are separated out as second material of a different field for manual labeling.
  • a training set is generated based on the labeling result.
  • the training set is what makes it possible to generate and update the algorithm model. Since the initial algorithm model has already been generated from the manual labeling of the initial material, subsequent training sets are used to update the algorithm model.
  • the training set may be verified.
  • verification may include: spot-checking a randomly extracted part of the training set; or verifying the entire content of the training set.
  • FIG. 4 is a schematic diagram of a material labeling method according to a third embodiment of the present invention.
  • in this embodiment, the material labeling method is applied to bank business corpora: a corresponding algorithm model is trained and then iteratively updated.
  • the implementation steps S401-S408 are as follows.
  • an algorithm model based on the corpus training set of bank A is generated and embedded in the intelligent labeling system.
  • when the business corpus of the third batch, from bank C, needs to be labeled, the operation in step S404 is repeated; the algorithm model is thereby updated, and the intelligent labeling system is once again optimized and expanded.
  • when the customer service corpus of the fifth batch, from e-commerce, needs to be labeled, the operation in step S404 is repeated; the algorithm model is thereby updated, and the intelligent labeling system is optimized and expanded for the fifth time.
  • the analysis shows the system's automatic labeling ratio and accuracy for different sub-categories in the same field and for the same sub-categories in the same field, from which it can be judged whether a richer corpus needs to be collected to continue training the algorithm model.
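  • One way to carry out the kind of analysis described above is to aggregate the automatic labeling ratio per field across batches. The sketch below is an assumption-laden illustration: the function `auto_ratio_by_field`, the record layout, and all numbers are made up, and the disclosure does not describe how the analysis is computed.

```python
from collections import defaultdict


def auto_ratio_by_field(batches):
    """Aggregate the automatic labeling ratio per field.

    batches: iterable of (field, n_auto_labeled, n_total) records, one per batch.
    Returns {field: auto_labeled / total}, skipping fields with no material.
    """
    totals = defaultdict(lambda: [0, 0])
    for field, n_auto, n_total in batches:
        totals[field][0] += n_auto
        totals[field][1] += n_total
    return {field: auto / total for field, (auto, total) in totals.items() if total}


# Made-up batch statistics, for illustration only
history = [
    ("bank", 60, 100),
    ("bank", 90, 100),
    ("e-commerce", 20, 100),
]
ratios = auto_ratio_by_field(history)
```

  • A low ratio for a field would suggest collecting more corpus from that field before relying on automatic labeling there.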
  • FIG. 5 is a schematic diagram of a composition of a material labeling device according to a fourth embodiment of the present invention.
  • the material labeling device includes a material labeling module 501, a training generating module 502, and an algorithm training module 503.
  • the material labeling module 501 is configured to label the materials in the annotation material set according to the preset algorithm model.
  • the training generation module 502 is configured to generate a training set corresponding to the labeled result based on the result of the annotation.
  • the algorithm training module 503 is configured to update the algorithm model through the training set for the next material annotation.
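  • The division of work among modules 501, 502, and 503 could be sketched as three small classes. This is only one possible reading: the class names, method names, the dictionary used as a stand-in model, and the `manual_label` callback are all hypothetical and not taken from the disclosure.

```python
class MaterialLabelingModule:
    """Stand-in for module 501: labels materials using the current model."""

    def __init__(self, manual_label):
        # manual_label is a hypothetical callback standing in for a human labeler
        self.manual_label = manual_label

    def label(self, model, materials):
        results = []
        for material in materials:
            if material in model:
                results.append((material, model[material], "auto"))
            else:
                results.append((material, self.manual_label(material), "manual"))
        return results


class TrainingGenerationModule:
    """Stand-in for module 502: turns labeling results into a training set."""

    def generate(self, results):
        return [(material, label) for material, label, _source in results]


class AlgorithmTrainingModule:
    """Stand-in for module 503: updates the model from a training set."""

    def update(self, model, training_set):
        updated = dict(model)
        updated.update(training_set)
        return updated


# Made-up materials and labels, for illustration only
human = {"check balance": "account", "report lost card": "card"}.get
model = {"check balance": "account"}

results = MaterialLabelingModule(human).label(model, ["check balance", "report lost card"])
training_set = TrainingGenerationModule().generate(results)
model = AlgorithmTrainingModule().update(model, training_set)
```

  • After one pass the stand-in model also covers the material that had to be labeled manually, which mirrors the update step performed by module 503.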
  • the material is labeled. The material may include corpus in an intelligent question answering system, text in text recognition, and multimedia material such as audio, video, and pictures. Such material often contains rich content that a computer may not be able to identify and read directly, so it needs to be labeled. Labeling processes the material in the material library so that its various features are marked in a computer-recognizable way. For example, the information presented in picture material is labeled in the form of text; in face recognition, facial features are labeled with their pixel coordinates and pixel values in the image; and for corpus in a corpus library, various linguistic features are labeled on the corresponding language components so that the computer can identify and read them.
  • the way of labeling differs according to the application scenario. In principle, the features of the material to be labeled are made computer-identifiable according to a certain logic.
  • the algorithm model is the algorithm referenced when labeling material, and the algorithm model referenced by each subsequent material annotation is the one determined after the previous material annotation.
  • the algorithm model is obtained by analyzing a training set.
  • depending on when it is generated, an algorithm model is roughly one of two types: the initial algorithm model or a transition algorithm model.
  • the initial algorithm model is the first algorithm model in this material annotation; it roughly determines the algorithm logic of all related material annotation in the future.
  • a transition algorithm model is any algorithm model other than the initial algorithm model. Unlike the initial algorithm model, the transition algorithm model usually changes continuously.
  • determining the algorithm model may include: manually labeling the materials in the initial material set to generate an initial training set; the training generation module 502 generates an initial algorithm model based on the initial training set; the material labeling module 501 labels the material in the set of materials to be labeled with reference to the initial algorithm model; the algorithm training module 503 updates the initial algorithm model based on the labeling result to form a transition algorithm model; the material labeling module 501 labels the material in the next set of materials to be labeled with reference to the transition algorithm model; and the algorithm training module 503 then updates the transition algorithm model based on the labeling result. Material annotation and model updating are iterated in this way to determine the algorithm model.
  • the above steps show a general generation manner of the algorithm model.
  • the algorithm model is formed based on the initial algorithm model after several iterations of the annotation update.
  • An alternative way to generate the initial algorithm model is to first label the material in the initial material set by manual annotation.
  • the manual labeling here does not reference any algorithm model; the labeler's own judgment determines how the features of the material are marked.
  • after labeling, the corresponding initial training set is generated from the labeling result.
  • a training set is the set used to train and generate an algorithm model. It usually contains a large number of objects, and training on these objects generates the desired algorithm model.
  • the initial training set is the first training set used to train the algorithm model.
  • training on the initial training set yields the initial algorithm model. Because the initial material set is labeled by hand, verification can also be performed to ensure the reliability of the resulting initial algorithm model. The verification can be carried out by other people, so that the initial algorithm model is determined with reference to multiple checks.
  • once the initial algorithm model is determined, it is used as the algorithm model for the second material annotation, that is, as the reference for the next algorithm model.
  • after material is labeled with reference to the initial algorithm model, the corresponding labeling result and the training set generated from that result are obtained. This is a new training set, different from the initial training set, because the material in the second annotation is often different from the material in the first.
  • the training set obtained by labeling with an algorithm model serves as an update package for that model: it is used to update the initial algorithm model so that the model becomes more detailed.
  • the algorithm model obtained at this time is no longer the initial algorithm model, but a transition algorithm model.
  • multiple transition algorithm models are obtained by updating the algorithm model after each round of material labeling. In other words, each material annotation refers to the algorithm model updated after the previous annotation, and the model updated after this annotation becomes the one referenced by the next. The more iterations, the wider the coverage of the algorithm model, the more types and fields of material it covers, and the higher the accuracy of subsequent material labeling.
  • the material labeling module 501 is configured to label the material in the set of materials to be labeled according to the algorithm model.
  • the labeling here is the next iteration after the labeling of the previous material set. In an embodiment, labeling the material in the set of materials to be labeled according to the algorithm model may include: determining, in that set, a first material that is in the same domain as the algorithm model and a second material that is in a different domain; labeling the first material directly by the algorithm model; and labeling the second material manually.
  • the material to be labeled can thus be roughly divided into two categories: material that can be labeled directly by the algorithm model, which is the first material in the same domain as the model; and material that cannot, which is the second material in a different domain.
  • the first material can be labeled directly because its domain matches the algorithm model. Some of it may nevertheless fall into a category that differs within the same domain and so cannot be labeled directly; that part is labeled manually. The second material cannot be labeled directly because its domain differs from the algorithm model, and it is usually labeled manually from the start.
  • the first material and second material in the set of materials to be labeled can be determined by keyword screening or by manual judgment; alternatively, all material can be assumed to be in the same field and labeled directly, and the parts that cannot be labeled directly are separated out as second material of a different field for manual labeling.
  • FIG. 2 shows a schematic diagram of material annotation. Material A is the initial material; it is manually labeled to generate training set A, and the algorithm model trained on training set A is the initial algorithm model.
  • material B', which is of a different type within the same domain as material B, cannot be labeled directly by the algorithm model and is labeled manually.
  • material C is in a different field from material A, that is, inconsistent with the domain of the algorithm model, and is labeled manually from the start.
  • the algorithm model updated by the training set is the one referenced for the next material labeling.
  • the training generation module 502 is configured to generate a corresponding training set based on the labeling result. The training set is what makes it possible to generate and update the algorithm model. Since the initial algorithm model has already been generated from the manual labeling of the initial material, subsequent training sets are used to update the algorithm model.
  • the algorithm training module 503 is configured to update the algorithm model by means of the training set, for use in the next material annotation.
  • each material annotation generally refers to the algorithm model updated after the previous annotation. The more iterations, the wider the coverage, the less manual intervention is required, and the higher the labeling accuracy.
  • updating the algorithm model by means of the training set may include: verifying the training set generated from the labeling result; and, after verification is complete, updating the algorithm model with the verified training set.
  • verifying the training set may include: spot-checking a randomly extracted part of the training set; or verifying the entire content of the training set.
  • this embodiment provides a material labeling device in which the algorithm model used for material labeling is determined, the material in the set of materials to be labeled is labeled according to the algorithm model, a corresponding training set is generated based on the labeling result, and the algorithm model is updated by the training set for use in the next material annotation.
  • because the algorithm model is updated with each labeling result, the workload of manual labeling is reduced and the consistency and accuracy of labeling are improved.
  • FIG. 6 is a schematic structural diagram of a terminal according to a fifth embodiment of the present disclosure, including:
  • the embodiment further provides a computer readable storage medium, where the computer readable storage medium stores one or more computer programs, and the computer programs can be executed by one or more processors to implement the foregoing material labeling method. The embodiments of the method are not described here again.
  • the modules or steps of the present application can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage medium (ROM/RAM, magnetic disk, optical disk) and executed by a computing device. In some cases, the steps shown or described may be performed in an order different from that described herein, or they may be fabricated separately as individual integrated circuit modules, or several of the modules or steps may be implemented as a single integrated circuit module. Therefore, the application is not limited to any particular combination of hardware and software.

Abstract

Disclosed are a material annotation method and apparatus, a terminal, and a computer readable storage medium. The method comprises: annotating a material in a set of materials to be annotated, according to a preset algorithm model; generating a training set corresponding to the annotation result, based on the annotation result; and updating the algorithm model by means of the training set, for use in a next material annotation.

Description

Material Labeling Method and Device, Terminal, and Computer Readable Storage Medium
This application claims priority to Chinese Patent Application No. 201711148095.1, filed with the Chinese Patent Office on November 17, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of wireless communication technologies, for example, to a material labeling method and device, a terminal, and a computer readable storage medium.
Background
With artificial intelligence developing rapidly, labeling and proofreading many kinds of material still consumes a great deal of time and manpower. Labeling and proofreading rely on analyzing large amounts of training material that has been labeled in advance according to a certain logic. This labeling is usually done manually, which takes considerable manpower and time. Labeling is in effect a process of interpreting the features of the material, and different people may interpret them differently, so material labeling is highly subjective. Labelers also differ in their knowledge structures and grammatical theories, so their labeling results differ and are difficult to unify.
Summary
Embodiments of the present application provide a material labeling method and device, a terminal, and a computer readable storage medium, which aim to solve the problem in the related art that material labeling is time-consuming and labor-intensive and that labeling results are difficult to unify.
An embodiment of the present application provides a material labeling method, including: labeling the material in a set of material to be labeled according to a preset algorithm model; generating, based on the labeling result, a training set corresponding to the labeling result; and updating the preset algorithm model with the training set, for use in the next material labeling.
An embodiment of the present application further provides a material labeling device, including a material labeling module, a training generation module, and an algorithm training module.
The material labeling module is configured to label the material in the set of material to be labeled according to a preset algorithm model.
The training generation module is configured to generate, based on the labeling result, a training set corresponding to the labeling result.
The algorithm training module is configured to update the preset algorithm model with the training set, for use in the next material labeling.
An embodiment of the present application further provides a terminal, including a processor, a memory, and a communication bus. The communication bus is configured to implement connection and communication between the processor and the memory, and the processor is configured to execute a material labeling program stored in the memory to implement the foregoing material labeling method.
An embodiment of the present application further provides a computer readable storage medium storing at least one computer program, the computer program being executable by at least one processor to implement the foregoing material labeling method.
Brief Description of the Drawings
FIG. 1 is a flowchart of a material labeling method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of material labeling according to the first embodiment of the present application;
FIG. 3 is a detailed flowchart of a material labeling method according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of material labeling according to a third embodiment of the present application;
FIG. 5 is a schematic diagram of the composition of a material labeling device according to a fourth embodiment of the present application;
FIG. 6 is a schematic diagram of the composition of a terminal according to a fifth embodiment of the present application.
Detailed Description
First Embodiment
Referring to FIG. 1, FIG. 1 is a flowchart of a material labeling method according to the first embodiment of the present application. The method includes steps S101-S103.
In S101, the material in the set of material to be labeled is labeled according to a preset algorithm model.
In S102, based on the labeling result, a training set corresponding to the labeling result is generated.
In S103, the preset algorithm model is updated with the training set, for use in the next material labeling.
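Steps S101-S103 can be sketched as a single labeling round. The sketch below is only illustrative: the application does not specify what kind of algorithm model is used, so the keyword-voting `MajorityModel` class and the `annotation_round` function are hypothetical stand-ins chosen to keep the example small.

```python
from collections import Counter


class MajorityModel:
    """Hypothetical stand-in for the 'preset algorithm model'.

    It remembers, for every word, how often that word appeared under each label,
    and labels new material by letting its words vote.
    """

    def __init__(self):
        self.counts = {}  # word -> Counter of labels

    def annotate(self, item):
        votes = Counter()
        for word in item.split():
            votes.update(self.counts.get(word, {}))
        # None means the model could not label this item.
        return votes.most_common(1)[0][0] if votes else None

    def update(self, training_set):
        for item, label in training_set:
            for word in item.split():
                self.counts.setdefault(word, Counter())[label] += 1


def annotation_round(model, materials):
    # S101: label the material set with the current model.
    results = [(m, model.annotate(m)) for m in materials]
    # S102: build the training set from the labeling result.
    training_set = [(m, label) for m, label in results if label is not None]
    # S103: update the model so the next round starts from it.
    model.update(training_set)
    return results
```

After one round the model has absorbed the words of every item it managed to label, so a later round can label material that shares only those new words.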
In an embodiment, the material to be labeled may include corpora in an intelligent question answering system, text in text recognition, and multimedia material such as audio, video, and pictures. Such material often contains rich content that a computer may not be able to recognize and read directly, so it needs to be labeled. Labeling means processing the material in a material library so that the various features of the material are marked in a computer-recognizable way. For example, information presented as an image in picture material is labeled in text format; in face recognition, facial features in an image are labeled with pixel coordinates and pixel values; and for a corpus, tags representing linguistic features are attached to the corresponding language components so that a computer can recognize and read them. The labeling method varies with the application scenario, but the principle is the same: based on a certain logic, the features of the material in the set to be labeled are labeled in a computer-recognizable way.
The generated algorithm model is determined. The algorithm model is the algorithm that labeling refers to; each subsequent material labeling refers to the algorithm model determined after the previous material labeling. An algorithm model is obtained by analyzing a training set. Depending on when it is generated, an algorithm model is either an initial algorithm model or a transition algorithm model. The initial algorithm model is the first algorithm model in the material labeling, and it largely determines the algorithm logic of all related material labeling that follows. A transition algorithm model is any algorithm model other than the initial one; unlike the initial algorithm model, the transition algorithm model usually keeps changing.
In an embodiment, determining the generated algorithm model may include: manually labeling the material in an initial material set to generate an initial training set; training an initial algorithm model based on the initial training set; labeling the material in the set to be labeled with reference to the initial algorithm model, and updating the initial algorithm model based on the labeling result to form a transition algorithm model; and labeling the material in the next set to be labeled with reference to the transition algorithm model, and updating the transition algorithm model based on the labeling result. Material labeling and model updating are iterated in this way to determine the algorithm model. These steps show the general way an algorithm model is generated: it starts from the initial algorithm model and is formed after several iterations of labeling and updating.
In an embodiment, the initial algorithm model is generated as follows. First, the material in the initial material set is labeled manually. This manual labeling has no algorithm model to refer to; how the features of the material are labeled is decided by human judgment. After the labeling is completed, an initial training set corresponding to the labeling result is generated. A training set is the set from which an algorithm model is trained; it usually contains a large number of objects, and training on these objects yields the desired algorithm model. The initial training set is the first training set used to train the algorithm model. Training is then performed on the initial training set to obtain the initial algorithm model. Because the initial material set is labeled manually, the result can also be verified to ensure the reliability of the initial algorithm model. The verification can be performed by other people, which amounts to determining the initial algorithm model with reference to multiple verifiers.
Once determined, the initial algorithm model serves as the algorithm model for the second material labeling, that is, as the reference for the next algorithm model. After the material is labeled with reference to the initial algorithm model, the corresponding labeling result is obtained, together with a training set generated from that result. This is a new training set, different from the initial training set, because the material in the second labeling usually differs from the material in the first. The training set obtained in this way is used as an update package for the initial algorithm model, so that the model covers more cases in more detail. The resulting model is no longer the initial algorithm model but a transition algorithm model. There is only one initial algorithm model, but there are usually multiple transition algorithm models, each obtained by labeling material with the current algorithm model and then updating that model with the resulting training set. In other words, each material labeling refers to the algorithm model updated after the previous labeling, and the model updated after the current labeling is in turn the reference for the next labeling. The more iterations, the wider the coverage of the algorithm model, the more material types and fields it involves, and the higher the accuracy of subsequent material labeling.
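The relationship between the single initial algorithm model and the successive transition algorithm models can be illustrated with a short sketch. Everything here is an assumption made for illustration: the model is reduced to a word-to-label table, and `train`, `label`, `iterate`, and the `manual_label` callback are hypothetical names, not part of the application.

```python
def train(model, training_set):
    """Return a new model: the old word -> label table extended by the training set."""
    updated = dict(model)
    for text, label_value in training_set:
        for word in text.split():
            updated.setdefault(word, label_value)  # keep earlier knowledge, add new words
    return updated


def label(model, text):
    """Label text with the first known word, or return None if nothing matches."""
    for word in text.split():
        if word in model:
            return model[word]
    return None


def iterate(initial_training_set, batches, manual_label):
    # history[0] is the initial algorithm model, trained on manually labeled material.
    history = [train({}, initial_training_set)]
    for batch in batches:
        model = history[-1]  # each round refers to the model from the previous round
        training_set = []
        for text in batch:
            predicted = label(model, text)
            if predicted is None:
                predicted = manual_label(text)  # fall back to manual labeling
            training_set.append((text, predicted))
        # Every later entry is a transition algorithm model.
        history.append(train(model, training_set))
    return history
```

Keeping each model as a separate snapshot makes the distinction visible: `history[0]` never changes, while every later entry extends its predecessor.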
In S101, the material in the set of material to be labeled is labeled according to the algorithm model.
The labeling process here is an iterative process in which the material of the previous material set is used to label the material in the set to be labeled. In an embodiment, labeling the material in the set to be labeled according to the algorithm model may include: determining, in the set to be labeled, first material that belongs to the same field as the algorithm model and second material that belongs to a different field; labeling the first material directly with the algorithm model; and labeling the second material manually. The material in the set to be labeled thus falls roughly into two categories: material that can be labeled directly by the algorithm model, which is the first material in the same field as the model, and material that cannot, which is the second material in a different field. Because the first material is in the same field as the algorithm model, most of it can be labeled directly. Some of it, however, may belong to a different category within the same field and therefore cannot be labeled directly; this part of the first material is labeled manually. Because the second material is in a different field from the algorithm model, it cannot be labeled directly and is usually labeled manually. The first and second material are generally identified by the material provider in advance, since the field of the material is usually known before labeling. If the provider does not state the field, the material can be separated by keyword screening or by manual judgment. Alternatively, all material can be assumed to belong to the same field and labeled directly, and the part that cannot be labeled directly is then separated out as second material of a different field and labeled manually.
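Keyword screening, one of the ways mentioned above for separating first material from second material, might look like the following sketch. The function name `split_by_domain`, the `min_hits` threshold, and the example keywords are all illustrative assumptions; the application does not describe how the screening is implemented.

```python
def split_by_domain(materials, domain_keywords, min_hits=1):
    """Split materials into (first, second) by counting domain keywords.

    first:  material assumed to be in the same field as the algorithm model
    second: material assumed to be in a different field (to be labeled manually)
    """
    first, second = [], []
    for text in materials:
        # Count how many words of the text are known domain keywords.
        hits = sum(1 for word in text.lower().split() if word in domain_keywords)
        (first if hits >= min_hits else second).append(text)
    return first, second
```

For example, with the keyword set `{"loan", "interest", "account"}`, the text "Loan interest rate" is routed to the first material and "track my parcel" to the second.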
Referring to FIG. 2, FIG. 2 is a schematic diagram of material labeling. Material A is the initial material; it is labeled manually to generate training set A, and an algorithm model, here the initial algorithm model, is trained from training set A. Material B is in the same field as material A, that is, in the same field as the algorithm model, and can be labeled directly by an automatic labeling device that integrates the algorithm model. Besides the algorithm model, the automatic labeling device also has other components needed for labeling, such as workflow and permission control functions. Material B nevertheless contains material B', which belongs to a different category within the field; it cannot be labeled directly by the algorithm model and is labeled manually. Material C is in a different field from material A, that is, in a different field from the algorithm model, and is labeled manually. Whether the labeling is of material B, material B', or material C, a corresponding training set is generated in the end, and the algorithm model is updated with that training set to serve as the reference for the next material labeling.
In an embodiment, when the material in sets to be labeled is labeled multiple times according to the algorithm model, whether the labeling ability of the algorithm model is up to standard is evaluated according to the proportion of first material in the set to be labeled and/or the accuracy of each labeling. In each material labeling, first material and second material are identified according to whether the field of the material matches the field of the algorithm model, and the proportion of directly labelable first material in the set indicates the labeling ability of the algorithm model. In addition, after each set has been labeled and then verified, the labeling accuracy is known, and this accuracy also indicates the labeling ability of the algorithm model. If the labeling ability turns out to be weak or not up to standard, the algorithm model may need further training on additional material sets so that its labeling ability improves step by step.
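The evaluation described above can be sketched as a small calculation over several labeling rounds. The thresholds (0.8 and 0.9), the record fields, and the function name `labeling_ability` are assumptions made for this example; the application only says that the proportion of first material and/or the accuracy are used, and this sketch happens to require both.

```python
def labeling_ability(rounds, min_auto_ratio=0.8, min_accuracy=0.9):
    """Aggregate several labeling rounds into one ability estimate.

    Each round is a dict with (hypothetical) fields:
      total        - items in the set to be labeled
      auto_labeled - items labeled directly by the model (first material)
      checked      - auto-labeled items that were verified
      correct      - verified items whose label was right
    """
    total = sum(r["total"] for r in rounds)
    auto = sum(r["auto_labeled"] for r in rounds)
    checked = sum(r["checked"] for r in rounds)
    correct = sum(r["correct"] for r in rounds)

    auto_ratio = auto / total if total else 0.0
    accuracy = correct / checked if checked else 0.0
    return {
        "auto_ratio": auto_ratio,
        "accuracy": accuracy,
        # Assumption: both criteria must hold; the source allows either or both.
        "up_to_standard": auto_ratio >= min_auto_ratio and accuracy >= min_accuracy,
    }
```

A result that is not up to standard would suggest collecting more material and running further labeling rounds before relying on the model.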
In S102, a corresponding training set is generated based on the labeling result. Generating the training set makes it possible to generate and to update the algorithm model. Since the initial algorithm model has already been generated from the manual labeling of the initial material, the subsequent training sets are used to update the algorithm model.
In S103, the algorithm model is updated with the training set, for use in the next material labeling. The next material labeling generally refers to the algorithm model updated after the previous material labeling. The more iterations, the wider the fields covered, the fewer manual interventions required, and the higher the labeling accuracy. To ensure the reliability of material labeling, updating the algorithm model with the training set may include: verifying the training set; and, after the verification is completed, updating the algorithm model with the verified training set. In an embodiment, verifying the training set based on the labeling result may include: spot-checking, in which a randomly extracted part of the training set is verified; or full verification, in which all contents of the training set are verified directly.
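The two verification modes, spot-checking and full verification, can be sketched as follows. The function name `verify`, the `check` callback (standing in for a human or automated reviewer), and the default sampling fraction are assumptions for illustration only.

```python
import random


def verify(training_set, check, mode="sample", fraction=0.1, seed=None):
    """Verify a training set either by random spot-check or in full.

    check    - callable returning True if an item passes verification
    mode     - "sample" for a random spot-check, "full" to check everything
    fraction - share of the training set drawn in "sample" mode
    Returns (number of items checked, list of items that failed).
    """
    items = list(training_set)
    if mode == "full":
        to_check = items
    else:
        rng = random.Random(seed)
        # Check at least one item whenever the training set is not empty.
        k = max(1, round(len(items) * fraction)) if items else 0
        to_check = rng.sample(items, k)
    failures = [item for item in to_check if not check(item)]
    return len(to_check), failures
```

Spot-checking trades certainty for speed: it can miss a bad entry that full verification would catch, so the choice depends on how much manual review effort is acceptable.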
This embodiment provides a material labeling method in which the generated algorithm model used for material labeling is determined, the material in the set to be labeled is labeled according to the algorithm model, a corresponding training set is generated based on the labeling result, and the algorithm model is updated with the training set for the next material labeling. By updating the algorithm model with the result of each labeling, this embodiment reduces the workload of manual labeling and improves the consistency and accuracy of labeling.
Second Embodiment
Referring to FIG. 3, FIG. 3 is a detailed flowchart of the material labeling method according to the second embodiment of the present application.
In S301, the set of material to be labeled is determined.
The material may include corpora in an intelligent question answering system, text in text recognition, and multimedia material such as audio, video, and pictures.
In S302, the material in the set to be labeled is labeled based on the algorithm model obtained after the previous material labeling.
Material often contains rich content that a computer may not be able to recognize and read directly, so it needs to be labeled. Labeling means processing the material in a material library so that the various features of the material are marked in a computer-recognizable way.
Depending on the iteration stage, an algorithm model is either an initial algorithm model or a transition algorithm model. The initial algorithm model is the first algorithm model in the material labeling, and it largely determines the algorithm logic of all related material labeling that follows. A transition algorithm model is any algorithm model other than the initial one; unlike the initial algorithm model, the transition algorithm model usually keeps changing.
In S303, it is judged whether the labeling succeeded. If so, the process goes to S304; if not, the process goes to S307.
Judging whether the labeling succeeded is in effect the process of separating the first material from the second material. The first and second material in the set to be labeled can be determined by keyword screening or by manual judgment. Alternatively, all material can be assumed to belong to the same field and labeled directly, and the part that cannot be labeled directly is then separated out as second material of a different field and labeled manually.
In S304, a training set is generated based on the labeling result.
Generating the training set makes it possible to generate and to update the algorithm model. Since the initial algorithm model has already been generated from the manual labeling of the initial material, the subsequent training sets are used to update the algorithm model.
In S305, the training set is verified.
To ensure the reliability of material labeling, the training set may be verified. The verification may include: spot-checking, in which a randomly extracted part of the training set is verified; or full verification, in which all contents of the training set are verified directly.
In S306, the algorithm model is updated with the verified training set, and the process returns to S401.
In S307, the material that failed to be labeled is labeled manually.
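The branch structure of S301-S307 can be sketched as one function. This is a sketch under stated assumptions rather than the flow of FIG. 3 itself: the text does not say where the process continues after S307, so the code assumes, in line with the first embodiment, that manually labeled material also enters the training set. The function `run_flow` and its four callbacks are hypothetical.

```python
def run_flow(materials, model_label, manual_label, verify, update_model):
    """One pass through the labeling flow of the second embodiment.

    model_label  - returns a label, or None when automatic labeling fails
    manual_label - returns the label chosen by a human
    verify       - returns True if an (item, label) pair passes verification
    update_model - receives the verified training set
    """
    training_set = []
    for item in materials:                  # S301: the set of material to be labeled
        label = model_label(item)           # S302: label with the current model
        if label is None:                   # S303: labeling failed
            label = manual_label(item)      # S307: label manually
        # S304: build the training set (assumption: manual labels are included too).
        training_set.append((item, label))
    verified = [pair for pair in training_set if verify(pair)]  # S305: verification
    update_model(verified)                  # S306: update the algorithm model
    return verified
```

Passing the steps in as callbacks keeps the control flow separate from any particular model or review process.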
Third Embodiment
Referring to FIG. 4, FIG. 4 is a schematic diagram of the material labeling method according to the third embodiment of the present application. The material labeling method in this embodiment starts from the business corpus of a bank and achieves automatic labeling by training a corresponding algorithm model and iterating in a loop. It is implemented in steps S401-S408 as follows.
In S401, the first batch, the business corpus of bank A, is determined.
In S402, a corpus training set for bank A is formed.
In S403, an algorithm model is trained from the corpus training set of bank A and embedded in an intelligent labeling system.
In S404, when the second batch, the business corpus of bank B, needs to be labeled, it is judged that the business corpora of bank A and bank B both belong to the banking field. They are different sub-categories of the same field, and most of their business terms and vocabulary are similar, so the business corpus of bank B is input into the intelligent labeling system for automatic labeling. Depending on the size of the corpus to be labeled automatically, deploying a distributed intelligent labeling system may be considered.
In S405, the part of bank B's business corpus that cannot be labeled automatically by the algorithm model in the intelligent labeling system forms corpus X', which is labeled manually; the workload and time of manual labeling are thereby reduced. The results of the manual labeling form the second batch of training sets, and steps S402-S403 are performed again, so that the algorithm model is updated and the intelligent labeling system is optimized and expanded for the second time.
In S406, when the third batch, the business corpus of bank C, needs to be labeled, the operation in step S404 is repeated, so that the algorithm model is updated and the intelligent labeling system is optimized and expanded once again.
In S407, when the fourth batch, the customer service corpus of an e-commerce company, needs to be labeled, it is judged that the e-commerce customer service corpus and the manually labeled bank corpora are not in the same field and that their terms and vocabulary differ greatly, so the e-commerce customer service corpus is labeled manually. This forms the fourth batch of training sets, and steps S402-S403 are performed again, so that the algorithm model is updated and the intelligent labeling system is optimized and expanded for the fourth time.
In S408, when the fifth batch, an e-commerce customer service corpus, needs to be labeled, the operation in step S404 is repeated, so that the algorithm model is updated and the intelligent labeling system is optimized and expanded for the fifth time.
When material of the same sub-category in the same field needs to be labeled, the intelligent labeling system is used to label it automatically. If the intelligent labeling system has gone through multiple rounds of iteration, optimization, and expansion, such a corpus can in theory be labeled fully automatically with an accuracy that meets the standard.
By analyzing the proportion and accuracy of the intelligent labeling system's automatic labeling of new material, both from different sub-categories of the same field and from the same sub-category of the same field, it can be judged whether more and richer corpora need to be collected to continue training the algorithm model.
Fourth Embodiment
Referring to FIG. 5, FIG. 5 is a schematic diagram of the composition of a material labeling device according to the fourth embodiment of the present application. The material labeling device includes a material labeling module 501, a training generation module 502, and an algorithm training module 503.
The material labeling module 501 is configured to label the material in the set of material to be labeled according to a preset algorithm model.
The training generation module 502 is configured to generate, based on the labeling result, a training set corresponding to the labeling result.
The algorithm training module 503 is configured to update the algorithm model with the training set, for use in the next material labeling.
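The division of work among modules 501, 502, and 503 can be illustrated with three small classes that share one model. This is a sketch only: the class names mirror the module names in the text, but their methods, the word-to-label dictionary used as the model, and the `MaterialLabelingDevice` wrapper are assumptions made for the example.

```python
class MaterialLabelingModule:
    """Counterpart of module 501: labels material with the shared model."""

    def __init__(self, model):
        self.model = model  # hypothetical model: dict mapping word -> label

    def label(self, materials):
        results = []
        for text in materials:
            # Use the label of the first known word, or None if no word is known.
            found = next((self.model[w] for w in text.split() if w in self.model), None)
            results.append((text, found))
        return results


class TrainingGenerationModule:
    """Counterpart of module 502: turns labeling results into a training set."""

    def generate(self, results):
        return [(text, label) for text, label in results if label is not None]


class AlgorithmTrainingModule:
    """Counterpart of module 503: updates the shared model from a training set."""

    def __init__(self, model):
        self.model = model

    def update(self, training_set):
        for text, label in training_set:
            for word in text.split():
                self.model.setdefault(word, label)


class MaterialLabelingDevice:
    """Wires the three modules together around one shared model."""

    def __init__(self, model):
        self.labeling = MaterialLabelingModule(model)
        self.generation = TrainingGenerationModule()
        self.training = AlgorithmTrainingModule(model)

    def process(self, materials):
        results = self.labeling.label(materials)
        self.training.update(self.generation.generate(results))
        return results
```

Because modules 501 and 503 hold the same model object, an update made by the training module is immediately visible to the labeling module in the next call to `process`.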
In an embodiment, the materials to be annotated may include corpus in an intelligent question answering system, text in text recognition, and multimedia materials such as audio, video, and pictures. These materials often contain rich content that a computer may not be able to recognize and read directly, so they need to be annotated. Annotation means processing the materials in a material library and marking their various features in a computer-recognizable way. For example, information presented as an image in a picture material may be annotated in text format; in face recognition, facial features in an image may be annotated with pixel coordinates and pixel values; and for corpus in a corpus library, labels representing linguistic features may be attached to the corresponding language components so that a computer can recognize and read them. The annotation method varies with the application scenario, but the principle is the same: based on certain logic, multiple features of the materials in the set of materials to be annotated are annotated in a computer-recognizable way.
The generated algorithm model is determined. The algorithm model is the algorithm that material annotation refers to, and the algorithm model referred to in each subsequent annotation is the one determined after the previous annotation. An algorithm model is obtained by analyzing a training set. Depending on when it is generated, an algorithm model is roughly either an initial algorithm model or a transition algorithm model. The initial algorithm model is the first algorithm model in the annotation process, and it largely determines the algorithmic logic of all related subsequent annotation. A transition algorithm model is any algorithm model other than the initial one; unlike the initial algorithm model, transition algorithm models usually keep changing.
In an embodiment, determining the generated algorithm model may include: manually annotating the materials in an initial material set to generate an initial training set; the training generation module 502 training an initial algorithm model based on the initial training set; the material annotation module 501 annotating the materials in the set of materials to be annotated with reference to the initial algorithm model; the algorithm training module 503 updating the initial algorithm model based on the annotation result to form a transition algorithm model; the material annotation module 501 then annotating the materials in the next set of materials to be annotated with reference to the transition algorithm model, and the algorithm training module 503 updating the transition algorithm model based on the annotation result; and iterating material annotation and model updating in this way to determine the algorithm model. These steps show the general way in which an algorithm model is generated. In an embodiment, the algorithm model is formed from the initial algorithm model after several iterations of annotation and updating.
One optional way of generating the initial algorithm model is as follows. First, the materials in the initial material set are annotated manually. There is no algorithm model to refer to for this manual annotation; how the features of the materials are annotated is decided by human judgment. Then, after annotation is complete, the corresponding initial training set is generated from the annotation result. A training set is the collection used to train an algorithm model; it usually contains a large number of objects, and training on them produces the desired algorithm model. The initial training set is the first training set used to train the algorithm model. Next, training is performed on the initial training set to obtain the initial algorithm model. Because the initial material set is annotated manually, the result can additionally be verified to ensure the reliability of the initial algorithm model. The verification can be performed by other people, so that the initial algorithm model is in effect determined with reference to several verifiers.
Once determined, the initial algorithm model serves as the algorithm model for the second round of material annotation, that is, as the reference algorithm model for the next round. After the materials are annotated with reference to the initial algorithm model, a corresponding annotation result is obtained, together with a training set generated from that result. This is a new training set, distinct from the initial training set, because the materials in the second round usually differ from those in the first. The training set obtained by annotating with the same algorithm model is used as an update package for the initial algorithm model, so that the updated model covers more cases in more detail. The resulting algorithm model is no longer the initial algorithm model but a transition algorithm model. There is only one initial algorithm model, whereas there are usually multiple transition algorithm models. Each transition algorithm model is obtained by annotating materials with the current algorithm model and then updating that model with the resulting training set. In other words, each round of annotation refers to the algorithm model updated after the previous round, and the model updated after the current round is in turn the reference for the next round. The more iterations there are, the wider the coverage of the algorithm model, the more material types and fields it involves, and the higher the accuracy of subsequent material annotation.
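The iteration from the initial algorithm model through successive transition algorithm models can be sketched as a loop. Everything here is an assumption for illustration: the toy model simply remembers the label of each material text it has seen (exact match), and the `human` callback stands in for manual annotation. A real system would train a generalizing model instead.

```python
from typing import Callable, Dict, List, Tuple

Model = Dict[str, str]  # toy model: exact material text -> label


def train(model: Model, training_set: List[Tuple[str, str]]) -> Model:
    # "Updating" the toy model means remembering every (material, label) pair.
    updated = dict(model)
    updated.update(training_set)
    return updated


def run_rounds(initial_materials: List[str],
               batches: List[List[str]],
               human: Callable[[str], str]) -> Tuple[Model, List[int]]:
    # Round 0: manual annotation of the initial set gives the initial model.
    initial_training_set = [(m, human(m)) for m in initial_materials]
    model = train({}, initial_training_set)

    manual_counts: List[int] = []
    for batch in batches:
        training_set: List[Tuple[str, str]] = []
        manual = 0
        for material in batch:
            if material in model:           # the current model can annotate it
                label = model[material]
            else:                           # fall back to manual annotation
                label = human(material)
                manual += 1
            training_set.append((material, label))
        # The updated model is a transition model used in the next round.
        model = train(model, training_set)
        manual_counts.append(manual)
    return model, manual_counts


def toy_human(material: str) -> str:
    return "greeting" if material == "hi" else "other"


final_model, manual_per_round = run_rounds(
    ["hi"], [["hi", "bye", "thanks"], ["hi", "bye", "thanks"]], toy_human)
```

In this toy run, two materials need manual annotation in the first round and none in the second, which mirrors the claim that manual effort falls as the model is updated round after round.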
The material annotation module 501 is configured to annotate the materials in the set of materials to be annotated according to the algorithm model. This annotation is the next iteration after the annotation of the previous material set. In an embodiment, annotating the materials in the set according to the algorithm model may include: determining, in the set of materials to be annotated, first materials that are in the same field as the algorithm model and second materials that are in a different field from the algorithm model; annotating the first materials directly through the algorithm model; and annotating the second materials manually.
The materials in the set fall roughly into two categories: those that can be annotated directly through the algorithm model, namely the first materials, which are in the same field as the algorithm model; and those that cannot, namely the second materials, which are in a different field. Because the first materials are in the same field as the algorithm model, most of them can be annotated directly. Some may nevertheless belong to a different category within that field and so cannot be annotated directly; these can be annotated manually. Because the second materials are in a different field from the algorithm model, they cannot be annotated directly and are usually annotated manually. The first and second materials in the set may be identified by keyword screening or similar means, or by human judgment. Alternatively, all materials may be assumed to be in the same field and annotated directly, with the parts that cannot be annotated directly separated out as second materials of a different field and annotated manually.
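Two of the ways of separating first and second materials mentioned above can be sketched briefly. The function names, the whitespace tokenization, and the example keywords are assumptions; the application names keyword screening only as one possible means and gives no concrete procedure.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def split_by_keywords(materials: Iterable[str],
                      domain_keywords: Set[str]) -> Tuple[List[str], List[str]]:
    """Keyword screening: a material sharing any keyword with the model's
    field is treated as a first material, anything else as a second material."""
    first: List[str] = []
    second: List[str] = []
    for text in materials:
        tokens = set(text.lower().split())
        (first if tokens & domain_keywords else second).append(text)
    return first, second


def split_by_attempt(materials: Iterable[str],
                     try_annotate: Callable[[str], Optional[str]]
                     ) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Alternative: assume everything is in the same field, try to annotate,
    and set aside whatever the model cannot annotate for manual work."""
    annotated: List[Tuple[str, str]] = []
    for_manual: List[str] = []
    for text in materials:
        label = try_annotate(text)
        if label is None:
            for_manual.append(text)
        else:
            annotated.append((text, label))
    return annotated, for_manual


batch = ["Where is my order", "Reset my router"]
first, second = split_by_keywords(batch, {"order", "refund"})
annotated, for_manual = split_by_attempt(
    batch, lambda t: "order_query" if "order" in t.lower() else None)
```

The first function decides up front which materials the model should see, while the second lets the model try everything and routes only its failures to manual annotation.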
Please refer to FIG. 2, which shows a schematic diagram of material annotation. Material A, as the initial material, is annotated manually to generate training set A, and an algorithm model, here the initial algorithm model, is trained on training set A. Material B is in the same field as material A, that is, in the same field as the algorithm model, and can be annotated directly through the algorithm model. Even so, material B contains material B', which belongs to a different category within that field; B' cannot be annotated directly through the algorithm model and is annotated manually. Material C is in a different field from material A, that is, in a different field from the algorithm model, and is annotated manually. Whether the annotation is of material B, material B', or material C, a corresponding training set is generated in the end, and the algorithm model is updated through that training set to serve as the reference algorithm model for the next material annotation.
The training generation module 502 is configured to generate a corresponding training set based on the annotation result. Generating training sets makes it possible to generate and update the algorithm model. Because the initial algorithm model has already been generated from the manual annotation of the initial materials, subsequent training sets are all used to update the algorithm model.
The algorithm training module 503 is configured to update the algorithm model through the training set, for use in the next material annotation. Each round of annotation generally refers to the algorithm model updated after the previous round. The more iterations there are, the wider the field covered, so fewer manual interventions are needed and the annotation accuracy is higher. To ensure the reliability of material annotation, updating the algorithm model through the training set may include: verifying the training set based on the annotation result; and, after verification is complete, updating the algorithm model through the verified training set. In an embodiment, verifying the training set based on the annotation result may include: verifying a randomly selected portion of the training set by spot check; or verifying the entire content of the training set in full.
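The two verification modes, spot check and full check, can be sketched as follows. The function name `verify`, the `review` callback (standing in for a human verifier who returns the confirmed or corrected label), and the default sampling fraction of 0.1 are assumptions; the application does not state how large the spot-check sample should be.

```python
import random
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (material, label)


def verify(training_set: List[Pair],
           review: Callable[[str, str], str],
           mode: str = "spot",
           fraction: float = 0.1,          # assumed sample share for spot checks
           seed: Optional[int] = None) -> List[Pair]:
    """Return a copy of the training set in which every reviewed item carries
    the label returned by the reviewer."""
    items = list(training_set)
    if mode == "full":
        indices = list(range(len(items)))          # check everything
    elif mode == "spot":
        k = min(len(items), max(1, round(len(items) * fraction))) if items else 0
        indices = random.Random(seed).sample(range(len(items)), k)
    else:
        raise ValueError("mode must be 'spot' or 'full'")
    for i in indices:
        text, label = items[i]
        items[i] = (text, review(text, label))
    return items


data = [(f"material {i}", "unchecked") for i in range(20)]
full_checked = verify(data, lambda text, label: "checked", mode="full")
spot_checked = verify(data, lambda text, label: "checked",
                      mode="spot", fraction=0.25, seed=0)
```

Only the returned, verified training set would then be passed on to update the algorithm model; the original list is left unchanged so that the two can be compared.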
This embodiment provides a material annotation apparatus that determines a generated algorithm model used for material annotation, annotates the materials in a set of materials to be annotated according to the algorithm model, generates a corresponding training set based on the annotation result, and updates the algorithm model through the training set for use in the next material annotation. By implementing this embodiment, the algorithm model is updated with the result of each annotation, which reduces the manual annotation workload and improves the consistency and accuracy of annotation.
Fifth embodiment
Please refer to FIG. 6, which is a schematic diagram of the composition of a terminal provided by the fifth embodiment of the present application. The terminal includes:
a processor 601, a memory 602, and a communication bus 603. The communication bus 603 is configured to implement connection and communication between the processor 601 and the memory 602. The processor 601 is configured to execute a material annotation program stored in the memory 602 to implement the foregoing embodiments of the material annotation method, which are not described again here.
In addition, this embodiment further provides a computer readable storage medium storing one or more computer programs. The computer programs can be executed by one or more processors to implement the foregoing embodiments of the material annotation method, which are not described again here.
Obviously, those skilled in the art should understand that the modules or steps of the present application described above may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they may be stored in a storage medium (ROM/RAM, magnetic disk, optical disc) and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that given here, or the modules or steps may each be made into a separate integrated circuit module, or several of them may be made into a single integrated circuit module. Therefore, the present application is not limited to any particular combination of hardware and software.

Claims (13)

  1. A material annotation method, comprising:
    annotating materials in a set of materials to be annotated according to a preset algorithm model;
    generating, based on a result of the annotation, a training set corresponding to the result of the annotation; and
    updating the preset algorithm model through the training set, for use in a next material annotation.
  2. The method of claim 1, wherein annotating the materials in the set of materials to be annotated according to the preset algorithm model comprises:
    determining, in the set of materials to be annotated, first materials that are in the same field as the preset algorithm model and second materials that are in a different field from the preset algorithm model;
    annotating the first materials directly through the preset algorithm model; and
    annotating the second materials manually.
  3. The method of claim 2, wherein annotating the first materials directly through the preset algorithm model comprises:
    manually annotating the part of the first materials that cannot be annotated through the preset algorithm model.
  4. The method of any one of claims 1-3, further comprising:
    when materials in sets of materials to be annotated are annotated multiple times according to the preset algorithm model, evaluating, according to the accuracy of each annotation, whether the ability of the preset algorithm model to annotate the materials to be annotated meets the standard.
  5. The method of any one of claims 1-3, wherein updating the preset algorithm model through the training set comprises:
    verifying the training set; and
    after the verification is complete, updating the preset algorithm model through the verified training set.
  6. The method of claim 5, wherein verifying the training set comprises:
    verifying a randomly selected portion of the training set by spot check; or verifying the entire content of the training set in full.
  7. A material annotation apparatus, comprising:
    a material annotation module, configured to annotate materials in a set of materials to be annotated according to a preset algorithm model;
    a training generation module, configured to generate, based on a result of the annotation, a training set corresponding to the result of the annotation; and
    an algorithm training module, configured to update the preset algorithm model through the training set, for use in a next material annotation.
  8. The apparatus of claim 7, wherein the material annotation module is further configured to:
    determine, in the set of materials to be annotated, first materials that are in the same field as the preset algorithm model and second materials that are in a different field from the preset algorithm model;
    annotate the first materials directly through the preset algorithm model; and
    annotate the second materials manually.
  9. The apparatus of claim 8, wherein the material annotation module is further configured to:
    manually annotate the part of the first materials that cannot be annotated through the preset algorithm model.
  10. The apparatus of any one of claims 7-9, wherein the algorithm training module is configured to update the preset algorithm model through the training set by:
    verifying the training set; and
    after the verification is complete, updating the preset algorithm model through the verified training set.
  11. The apparatus of claim 10, wherein the algorithm training module is configured to verify the training set by:
    verifying a randomly selected portion of the training set by spot check; or verifying the entire content of the training set in full.
  12. A terminal, comprising a processor, a memory, and a communication bus, wherein the communication bus is configured to implement connection and communication between the processor and the memory, and the processor is configured to execute a material annotation program stored in the memory to implement the material annotation method of any one of claims 1-6.
  13. A computer readable storage medium storing at least one computer program, wherein the computer program is executable by at least one processor to implement the material annotation method of any one of claims 1-6.
PCT/CN2018/109774 2017-11-17 2018-10-11 Material annotation method and apparatus, terminal, and computer readable storage medium WO2019095899A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711148095.1A CN109800776A (en) 2017-11-17 2017-11-17 Material mask method, device, terminal and computer readable storage medium
CN201711148095.1 2017-11-17

Publications (1)

Publication Number Publication Date
WO2019095899A1 true WO2019095899A1 (en) 2019-05-23

Family

ID=66540040

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109774 WO2019095899A1 (en) 2017-11-17 2018-10-11 Material annotation method and apparatus, terminal, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109800776A (en)
WO (1) WO2019095899A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859862A (en) * 2020-07-22 2020-10-30 海尔优家智能科技(北京)有限公司 Text data labeling method and device, storage medium and electronic device
CN112949674A (en) * 2020-08-22 2021-06-11 上海昌投网络科技有限公司 Multi-model fused corpus generation method and device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751224B (en) * 2019-10-25 2022-08-05 Oppo广东移动通信有限公司 Training method of video classification model, video classification method, device and equipment
CN113380384A (en) * 2021-05-01 2021-09-10 首都医科大学宣武医院 Method for training medical image labeling model through man-machine cooperation, labeling method and labeling system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770453A (en) * 2008-12-31 2010-07-07 华建机器翻译有限公司 Chinese text coreference resolution method based on domain ontology through being combined with machine learning model
CN102163285A (en) * 2011-03-09 2011-08-24 北京航空航天大学 Cross-domain video semantic concept detection method based on active learning
CN104142912A (en) * 2013-05-07 2014-11-12 百度在线网络技术(北京)有限公司 Accurate corpus category marking method and device
CN106844348A (en) * 2017-02-13 2017-06-13 哈尔滨工业大学 A kind of Chinese sentence functional component analysis method
CN106991085A (en) * 2017-04-01 2017-07-28 中国工商银行股份有限公司 The abbreviation generation method and device of a kind of entity

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8250015B2 (en) * 2009-04-07 2012-08-21 Microsoft Corporation Generating implicit labels and training a tagging model using such labels
CN101853400B (en) * 2010-05-20 2012-09-26 武汉大学 Multiclass image classification method based on active learning and semi-supervised learning
WO2014183275A1 (en) * 2013-05-15 2014-11-20 中国科学院自动化研究所 Detection method and system for locally deformable object based on on-line learning
CN103617429A (en) * 2013-12-16 2014-03-05 苏州大学 Sorting method and system for active learning
CN105117429B (en) * 2015-08-05 2018-11-23 广东工业大学 Scene image mask method based on Active Learning and multi-tag multi-instance learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859862A (en) * 2020-07-22 2020-10-30 海尔优家智能科技(北京)有限公司 Text data labeling method and device, storage medium and electronic device
CN111859862B (en) * 2020-07-22 2024-03-22 海尔优家智能科技(北京)有限公司 Text data labeling method and device, storage medium and electronic device
CN112949674A (en) * 2020-08-22 2021-06-11 上海昌投网络科技有限公司 Multi-model fused corpus generation method and device

Also Published As

Publication number Publication date
CN109800776A (en) 2019-05-24

Similar Documents

Publication Publication Date Title
WO2019095899A1 (en) Material annotation method and apparatus, terminal, and computer readable storage medium
CN108052577B (en) Universal text content mining method, device, server and storage medium
WO2021043085A1 (en) Method and apparatus for recognizing named entity, computer device, and storage medium
US9842043B2 (en) System and method to implement an electronic document based automated testing of a software application
CN111177569A (en) Recommendation processing method, device and equipment based on artificial intelligence
US10713306B2 (en) Content pattern based automatic document classification
CN111061867B (en) Text generation method, equipment, storage medium and device based on quality perception
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN111694937A (en) Interviewing method and device based on artificial intelligence, computer equipment and storage medium
WO2023065746A1 (en) Algorithm application element generation method and apparatus, electronic device, computer program product and computer readable storage medium
CN109033220B (en) Automatic selection method, system, equipment and storage medium of labeled data
CN109375943A (en) A kind of program file generation method and device
CN114240101A (en) Risk identification model verification method, device and equipment
CN109858024B (en) Word2 vec-based room source word vector training method and device
CN114639152A (en) Multi-modal voice interaction method, device, equipment and medium based on face recognition
CN111859862A (en) Text data labeling method and device, storage medium and electronic device
CN109766089B (en) Code generation method and device based on dynamic diagram, electronic equipment and storage medium
CN115828022A (en) Data identification method, federal training model, device and equipment
KR20210009885A (en) Method, device and computer readable storage medium for automatically generating content regarding offline object
CN113032257B (en) Automated testing method, apparatus, computer system, and readable storage medium
CN115345600A (en) RPA flow generation method and device
CN110515970B (en) Service processing method, device, computer equipment and storage medium
CN111488737B (en) Text recognition method, device and equipment
US11064268B2 (en) Media content metadata mapping
CN112364640A (en) Entity noun linking method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18878729

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21/09/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18878729

Country of ref document: EP

Kind code of ref document: A1