CN117649567B - Data labeling method, device, computer equipment and storage medium - Google Patents

Data labeling method, device, computer equipment and storage medium

Info

Publication number
CN117649567B
Authority
CN
China
Prior art keywords
data
marked
matching
labeling
categories
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410124620.XA
Other languages
Chinese (zh)
Other versions
CN117649567A (en
Inventor
王继天
冯帅
周梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410124620.XA
Publication of CN117649567A
Application granted
Publication of CN117649567B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a data labeling method, apparatus, computer device, storage medium, and computer program product. The method involves artificial intelligence techniques and includes: acquiring at least one piece of data to be labeled from a data set to be labeled, and determining at least two candidate categories; for each piece of data to be labeled, determining, from the at least two candidate categories, at least two preliminary matching categories that match the data to be labeled; when the at least two preliminary matching categories include a reference category, labeling the data to be labeled in question based on the at least two preliminary matching categories to obtain its labeled data, where the reference category is determined by counting the labeled data in a labeled data set; and updating the labeled data set with the obtained labeled data, and continuing to label until all data to be labeled in the data set to be labeled has been labeled. With this method, a balanced category distribution of the labeled data can be ensured.

Description

Data labeling method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular, to a data labeling method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer technology, artificial intelligence (AI) technology, which covers research on robots, language recognition, image recognition, natural language processing, and expert systems, has been widely applied. Computer tasks built on artificial intelligence technology, such as data classification, target detection, intent recognition, and semantic understanding, can effectively improve the processing efficiency and accuracy of the corresponding tasks.
Building such a computer task usually requires a large amount of labeled sample data for model learning, so that the model can carry out the intended task. However, when sample data is labeled, the amount of sample data often differs greatly across categories, so the category distribution of the labeled sample data is unbalanced and the training effect suffers.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data labeling method, apparatus, computer device, computer-readable storage medium, and computer program product that can ensure uniform distribution of categories of labeling data.
In a first aspect, the present application provides a data labeling method. The method comprises the following steps:
acquiring at least one data to be marked from a data set to be marked, and determining at least two candidate categories;
for each piece of data to be marked in the at least one data to be marked, determining at least two preliminary matching categories matched with that data from the at least two candidate categories;
when at least two kinds of preliminary matching categories comprise reference categories, marking the targeted data to be marked based on the at least two kinds of preliminary matching categories, and obtaining marked data of the targeted data to be marked; the reference category is determined by counting the annotated data in the annotated data set;
updating the marked data set through the marked data of each of the at least one data to be marked, and continuing marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
In a second aspect, the application further provides a data labeling device. The device comprises:
the data acquisition module is used for acquiring at least one datum to be marked from the datum set to be marked and determining at least two candidate categories;
the primary matching module is used for determining at least two primary matching categories matched with the data to be marked from at least two candidate categories aiming at each data to be marked in at least one data to be marked;
the marking triggering module is used for marking the targeted data to be marked based on at least two preliminary matching categories when the at least two preliminary matching categories comprise reference categories, so as to obtain marked data of the targeted data to be marked; the reference category is determined by counting the annotated data in the annotated data set;
and the data set updating module is used for updating the marked data set through the marked data of each of the at least one data to be marked, and continuing marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the data labeling method when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the above data annotation method.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above data annotation method.
According to the data labeling method, apparatus, computer device, storage medium, and computer program product described above, when the at least two preliminary matching categories matched by a piece of data to be labeled include a reference category determined by counting the labeled data in the labeled data set, that piece of data is labeled based on the at least two preliminary matching categories to obtain its labeled data. The labeled data set is then updated with the labeled data of each piece of data to be labeled, and labeling of the data in the data set to be labeled continues until it is complete. Because labeling is triggered only when the preliminary matching categories include a reference category, the reference category obtained by counting the labeled data in the labeled data set steers which data is labeled. Because labeling is repeated after the labeled data set is updated, the amount of labeled data in each category can be adjusted from round to round. This keeps the category distribution of the labeled data balanced and thereby improves the effect of training on the labeled data.
Drawings
In order to describe the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed for describing the embodiments or the related art are briefly introduced below. The drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is an application environment diagram of a data annotation method in one embodiment;
FIG. 2 is a flow chart of a method of labeling data in one embodiment;
FIG. 3 is a schematic flow chart of data annotation in another embodiment;
FIG. 4 is a flow diagram of a process for obtaining a text matching model in one embodiment;
FIG. 5 is a flow diagram of a text labeling method in one embodiment;
FIG. 6 is a flow diagram of text matching in one embodiment;
FIG. 7 is a flow diagram of sample screening based on active learning in one embodiment;
FIG. 8 is a flow diagram of large model labeling in one embodiment;
FIG. 9 is a flow diagram of determining a large model and prompt words in one embodiment;
FIG. 10 is a block diagram of a data tagging device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, and mechatronics. A pre-training model (PTM), also called a large model or foundation model, is a deep neural network (DNN) with a large number of parameters that is trained on massive unlabeled data. The PTM uses the function approximation capability of the large-parameter DNN to extract common features from the data, and is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT), and prompt-tuning. A pre-training model can therefore achieve good results in few-shot or zero-shot scenarios. According to the data modality processed, PTMs can be classified into language models (ELMO, BERT, GPT), visual models (Swin Transformer, ViT, V-MoE), speech models (VALL-E), multi-modal models (ViBERT, CLIP, Flamingo, Gato), and so on, where a multi-modal model is a model that builds feature representations of two or more data modalities. The pre-training model is an important tool for producing artificial intelligence generated content (AIGC), and can also serve as a general interface connecting multiple task-specific models. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is the science of how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to recognize, track, and measure targets, and further processes the resulting images so that they are better suited for human observation or for transmission to instruments for inspection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Large model technology has brought important change to the development of computer vision: pre-trained vision models such as Swin Transformer, ViT, V-MoE, and MAE can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (text-to-speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is expected to become one of the most favored modes of human-computer interaction. Large model technology has also transformed speech technology: pre-training models that use the Transformer architecture, such as WavLM and UniSpeech, have strong generalization and universality and can perform well across speech processing tasks.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Because natural language processing deals with natural language, that is, the language people use every day, it is closely related to linguistics; it also involves computer science and mathematics. The pre-training model, an important technique for model training in the artificial intelligence field, developed from the large language model (LLM) in the NLP field. Through fine-tuning, a large language model can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering by robots, and knowledge graphs.
Machine learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. The pre-training model is the latest development in deep learning and integrates the above techniques.
With the research and progress of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, digital twins, virtual humans, robots, artificial intelligence generated content (AIGC), conversational interaction, smart healthcare, smart customer service, and game AI. It is believed that, as technology develops, artificial intelligence will be applied in more fields and will deliver increasing value.
The solutions provided by the embodiments of the present application involve artificial intelligence technologies such as computer vision, speech technology, natural language processing, and machine learning, and use them to label data of various modalities with categories, as described in the following embodiments.
The data labeling method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process. The data storage system may be provided separately, may be integrated on the server 104, or may be placed on a cloud or another server. The user may collect various data to be labeled through the terminal 102, for example images, texts, audio, and videos to be labeled. The user may set at least two candidate categories for the data to be labeled through the terminal 102. The number and types of candidate categories may be set according to actual needs, for example according to the task that the labeled data is intended for. For example, when labeling images for a cat/dog recognition task, the candidate categories may include a cat category and a dog category. The terminal 102 may send the collected data set to be labeled, which is composed of the pieces of data to be labeled, to the server 104 through the network. The server 104 obtains at least one piece of data to be labeled from the data set to be labeled and, for each piece obtained, determines from the at least two candidate categories at least two preliminary matching categories that match it. When the at least two preliminary matching categories of a piece of data to be labeled include the reference category determined by counting the labeled data in the labeled data set, the server 104 labels that piece of data based on the at least two preliminary matching categories to obtain the corresponding labeled data.
After labeling each piece of data to be labeled obtained through traversal, the server 104 updates the labeled data set through the labeled data of each piece of data to be labeled, and continues labeling the data to be labeled in the data set to be labeled until the data labeling is completed. After labeling is complete, the server 104 may feed back the obtained labeled dataset to the terminal 102.
The terminal 102 may be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster or cloud server composed of a plurality of servers.
In an exemplary embodiment, as shown in FIG. 2, a data labeling method is provided. The method is performed by a computer device, which may be a terminal or a server, or may be performed by a terminal and a server together. In this embodiment, the method is described as applied to the server in FIG. 1 and includes the following steps 202 to 208.
Step 202, at least one data to be marked is obtained from the data set to be marked, and at least two candidate categories are determined.
The data set to be labeled is a set composed of data to be labeled and may include at least one piece of such data. Data to be labeled is data that still needs a label, for example a category label. Depending on the application scenario, the data to be labeled may be of different modalities, for example at least one of image, audio/video, and text. The candidate categories are the categories available for labeling each piece of data: when a piece of data is labeled, a target category is selected from the candidate categories. For example, the data to be labeled may be of the image modality, that is, the data set to be labeled may include multiple pictures to be labeled, and the candidate categories for each picture may include five categories: category A, category B, category C, category D, and category E. When a picture in the data set is labeled, its label is selected from these five categories; for example, picture 1 may be labeled as category E.
Specifically, the server may acquire a data set to be labeled that includes the pieces of data that need labeling, and acquires at least one piece of data to be labeled from it. The server may acquire only part of the unlabeled data in the data set, or may directly acquire all of it. The server also determines the candidate categories used for labeling. There are at least two candidate categories: when a piece of data is labeled, a target category is selected from the at least two candidate categories and the corresponding category label is added to the data, producing labeled data that can be used to train a model for a computer task.
Step 204, for each data to be annotated in the at least one data to be annotated, determining at least two preliminary matching categories matching the data to be annotated from at least two candidate categories.
A preliminary matching category is a candidate category that matches the data to be labeled in question. It may be determined preliminarily by matching that data against each of the candidate categories.
Optionally, the server may traverse the acquired data to be labeled. For each piece of data, the server determines the matching preliminary matching categories from the at least two candidate categories. In a specific implementation, the server may match the data to be labeled against each candidate category, for example by feature similarity, and determine the preliminary matching categories from the matching result. For example, the candidate categories whose feature similarity exceeds a similarity threshold may be taken as the preliminary matching categories, or the N candidate categories with the highest feature similarity may be taken, where N is an integer not less than 2. There are at least two preliminary matching categories, and their number can be set flexibly according to actual needs, for example 2, 3, or 4. After the traversal, each piece of data to be labeled has at least two preliminary matching categories. In some applications, the server may use a pre-trained matching model based on an artificial neural network algorithm to match the data to be labeled against each candidate category and so determine the preliminary matching categories.
For example, the matching model may be trained with at least one of a recurrent neural network (RNN), a convolutional neural network (CNN), an attention-based network (Transformer), a multilayer perceptron (MLP), a long short-term memory network (LSTM), or a gated recurrent unit (GRU).
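The two selection rules described for step 204 (a similarity threshold, or the N most similar categories) can be sketched as follows. This is a minimal illustration only: the function names, the use of cosine similarity over plain feature vectors, and the fallback to the top N when the threshold leaves fewer than two categories are all assumptions, not details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def preliminary_categories(item_vec, category_vecs, top_n=2, threshold=None):
    """Return the candidate categories that preliminarily match one item.

    category_vecs maps a category name to a feature vector (hypothetical
    representation). With a threshold, categories scoring above it are kept;
    otherwise the top_n most similar categories are returned.
    """
    scores = {name: cosine(item_vec, vec) for name, vec in category_vecs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    if threshold is not None:
        above = [name for name in ranked if scores[name] > threshold]
        # Assumption: the method needs at least two categories, so fall back
        # to the top_n rule when the threshold is too strict.
        if len(above) >= 2:
            return above
    return ranked[:top_n]


category_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fox": [0.7, 0.7]}
matches = preliminary_categories([0.9, 0.2], category_vecs, top_n=2)
```

In practice the feature vectors would come from a trained matching model rather than hand-written lists.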
Step 206, when at least two kinds of preliminary matching categories include reference categories, labeling the targeted data to be labeled based on the at least two kinds of preliminary matching categories, so as to obtain labeled data of the targeted data to be labeled; the reference category is determined by counting the annotated data in the annotated data set.
The reference category is a category determined by counting the labeled data in the labeled data set, and it belongs to the candidate categories. The labeled data set records labeled data, which is obtained by labeling data from the data set to be labeled and carries the assigned category label. To determine the reference category, the amount of labeled data under each category may be counted and the reference category chosen based on those amounts; for example, the n categories with the least labeled data may be taken as the reference categories.
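As a rough sketch of the counting rule just described, the n categories with the least labeled data can be found with a counter. The function name, the `(item, category)` pair representation, and the tie-breaking by candidate order are assumptions made for illustration.

```python
from collections import Counter


def reference_categories(labeled, candidates, n=2):
    """Return the n candidate categories with the fewest labeled items.

    labeled is a list of (item, category) pairs (hypothetical representation).
    Categories with no labeled data count as zero. Ties keep the order of
    candidates because sorted() is stable.
    """
    counts = Counter(category for _, category in labeled)
    return sorted(candidates, key=lambda c: counts[c])[:n]


labeled = [("x1", "A"), ("x2", "A"), ("x3", "B"),
           ("x4", "C"), ("x5", "C"), ("x6", "C")]
refs = reference_categories(labeled, ["A", "B", "C", "D", "E"], n=2)
```

Here categories D and E have no labeled data yet, so they become the reference categories for the next round.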
For example, the server may obtain the reference category determined by counting the labeled data in the labeled data set. In a specific implementation, the server may obtain the labeled data set and compute statistics over the labeled data recorded in it, for example the distribution of the categories of the labeled data, and take the n categories with the least labeled data as the reference categories. The server may then traverse the acquired data to be labeled. For each piece of data, the server compares its at least two preliminary matching categories with the reference categories to determine whether they include a reference category. If they do, the server treats that piece as data to be labeled in the current round and labels it based on its at least two preliminary matching categories to obtain the corresponding labeled data. In a specific application, the server may select a target category from the at least two preliminary matching categories and label the data with it. The labeled data is the result of this labeling: once the label of the selected category is added to the data to be labeled, the data becomes labeled data and its labeling is complete.
In some applications, the server may label the data to be labeled with a data labeling model trained in advance based on an artificial neural network algorithm, to obtain the corresponding labeled data.
In some embodiments, when the at least two preliminary matching categories of a piece of data to be labeled do not include the reference category, the server may decide not to label that piece in the current round and move on to the labeling decision for the next piece of data to be labeled.
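The screening decision in step 206 amounts to a set-intersection test per item. The sketch below uses hypothetical item identifiers and category names; it only separates the items to label in the current round from those deferred to a later round.

```python
def select_for_labeling(prelim_by_item, reference):
    """Split items into those labeled this round and those deferred.

    prelim_by_item maps an item id to its preliminary matching categories
    (hypothetical representation). An item is selected when at least one of
    its preliminary matching categories is a reference category.
    """
    reference = set(reference)
    selected, deferred = [], []
    for item_id, prelim in prelim_by_item.items():
        if reference & set(prelim):
            selected.append(item_id)
        else:
            deferred.append(item_id)
    return selected, deferred


prelim_by_item = {"d1": ["A", "C"], "d2": ["B", "D"], "d3": ["C", "E"]}
selected, deferred = select_for_labeling(prelim_by_item, reference=["A", "E"])
```

Deferred items stay in the data set to be labeled and are reconsidered once the reference categories have been recomputed.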
Step 208, updating the marked data set through the marked data of each of the at least one data to be marked, and continuing marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
Specifically, after the server traverses the acquired data to be labeled and labels each qualifying piece, it obtains labeled data for those pieces whose at least two preliminary matching categories include a reference category, and it updates the labeled data set, for example by adding the labeled data to it. The server then continues to label the data in the data set to be labeled until all of it has been labeled, at which point the labeling process ends. In a specific implementation, after updating the labeled data set, the server may remove from the data set to be labeled the data that was labeled in the current round, so that the data set to be labeled contains only the remaining unlabeled data for the next round. In some applications, the server may add the labeled data obtained in the current round to the labeled data set, count the labeled data in the updated labeled data set to determine the reference category for the next round, and return to the step of acquiring at least one piece of data to be labeled from the data set to be labeled to perform the next round of labeling.
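Steps 202 to 208 together form an iterative loop, which might be sketched as below. The `match` and `choose` callbacks stand in for the matching model and the labeling model and are hypothetical, as is the function name. The text does not say what happens when no remaining item matches a reference category in a round; widening the number of reference categories by one in that case is an assumption added here so that the sketch always terminates.

```python
from collections import Counter


def label_dataset(unlabeled, candidates, match, choose, n_ref=1):
    """Label every item round by round, steering rounds toward scarce categories.

    match(item)          -> preliminary matching categories (hypothetical callback)
    choose(item, prelim) -> target category chosen from prelim (hypothetical callback)
    Items must be hashable. Returns a list of (item, category) pairs.
    """
    pending = list(unlabeled)
    labeled = []
    k = n_ref  # number of reference categories used in the current round
    while pending:
        # Recompute the reference categories from the labeled data so far.
        counts = Counter(category for _, category in labeled)
        reference = set(sorted(candidates, key=lambda c: counts[c])[:k])

        # Label only the items whose preliminary categories hit a reference category.
        newly = []
        for item in pending:
            prelim = match(item)
            if reference & set(prelim):
                newly.append((item, choose(item, prelim)))

        if newly:
            labeled.extend(newly)
            done = {item for item, _ in newly}
            pending = [item for item in pending if item not in done]
            k = n_ref
        elif k < len(candidates):
            k += 1  # assumption: widen the reference set when a round labels nothing
        else:
            raise ValueError("no pending item matches any candidate category")
    return labeled


table = {"x1": ["A", "B"], "x2": ["A", "B"], "x3": ["B", "C"], "x4": ["A", "C"]}
result = label_dataset(table, ["A", "B", "C"], match=table.get,
                       choose=lambda item, prelim: prelim[0])
```

In this toy run `choose` simply takes the first preliminary category; a real labeling model would decide the target category from the item's content.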
In a specific application, as shown in fig. 3, a server obtains a data set to be marked, where the data set to be marked includes a plurality of data D to be marked, the server obtains n data D to be marked from the data set to be marked, and obtains data D1, D2, D3 … … Dn to be marked, and for each data D to be marked, the server matches each data D with m candidate categories, so as to determine 2 preliminary matching categories corresponding to the matching. Specifically, the preliminary matching categories of the data to be marked D1 include category 1 and category 3, the preliminary matching categories of the data to be marked D2 include category 3 and category 4, the preliminary matching categories of the data to be marked D3 include category 1 and category 3, and the preliminary matching categories of the data to be marked Dn include category 2 and category 5. The server screens each datum D to be marked according to reference categories, wherein the reference categories are obtained through statistics according to marked data in marked data sets, and the reference categories specifically comprise category 1 and category 3. Through screening, it can be determined that the respective preliminary matching categories of the data to be marked D1 and the data to be marked D2 comprise reference categories, the preliminary matching category of the data to be marked D1 comprises category 1 and category 3 in the reference categories, and the preliminary matching category of the data to be marked D2 comprises category 3 in the reference categories. The server can respectively label the data D1 to be labeled and the data D2 to be labeled, and label the data according to respective preliminary matching categories, so that corresponding labeled data is obtained, the category label of the data D1 to be labeled is category 1, and the category label of the data D2 to be labeled is category 4. 
The server may add each piece of labeled data obtained to the labeled data set to obtain an updated labeled data set; specifically, it may add the labeled data D1 and the labeled data D2 to the labeled data set, which completes the current round of labeling, a round in which data labeling is completed for D1 and D2. After updating the labeled data set, the server can continue to label the data to be labeled in the data set to be labeled; specifically, it continues with the data other than D1 and D2 until all the data to be labeled in the data set to be labeled has been labeled.
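The round-based flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `reference_categories` and `run_labeling`, the batch size, the `max_rounds` guard, the rule that the `k` least-populated categories are the reference categories, and the choice to return unselected data to the pending pool are all introduced for the example.

```python
from collections import Counter


def reference_categories(labeled, candidates, k=2):
    """Assumed rule: the k candidate categories with the fewest labeled items."""
    counts = Counter(label for _, label in labeled)
    return set(sorted(candidates, key=lambda c: (counts[c], c))[:k])


def run_labeling(pending, labeled, candidates, match, annotate,
                 batch_size=4, max_rounds=100):
    """Label in rounds until nothing is pending or max_rounds is reached."""
    pending, labeled = list(pending), list(labeled)
    for _ in range(max_rounds):
        if not pending:
            break
        # Recount the labeled set before every round.
        refs = reference_categories(labeled, candidates)
        batch, pending = pending[:batch_size], pending[batch_size:]
        for item in batch:
            prelim = match(item)          # at least two preliminary categories
            if refs & set(prelim):        # a reference category is included
                labeled.append((item, annotate(item, prelim)))
            else:
                pending.append(item)      # revisit in a later round
    return labeled, pending


# Hypothetical stand-ins for the matching step and the labeling step.
PRELIM = {"D1": ["c1", "c3"], "D2": ["c3", "c4"], "Dn": ["c2", "c5"]}
seed = [("s1", "c2"), ("s2", "c2"), ("s3", "c5"), ("s4", "c4")]
done, left = run_labeling(
    ["D1", "D2", "Dn"], seed, ["c1", "c2", "c3", "c4", "c5"],
    match=PRELIM.__getitem__, annotate=lambda item, prelim: prelim[0],
)
```

In this sketch `Dn` stays pending because neither of its preliminary categories is a reference category; the non-minority labeling condition of a later embodiment is what covers such data.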
In the above data labeling method, when the at least two preliminary matching categories of a piece of data to be labeled obtained from the data set to be labeled include a reference category determined by counting the labeled data in the labeled data set, the data is labeled based on the at least two preliminary matching categories to obtain the corresponding labeled data; the labeled data set is updated with the labeled data obtained, and labeling of the data in the data set to be labeled continues until data labeling is completed. Because labeling is gated on whether the preliminary matching categories include a reference category, the statistics of the labeled data set are used to adjust which data is labeled, and because labeling is repeated after the labeled data set is updated, the amount of labeled data under the different categories can be adjusted. The category distribution of the labeled data is thereby kept balanced, which enhances the effect of training based on the labeled data.
In an exemplary embodiment, when at least two kinds of preliminary matching categories include a reference category, labeling the data to be labeled based on at least two kinds of preliminary matching categories, to obtain labeled data of the data to be labeled, including: acquiring a reference category corresponding to the marked data set, and determining a marking triggering condition of the reference category; and when the at least two preliminary matching categories comprise the reference category and the marking triggering condition is met, marking the targeted data to be marked based on the at least two preliminary matching categories, and obtaining marked data of the targeted data to be marked.
The reference category is obtained by counting the labeled data in the labeled data set. Specifically, the data amount of the labeled data corresponding to each candidate category can be counted so that the reference category is determined from the candidate categories; for example, a preset number of categories with the smallest amounts of labeled data can be determined as reference categories. The labeling trigger condition is used to decide whether labeling is triggered for a piece of data to be labeled, and it can be set for the reference category according to actual needs. For example, the labeling trigger condition may include that the data amount of the labeled data corresponding to the reference category is smaller than a quantity threshold, or that the matching quantized value corresponding to the reference category is larger than a preset matching value threshold, where the matching quantized value is obtained by matching the data to be labeled with the reference category.
Optionally, the server may obtain the reference category corresponding to the labeled data set. Specifically, the server may query the labeled data set corresponding to the data set to be labeled; the labeled data set records the labeling result for each piece of data in the data set to be labeled, that is, once data labeling is completed for a piece of data to be labeled, the resulting labeled data is added to the labeled data set. The server may count the labeled data in the labeled data set in advance, specifically the data amounts of the labeled data under the different candidate categories, so as to determine the reference category from the candidate categories according to the statistical result. After determining the reference category, the server may determine the labeling trigger condition corresponding to the reference category; when that condition is satisfied, labeling can be triggered for the data to be labeled. The server may compare each of the at least two preliminary matching categories of the data to be labeled with the reference category to determine whether the preliminary matching categories include the reference category. When they do, the server further determines whether the labeling trigger condition is satisfied; if it is, the server determines that the data to be labeled needs to be labeled and labels it based on the at least two preliminary matching categories matched with that data, thereby obtaining the labeled data of the data to be labeled.
In a specific implementation, the server can select a target category from the at least two preliminary matching categories and label the data to be labeled with it, so as to obtain the labeled data corresponding to the data to be labeled. In some embodiments, if the preliminary matching categories do not include the reference category, or the labeling trigger condition is not satisfied, the server may directly determine that the data is not labeled in the current round and move on to decide whether the next piece of data to be labeled is labeled in the current round.
In some embodiments, the number of reference categories may be one or at least two. When there is one reference category, the at least two preliminary matching categories of the data to be labeled are considered to include the reference category if that category is present among them. When there are at least two reference categories, the preliminary matching categories are considered to include the reference category if at least one of the reference categories is present among them. For example, if the reference categories are category 1, category 3 and category 5, it is enough for any one of category 1, category 3 or category 5 to appear among the at least two preliminary matching categories; the preliminary matching categories are then considered to include the reference category, and whether labeling needs to be triggered for the corresponding data to be labeled is decided by the labeling trigger condition corresponding to the reference category.
In this embodiment, when the preliminary matching categories of the data to be labeled include the reference category and the labeling trigger condition corresponding to the reference category is satisfied, the server labels the data to be labeled based on the preliminary matching categories. The reference category obtained by counting the labeled data in the labeled data set is thus used to decide whether a piece of data is labeled, so that the amount of labeled data under the different categories is adjusted dynamically, the category distribution of the labeled data is kept balanced, and a complete and balanced labeled data set is obtained.
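The two-stage check of this embodiment (the reference category is included, then the trigger condition is satisfied) might look like the following sketch. The function name `should_label`, both threshold values, and the decision to require the count condition and the matching-value condition together are assumptions made for illustration; the text only lists them as possible trigger conditions.

```python
from collections import Counter


def should_label(prelim_scores, labeled_labels, ref_categories,
                 count_threshold=100, score_threshold=0.5):
    """Return True when a matched reference category triggers labeling.

    prelim_scores maps each preliminary matching category to its matching
    quantized value. Both thresholds are illustrative assumptions.
    """
    counts = Counter(labeled_labels)
    for category, score in prelim_scores.items():
        if category not in ref_categories:
            continue  # only reference categories can trigger in this branch
        # Assumed trigger: the category is still under-represented and the
        # data matches it strongly enough.
        if counts[category] < count_threshold and score > score_threshold:
            return True
    return False


labels = ["a"] * 3 + ["b"] * 200
strong = should_label({"a": 0.8, "b": 0.9}, labels, {"a"})
weak = should_label({"a": 0.3, "b": 0.9}, labels, {"a"})
absent = should_label({"b": 0.9, "c": 0.7}, labels, {"a"})
```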
In one exemplary embodiment, the statistical parameters of the annotated data in the annotated data set, which are annotated as reference categories, satisfy a minority category criterion; when at least two kinds of preliminary matching categories include the reference category and meet the marking triggering condition, marking the targeted data to be marked based on the at least two kinds of preliminary matching categories, and obtaining marked data of the targeted data to be marked, including: when at least two preliminary matching categories comprise reference categories, obtaining matching results corresponding to the reference categories; the matching result corresponding to the reference category is obtained by matching the data to be marked with the reference category; and when the matching result corresponding to the reference category meets the marking triggering condition, marking the data to be marked according to at least two preliminary matching categories, and obtaining marked data of the data to be marked.
In the labeled data set, the reference category is a minority category, that is, the amount of labeled data labeled as the reference category is small. Specifically, the statistical parameter of the labeled data labeled as the reference category satisfies a minority category determination condition. The statistical parameter may be the data amount of the labeled data, and the minority category determination condition may include a preset number threshold or a number of minority categories. The matching result is obtained by matching the data to be labeled with the reference category, and it represents the degree of correlation between the data to be labeled and the reference category.
The server may count the labeled data in the labeled data set, specifically the data amount of the labeled data corresponding to each candidate category, so as to obtain the statistical parameter of each candidate category. It may then screen the candidate categories according to their statistical parameters and the minority category determination condition so as to determine the reference category, whose statistical parameter satisfies the minority category determination condition. For example, when the minority category determination condition includes a preset number threshold, the server may determine the candidate categories whose statistical parameters are smaller than the number threshold as reference categories; in another example, when the minority category determination condition includes the number n of minority categories, the server may determine the n candidate categories with the smallest statistical parameter values as the reference categories.
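Both variants of the minority category determination condition can be illustrated with a short sketch. The function names and the sample data are hypothetical; the statistical parameter is taken to be the per-category count of labeled data, as in the example given in the text.

```python
from collections import Counter


def minority_by_threshold(labels, candidates, max_count):
    """Candidate categories whose labeled count is below max_count."""
    counts = Counter(labels)
    return [c for c in candidates if counts[c] < max_count]


def minority_by_rank(labels, candidates, n):
    """The n candidate categories with the smallest labeled counts."""
    counts = Counter(labels)
    # sorted() is stable, so ties keep the order of `candidates`.
    return sorted(candidates, key=lambda c: counts[c])[:n]


labels = ["a"] * 5 + ["b"] * 2 + ["c"]
candidates = ["a", "b", "c", "d"]
below_three = minority_by_threshold(labels, candidates, 3)
two_smallest = minority_by_rank(labels, candidates, 2)
```

A candidate category with no labeled data at all (`"d"` here) counts as zero and is therefore always among the minority categories under either variant.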
The server compares the preliminary matching categories of the data to be labeled with the reference category, and when it determines that the preliminary matching categories include the reference category, it can obtain the matching result corresponding to the reference category. The matching result is obtained by matching the data to be labeled with the reference category and may specifically include a matching quantized value, which represents the degree of correlation between the data to be labeled and the reference category. Specifically, the larger the matching quantized value, the higher the degree of correlation, that is, the more likely the data to be labeled is to belong to the reference category. The server can compare the matching result with the labeling trigger condition to determine whether the matching result corresponding to the reference category satisfies it. For example, the labeling trigger condition may include a matching value threshold: when the matching quantized value in the matching result is greater than or equal to the matching value threshold, the matching result corresponding to the reference category is considered to satisfy the labeling trigger condition, and the server can label the data to be labeled based on the preliminary matching categories to obtain the corresponding labeled data.
In some embodiments, if the preliminary matching categories do not include the reference category, or the matching result corresponding to the reference category does not satisfy the labeling trigger condition, the server may directly determine that the data is not labeled in the current round and move on to decide whether the next piece of data to be labeled is labeled in the current round, until all the data to be labeled obtained for the current round has been traversed.
In this embodiment, the reference category is a minority category whose statistical parameter satisfies the minority category determination condition. When the preliminary matching categories of the data to be labeled include the reference category and the matching result of the reference category satisfies the labeling trigger condition, the server labels the data to be labeled based on the preliminary matching categories. The minority categories, which have smaller amounts of data, are thus used to decide whether a piece of data is labeled, and data belonging to them is labeled preferentially, so that the amount of labeled data under the different categories is adjusted dynamically, the category distribution of the labeled data is kept balanced, and a complete and balanced labeled data set is obtained.
In an exemplary embodiment, the data labeling method further includes: when at least two preliminary matching categories do not include reference categories, acquiring non-minority category labeling conditions; when the matching results corresponding to the at least two preliminary matching categories respectively meet the non-minority category marking conditions, marking the targeted data to be marked based on the at least two preliminary matching categories, and obtaining marked data of the targeted data to be marked; the matching results of the at least two preliminary matching categories are obtained by matching the data to be marked with the at least two preliminary matching categories respectively.
Here, the reference category is a minority category whose statistical parameter satisfies the minority category determination condition. The non-minority category labeling condition is a labeling trigger condition set for non-minority categories and is used to decide whether labeling is triggered for data belonging to a non-minority category. It can be set flexibly according to actual needs. For example, the non-minority category labeling condition may include that the matching quantized value corresponding to a preliminary matching category is smaller than or equal to a preset matching value threshold, where the matching quantized value is obtained by matching the data to be labeled with the preliminary matching category.
Specifically, when the preliminary matching categories of the data to be labeled do not include the reference category, the data to be labeled can be considered most likely to belong to a majority (non-minority) category, and the server can obtain the preset non-minority category labeling condition so as to decide, according to that condition, whether labeling is triggered for the data. The server can obtain the matching result corresponding to each preliminary matching category, each of which is obtained by matching the data to be labeled with that preliminary matching category. The server can compare the matching result corresponding to each preliminary matching category with the non-minority category labeling condition. When a matching result is determined to satisfy the condition, for example when the matching quantized value in the matching result is smaller than or equal to the preset matching value threshold, the server can label the data to be labeled based on the preliminary matching categories and obtain the corresponding labeled data.
In a specific implementation, it is sufficient that a matching result satisfying the non-minority category labeling condition exists among the matching results corresponding to the at least two preliminary matching categories; the condition is then considered satisfied and labeling of the corresponding data to be labeled can be triggered. In some embodiments, if none of the matching results corresponding to the preliminary matching categories satisfies the non-minority category labeling condition, the server may directly determine that the data is not labeled in the current round and move on to decide whether the next piece of data to be labeled is labeled in the current round, until all the data to be labeled obtained for the current round has been traversed.
In this embodiment, the reference category is a minority category whose statistical parameter satisfies the minority category determination condition. When the preliminary matching categories of the data to be labeled do not include the reference category and the matching results corresponding to the preliminary matching categories satisfy the non-minority category labeling condition, the server labels the data to be labeled based on the preliminary matching categories. The non-minority categories are thus also used to decide whether a piece of data is labeled, which ensures the richness of the data of the non-minority categories.
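Taken together, the minority branch and the non-minority branch form one decision per piece of data, which can be sketched as below. The function names and threshold values are hypothetical, and treating a condition as met when any single matching result satisfies it follows one reading of the text rather than a confirmed rule.

```python
def non_minority_trigger(prelim_scores, max_score=0.6):
    """True when any preliminary category's matching value is <= max_score."""
    return any(score <= max_score for score in prelim_scores.values())


def decide(prelim_scores, ref_categories,
           min_ref_score=0.5, max_other_score=0.6):
    """Decide whether one piece of data is labeled in the current round."""
    matched_refs = [c for c in prelim_scores if c in ref_categories]
    if matched_refs:
        # Minority branch: a reference category must match strongly enough.
        return any(prelim_scores[c] >= min_ref_score for c in matched_refs)
    # Non-minority branch: label only when some match is weak, i.e. when
    # the data is not already an easy example of a well-covered category.
    return non_minority_trigger(prelim_scores, max_other_score)


minority_hit = decide({"a": 0.7, "b": 0.9}, {"a"})
minority_miss = decide({"a": 0.2, "b": 0.9}, {"a"})
majority_hard = decide({"b": 0.95, "c": 0.4}, {"a"})
majority_easy = decide({"b": 0.95, "c": 0.9}, {"a"})
```

Note the opposite directions of the two comparisons: the minority branch asks for a high matching value, while the non-minority branch, as described in the text, asks for a matching value at or below its threshold.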
In one exemplary embodiment, determining, from the at least two candidate categories, at least two preliminary matching categories that match the data to be labeled includes: matching the data to be labeled with the at least two candidate categories respectively to obtain matching results corresponding to the at least two candidate categories respectively; and screening the at least two candidate categories according to the matching results respectively corresponding to the at least two candidate categories to obtain the at least two preliminary matching categories matched with the data to be labeled.
Specifically, the server may match the data to be labeled with each candidate category so as to obtain the matching result corresponding to each candidate category. In a specific implementation, the server may perform feature similarity matching between the data to be labeled and the candidate categories; for example, features extracted from the data to be labeled may be matched for similarity against the features of each candidate category to obtain the corresponding matching results. In some implementations, the server can also match the data to be labeled with each candidate category through a pre-trained matching model to obtain the corresponding matching results. The server may screen the candidate categories based on their respective matching results to determine the at least two preliminary matching categories that match the data to be labeled. For example, the matching result may include a matching quantized value, such as a feature similarity or a matching degree, and the server may determine the candidate categories whose matching quantized values exceed a quantized value threshold as the preliminary matching categories. Alternatively, the server may determine a preset number of candidate categories with the highest matching quantized values as the preliminary matching categories; for example, the n (n is greater than or equal to 2) candidate categories with the highest matching quantized values may be determined as the preliminary matching categories.
In this embodiment, the preliminary matching categories are determined from the candidate categories according to the matching results between the data to be labeled and the candidate categories. The categories for the data to be labeled are thus narrowed down first through the matching results, and the data is then labeled according to the preliminary matching categories, which ensures the accuracy of the data labeling.
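As an illustration of the feature-similarity variant, the sketch below scores a feature vector against one vector per candidate category with cosine similarity and keeps the n best. The use of cosine similarity, the function names, and the toy vectors are assumptions; the text does not fix a particular similarity measure or feature extractor.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def preliminary_categories(item_vec, category_vecs, n=2):
    """Return the n best-matching categories and all matching values."""
    scores = {c: cosine(item_vec, v) for c, v in category_vecs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], scores


category_vecs = {
    "sports": [1.0, 0.0, 0.0],
    "finance": [0.0, 1.0, 0.0],
    "tech": [0.0, 0.0, 1.0],
}
top, scores = preliminary_categories([0.9, 0.1, 0.4], category_vecs, n=2)
```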
In one exemplary embodiment, the targeted data to be annotated includes text to be annotated; matching the data to be marked with at least two candidate categories to obtain matching results corresponding to the at least two candidate categories, respectively, wherein the matching results comprise: splicing the text to be marked with at least two candidate categories respectively to obtain spliced texts corresponding to the at least two candidate categories respectively; acquiring a text matching model; the text matching model is obtained by updating marked data in the marked data set; and respectively carrying out text matching on the spliced texts respectively corresponding to the at least two candidate categories through a text matching model to obtain matching results respectively corresponding to the at least two candidate categories.
Here, the data to be labeled is data of the text modality, that is, the data to be labeled includes text to be labeled. The spliced text is obtained by concatenating the text to be labeled with a candidate category, and its format can be "text to be labeled + candidate category". The text matching model can be a network model built on any of various artificial neural network algorithms, and it is obtained by updating with the labeled data in the labeled data set.
Specifically, for data to be labeled of the text modality, that is, when the data to be labeled includes text to be labeled, the server can directly concatenate the text to be labeled with each candidate category, for example according to a fixed format, to obtain the spliced texts. Each piece of data to be labeled is concatenated with every candidate category, so if there are m candidate categories, m spliced texts are obtained for each piece of data to be labeled, each corresponding to one candidate category. The server obtains the text matching model, which has been updated based on the labeled data in the labeled data set. The server can perform text matching on each spliced text through the text matching model; specifically, it can input each spliced text into the text matching model so that the model outputs the corresponding matching result, thereby obtaining the matching results corresponding to the respective candidate categories.
In this embodiment, for a text to be marked, after the server respectively splices the text to be marked with various candidate categories, text matching is performed on each spliced text through a text matching model, so that the text to be marked is accurately matched with various candidate categories, and a preliminary matching category for matching the text to be marked is accurately determined according to a matching result.
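The concatenate-then-match step can be sketched as follows. The separator string, the function names, and in particular `toy_model` (a word-overlap scorer standing in for the neural text matching model) are hypothetical and serve only to make the example runnable.

```python
def spliced_texts(text, candidates, sep=" + "):
    """Build one 'text + category' string per candidate category."""
    return {c: f"{text}{sep}{c}" for c in candidates}


def match_all(text, candidates, model):
    """Score every spliced text; returns one matching value per category."""
    return {c: model(s) for c, s in spliced_texts(text, candidates).items()}


def toy_model(spliced, sep=" + "):
    """Stand-in scorer: fraction of category words that occur in the text."""
    text, category = spliced.rsplit(sep, 1)
    words = set(text.lower().split())
    cat_words = category.lower().split()
    return sum(w in words for w in cat_words) / len(cat_words)


results = match_all("the stock market fell",
                    ["stock market", "football"], toy_model)
```

With m candidate categories the model is called m times per text, which is the cost the preliminary screening step then reduces to a handful of categories for the labeling model.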
In one exemplary embodiment, as shown in FIG. 4, obtaining a text matching model includes:
Step 402, acquiring a text matching model to be updated and a labeled data set; the labeled data in the labeled data set includes labeled text and the labeling categories of the labeled text.
The text matching model to be updated is a text matching model that has not yet been updated; specifically, it may be the text matching model used in the previous round of labeling or an initial text matching model. The labeling category is the category label assigned to the labeled text. Optionally, the server may obtain the text matching model to be updated and the labeled data set, where the labeled data in the labeled data set specifically includes labeled text and the labeling categories of the labeled text.
Step 404, performing text matching on the labeled text and the labeling category corresponding to the labeled text through the text matching model to be updated, and obtaining a matching result of the labeled text.
The server performs text matching on the marked text and the marking category corresponding to the marked text through the text matching model to be updated, specifically, the marked text and the marking category corresponding to the marked text can be input into the text matching model to be updated for text matching, and the matching result of the marked text is output by the text matching model to be updated.
Step 406, updating the model parameters of the text matching model to be updated according to the matching result of the labeled text, and obtaining the text matching model.
Specifically, the server may update the text matching model to be updated by using the matching result of the annotated text, so as to obtain the text matching model. In a specific implementation, the server may determine an adjustment parameter according to a matching result of the annotated text, and update a model parameter of the text matching model to be updated according to the adjustment parameter, so as to obtain the text matching model. The server can utilize the obtained text matching model to carry out text matching on the data to be marked for the current marking.
In this embodiment, the server updates the model parameters of the text matching model to be updated to obtain the text matching model through the marked text in the marked data set and the corresponding marking category, and can dynamically adjust the text matching model by using the marked data in the marked data set, so as to improve the text matching accuracy of the text matching model, and accurately determine the preliminary matching category of the text to be marked according to the matching result.
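Steps 402 to 406 can be illustrated with a deliberately small stand-in for the text matching model: a logistic scorer over hashed (word, category) features, updated by gradient descent on the labeled pairs. Everything about this model is an assumption made for the sketch, including the hashing, the learning rate, and the use of the other candidate categories as negative examples, which the text does not mention.

```python
import math
import zlib


def features(text, category, dim):
    """Hash each (word, category) pair of the spliced input to an index."""
    return [zlib.crc32(f"{w}|{category}".encode("utf-8")) % dim
            for w in text.lower().split()]


def score(weights, text, category):
    """Matching value in (0, 1) for one text/category pair."""
    z = sum(weights[i] for i in features(text, category, len(weights)))
    return 1.0 / (1.0 + math.exp(-z))


def update_model(weights, labeled, candidates, lr=0.5, epochs=20):
    """Return updated weights; the input weights are left unchanged."""
    weights = list(weights)
    for _ in range(epochs):
        for text, label in labeled:
            for category in candidates:
                # Assumption: non-labeled categories act as negatives.
                target = 1.0 if category == label else 0.0
                error = score(weights, text, category) - target
                for i in features(text, category, len(weights)):
                    weights[i] -= lr * error  # logistic-loss gradient step
    return weights


old = [0.0] * 1024
labeled = [("refund my order", "billing"), ("app crashes on start", "bug")]
new = update_model(old, labeled, ["billing", "bug"])
```

The matching result of each labeled text determines the size of the adjustment (`error`), which corresponds to the "adjustment parameter" mentioned in the text; a real text matching model would replace the hashed weights with network parameters.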
In an exemplary embodiment, screening at least two candidate categories according to matching results respectively corresponding to the at least two candidate categories to obtain at least two preliminary matching categories matched with the data to be marked, including: obtaining a matching result reservation condition; screening at least two reserved categories meeting the reserved conditions of the matched result from the at least two candidate categories according to the matched results respectively corresponding to the at least two candidate categories; and obtaining at least two preliminary matching categories matched with the data to be marked based on the at least two reserved categories.
The matching result retention conditions are used for screening various candidate categories, and the candidate category corresponding to the matching result meeting the matching result retention conditions can be determined as the category matched with the data to be marked. The matching result reservation condition can be flexibly set according to actual needs, for example, the matching degree threshold value, the number of matching categories and the like can be included. The reservation category is a candidate category which meets the reservation condition of the matching result and is screened from various candidate categories. For example, when the matching result retention condition includes a matching degree threshold, the retention category may be a candidate category in which the matching quantized value exceeds the matching degree threshold in the corresponding matching result.
Optionally, the server acquires preset matching result reservation conditions, and the server screens based on matching results corresponding to various candidate categories respectively to obtain reservation categories meeting the matching result reservation conditions, wherein the reservation categories are at least two. For example, the matching result includes a matching quantization value, and the matching result reservation condition includes the number x (x is greater than or equal to 2) of matching categories, and then the server may determine, according to the matching quantization value corresponding to each candidate category, the x candidate categories with the largest matching quantization value as reserved categories. The server obtains a preliminary matching category matched with the data to be marked based on the determined reserved category, for example, the server can directly use the reserved category as the preliminary matching category matched with the data to be marked. In addition, when the number of reserved categories is larger, for example, when the number of reserved categories exceeds a number threshold, the server may further screen for each reserved category to obtain a preliminary matching category.
In this embodiment, the server screens out the reserved categories meeting the reserved conditions of the matching result from the various candidate categories according to the matching results respectively corresponding to the various candidate categories, and determines the preliminary matching category according to the reserved categories, so that the preliminary matching category can be accurately screened out through the reserved conditions of the matching result, and the accuracy of the preliminary matching category can be ensured.
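A matching result reservation condition that supports a matching degree threshold, a number of matching categories, or both, followed by the optional further screening, might be sketched like this. The function names, the parameter names, and the choice of simple truncation for the further screening are assumptions.

```python
def reserved_categories(scores, min_score=None, top_x=None):
    """Apply the reservation condition; result is ordered best match first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if min_score is not None:
        # Keep categories whose matching value exceeds the threshold.
        ranked = [(c, s) for c, s in ranked if s > min_score]
    if top_x is not None:
        ranked = ranked[:top_x]
    return [c for c, _ in ranked]


def preliminary_from_reserved(reserved, max_categories=3):
    """Assumed further screening: keep only the strongest matches."""
    return reserved[:max_categories]


scores = {"a": 0.9, "b": 0.7, "c": 0.65, "d": 0.2, "e": 0.61}
by_threshold = reserved_categories(scores, min_score=0.6)
by_count = reserved_categories(scores, top_x=2)
prelim = preliminary_from_reserved(by_threshold, max_categories=3)
```

A caller would still need to check that at least two categories survive, since the method requires at least two preliminary matching categories.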
In an exemplary embodiment, labeling the data to be labeled based on at least two preliminary matching categories to obtain labeled data of the data to be labeled, including: acquiring a data annotation model and annotation prompt words; labeling the data to be labeled according to the labeling prompt words and at least two preliminary matching categories through a data labeling model, and obtaining labeling results of the data to be labeled; and obtaining marked data aiming at the data to be marked based on the marking result.
The data labeling model is used to label input data; specifically, it can label the category to which the input data belongs. The data labeling model can be built in advance on any of various artificial neural network algorithms. A labeling prompt word (prompt) is a way of invoking a machine learning model: it is a piece of text or a sentence that instructs the model to generate output of a particular type, topic, or format. Specifically, the labeling prompt word instructs the data labeling model to generate a category of a particular type, topic, or format. The labeling result may include the category label assigned to the data to be labeled, and the labeled data can be obtained based on the labeling result.
For example, the server may obtain a pre-built data labeling model and a preset labeling prompt word. The server can label the data to be labeled through the data labeling model; specifically, the data labeling model labels the data according to the labeling prompt word and the preliminary matching categories of the data, so that a labeling result of the data to be labeled is obtained. In a specific implementation, the server can input the labeling prompt word, the data to be labeled, and the preliminary matching categories of the data into the data labeling model, which outputs the labeling result. The server can obtain the labeled data based on the labeling result; specifically, it can obtain the labeled data by attaching the category label in the labeling result to the data to be labeled.
In the embodiment, the server marks the data to be marked through the data marking model and the marking prompt words, so that the accuracy and the processing efficiency of the data marking can be effectively improved.
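The flow described above (prompt, data to be labeled, and preliminary matching categories passed to a labeling model) might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `build_prompt`, `label_item`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in that simply returns the first listed category in place of a real data labeling model.

```python
from typing import Callable, Dict, Sequence


def build_prompt(instruction: str, text: str, categories: Sequence[str]) -> str:
    """Combine the labeling prompt, the item to label and its preliminary matching categories."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(categories))
    return (
        f"{instruction}\n\nText: {text}\n\n"
        f"Candidate categories:\n{options}\n\nAnswer with one category name."
    )


def label_item(
    model: Callable[[str], str],
    instruction: str,
    text: str,
    categories: Sequence[str],
) -> Dict[str, str]:
    """Ask the labeling model for a category and attach it to the item as its label."""
    if len(categories) < 2:
        raise ValueError("expected at least two preliminary matching categories")
    result = model(build_prompt(instruction, text, categories)).strip()
    return {"text": text, "label": result}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model: returns the first listed category."""
    for line in prompt.splitlines():
        if line.startswith("1. "):
            return line[3:]
    return ""


labeled = label_item(
    toy_model,
    "Classify the customer message.",
    "My card was charged twice",
    ["billing dispute", "card loss"],
)
```

Passing the model as a plain callable keeps the sketch independent of any particular model service; a real deployment would replace `toy_model` with a call to the selected data labeling model.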
In one exemplary embodiment, obtaining a data annotation model and annotation hints comprises: acquiring a labeling data sample, and determining a candidate labeling model and a candidate prompt word; labeling according to the candidate prompt words and the labeling data samples through the candidate labeling model to obtain sample labeling results of the labeling data samples; and when the sample labeling result is matched with the category label of the labeling data sample, obtaining a data labeling model according to the candidate labeling model, and obtaining labeling prompting words according to the candidate prompting words.
The labeling data sample is sample data for selecting a data labeling model and labeling prompt words, and can be extracted from all data to be labeled according to actual needs. The candidate labeling models and the candidate prompting words are models to be judged and prompting words, and the data labeling models and the labeling prompting words meeting the requirements are selected by judging the candidate labeling models and the candidate prompting words. The sample labeling result is obtained by labeling the labeling data sample through the candidate labeling model and the candidate prompt word. The category labels are labels obtained by labeling the labeling data samples in advance.
Illustratively, the server obtains a labeling data sample and determines a candidate labeling model and a candidate prompt. The server can label the labeling data sample through the candidate labeling model and the candidate prompt to obtain a sample labeling result. In a specific application, the server can input the labeling data sample and the candidate prompt into the candidate labeling model, which outputs the sample labeling result. The server can determine the category label of the labeling data sample and match the sample labeling result against it; for example, the server can determine the difference between the sample labeling result and the category label, and determine the matching degree between the two based on that difference. When the sample labeling result matches the category label, the server can determine that the candidate labeling model and the candidate prompt meet the labeling requirements; that is, the server can obtain the data labeling model from the candidate labeling model and the labeling prompt from the candidate prompt. Specifically, the server can directly use the candidate labeling model as the data labeling model and the candidate prompt as the labeling prompt.
In some embodiments, if the sample labeling result does not match the category label of the labeling data sample, the current candidate labeling model and candidate prompt do not meet the labeling requirements. The server may then reselect a candidate labeling model and a candidate prompt and screen them again, repeating until the sample labeling result matches the category label of the labeling data sample, and then obtain the data labeling model and the labeling prompt from the corresponding candidate labeling model and candidate prompt.
In this embodiment, the server marks the marked data sample through the candidate marking model and the candidate prompting word, and screens the candidate marking model and the candidate prompting word based on the matching degree between the sample marking result and the category label of the marked data sample, so as to determine the data marking model and the marking prompting word, and ensure the accuracy of marking the data marking model and the marking prompting word, thereby ensuring the accuracy and the processing efficiency of the data marking.
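The screening loop described above can be sketched as follows. This is only an illustrative sketch: the text does not say how the matching degree is computed or what threshold is used, so the fraction of agreeing labels (`match_rate`) and the `min_match` value of 0.9 are assumptions, and the two toy models are hypothetical stand-ins for real candidate labeling models.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# A candidate labeling model is modelled as: (prompt, text) -> predicted category.
Model = Callable[[str, str], str]


def match_rate(model: Model, prompt: str, samples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of labeled samples whose predicted category equals the category label."""
    hits = sum(1 for text, gold in samples if model(prompt, text) == gold)
    return hits / len(samples)


def select_model_and_prompt(
    candidates: Iterable[Tuple[str, Model, str]],
    samples: Sequence[Tuple[str, str]],
    min_match: float = 0.9,  # assumed threshold, not taken from the source
) -> Optional[Tuple[str, str]]:
    """Return (model name, prompt) of the first candidate pair that meets the threshold."""
    for name, model, prompt in candidates:
        if match_rate(model, prompt, samples) >= min_match:
            return name, prompt
    return None  # no candidate qualified; new candidates need to be chosen


def always_refund(prompt: str, text: str) -> str:
    """Weak toy candidate: ignores the text entirely."""
    return "refund"


def keyword_model(prompt: str, text: str) -> str:
    """Better toy candidate: looks for a keyword in the text."""
    return "refund" if "refund" in text else "account"


samples = [("refund my order", "refund"), ("reset my password", "account")]
chosen = select_model_and_prompt(
    [
        ("always_refund", always_refund, "Pick one label."),
        ("keyword", keyword_model, "Pick one label."),
    ],
    samples,
)
```

Here the first candidate matches only half of the samples and is rejected, so the loop moves on to the second candidate, mirroring the reselection step described for the case where the sample labeling result does not match.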
In an exemplary embodiment, when a sample labeling result is matched with a category label of a labeled data sample, obtaining a data labeling model according to a candidate labeling model, and obtaining a labeling prompt according to a candidate prompt, including: obtaining a category label of the labeling data sample; the category labels comprise at least two labeling labels which are obtained by labeling the labeling data samples based on different labeling modes; determining a first labeling difference between the sample labeling result and at least two labeling labels and a second labeling difference between the at least two labeling labels respectively; and when the first annotation difference is matched with the second annotation difference, determining the candidate annotation model as a data annotation model, and determining the candidate prompt word as an annotation prompt word.
The category labels at least comprise two labeling labels, and the two labeling labels can be obtained by labeling the labeling data samples based on different labeling modes. Different labeling modes can be realized by different labeling rules, different labeling subjects and the like. For example, a plurality of labeling labels can be obtained by labeling the labeling data samples by different labeling people. The first labeling difference characterizes the difference between the labeling result of the sample and the category label, namely the difference between the labeling result of the candidate labeling model and the candidate prompt word and the labeling results of other labeling modes. The second labeling difference characterizes the difference between the labeling labels, namely the difference between labeling results representing different labeling modes.
Specifically, after the sample labeling result of the labeling data sample is obtained, the server may obtain the category labels of the labeling data sample, where the category labels include at least two labeling labels obtained by labeling the labeling data sample in different labeling modes. The server can determine first labeling differences between the sample labeling result and each of the at least two labeling labels; specifically, the server can compare the sample labeling result with each labeling label to obtain the corresponding first labeling difference. In addition, the server may determine a second labeling difference between the labeling labels; specifically, the server can compare the labeling labels with one another to obtain the second labeling difference. The server may then compare the first labeling difference with the second labeling difference to determine whether they match; for example, the server may perform a difference evaluation on the first labeling difference and the second labeling difference to obtain an evaluation result. When the evaluation result indicates that the first labeling difference matches the second labeling difference, the server can obtain the data labeling model from the candidate labeling model and the labeling prompt from the candidate prompt; specifically, the server determines the candidate labeling model as the data labeling model and the candidate prompt as the labeling prompt.
In this embodiment, the server screens the candidate labeling model and the candidate prompt word through the first labeling difference between the sample labeling result and the category label and the second labeling difference between each labeling label to determine the data labeling model and the labeling prompt word, so that the accuracy of labeling the data labeling model and the labeling prompt word can be ensured, and the accuracy and the processing efficiency of data labeling are ensured.
In an exemplary embodiment, obtaining marked data for the data to be marked based on the marking result includes: obtaining constraint conditions of labeling results; updating the labeling result based on the labeling result constraint condition to obtain an updated labeling result; and obtaining marked data of the data to be marked according to the updated marking result.
The labeling result constraint condition is used for carrying out standardization processing on the labeling result so that the labeling result meets the standardization requirement, and therefore the effectiveness of the labeling result is ensured. The constraint conditions of the labeling result can be flexibly set according to the actual application requirements, and can comprise normalized conditions such as labeling range, labeling format and the like.
Optionally, the server may obtain a preset labeling result constraint condition and update the labeling result according to it to obtain an updated labeling result. For example, the server can apply a labeling template in the labeling result constraint condition to the labeling result to obtain a templated labeling result. The server can then obtain the labeled data of the data to be labeled according to the updated labeling result; specifically, the server can add the category label corresponding to the updated labeling result to the data to be labeled to obtain the corresponding labeled data, thereby completing the labeling of the data to be labeled.
In this embodiment, the server updates the labeling result according to the constraint condition of the labeling result, and obtains labeled data based on the updated labeling result, so as to ensure that the labeling result meets the constraint condition, thereby ensuring the labeling validity of the labeled data.
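As a rough illustration of applying a labeling result constraint condition, the sketch below normalizes a raw model answer against an allowed list of categories (a "labeling range" constraint) and a simple format rule. The function name `normalize_label` and the specific rules (case-insensitive exact match, otherwise accept the answer only if it mentions exactly one allowed category) are assumptions made for this example; the text leaves the concrete constraints open.

```python
import re
from typing import Optional, Sequence


def normalize_label(raw: str, allowed: Sequence[str]) -> Optional[str]:
    """Map a raw labeling result onto one allowed category, or None if it cannot be resolved."""
    # Format constraint: collapse whitespace and ignore case.
    cleaned = re.sub(r"\s+", " ", raw).strip().lower()

    # Range constraint, step 1: the answer is exactly one of the allowed categories.
    for category in allowed:
        if cleaned == category.lower():
            return category

    # Range constraint, step 2: the answer mentions exactly one allowed category.
    hits = [
        category
        for category in allowed
        if re.search(r"\b" + re.escape(category.lower()) + r"\b", cleaned)
    ]
    return hits[0] if len(hits) == 1 else None


allowed = ["billing dispute", "card loss"]
clean = normalize_label("Category:  Billing Dispute.", allowed)
ambiguous = normalize_label("billing dispute or card loss", allowed)
unknown = normalize_label("I am not sure", allowed)
```

Returning `None` for ambiguous or out-of-range answers lets the caller decide whether to drop the item or send it back for another labeling attempt.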
In an exemplary embodiment, updating the marked data set by the marked data of each of the at least one data to be marked, and continuing marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked, including: and adding the marked data of each at least one piece of data to be marked into the marked data set, and returning to execute the step of acquiring the at least one piece of data to be marked from the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
For example, the server may add each piece of labeled data obtained to the labeled data set, thereby updating the labeled data set. The server can then return to the step of acquiring at least one piece of data to be labeled from the data set to be labeled, so that the next round of labeling is performed on the data in the data set to be labeled using the updated labeled data set, until all data to be labeled in the data set to be labeled has been labeled. In a specific application, after updating the labeled data set, the server may remove from the data set to be labeled the data corresponding to the labeled data, thereby updating the data set to be labeled, and then perform the next round of labeling on the data that remains unlabeled.
In this embodiment, the server adds the marked data obtained by the marking to the marked data set and performs the marking of the next time, so that the marked data set can be repeatedly marked after being updated according to the marked data, the data quantity of the marked data under different classes can be adjusted, the class distribution of the marked data is ensured to be balanced, and the training effect based on the marked data is enhanced.
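The iterate-and-update loop described above can be sketched as follows. This is a simplified illustration, assuming a hypothetical `label_batch` callback that receives the current batch and the per-category counts of the labeled data set (from which a reference category could be derived) and returns labels for the items it chose to label; `toy_labeler` is a stand-in used only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Labeler = Callable[[List[str], Counter], Dict[str, str]]


def label_until_done(
    pending: Sequence[str], label_batch: Labeler, batch_size: int = 2
) -> List[Tuple[str, str]]:
    """Repeatedly label batches, feeding the updated labeled set back into each round."""
    remaining = list(pending)
    labeled: List[Tuple[str, str]] = []
    while remaining:
        batch = remaining[:batch_size]
        # Statistics over the labeled data set, recomputed after every update.
        counts = Counter(label for _, label in labeled)
        results = label_batch(batch, counts)
        if not results:
            break  # nothing was labeled this round; stop instead of looping forever
        labeled.extend(results.items())
        # Remove the newly labeled items from the set still to be labeled.
        remaining = [text for text in remaining if text not in results]
    return labeled


def toy_labeler(batch: List[str], counts: Counter) -> Dict[str, str]:
    """Hypothetical labeler: labels by text length and ignores the counts."""
    return {text: ("short" if len(text) < 10 else "long") for text in batch}


done = label_until_done(["a", "bbbbbbbbbbbb", "cc"], toy_labeler)
```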
The present application also provides an application scenario to which the above data labeling method is applied. Specifically, the data labeling method is applied in this scenario as follows:
In this embodiment, the data to be labeled is image-modality data; that is, the data to be labeled in the data set to be labeled consists of images to be labeled. The server may obtain at least one image to be labeled from the data set to be labeled and determine the various candidate categories. The server traverses each acquired image to be labeled and determines, from the candidate categories, the preliminary matching categories that match the image. The server obtains a reference category determined by counting the labeled images in the labeled data set; when the preliminary matching categories determined for an image to be labeled include the reference category, the server can label the image based on the preliminary matching categories to obtain the corresponding labeled image. After traversing each image to be labeled and obtaining the corresponding labeled images, the server can update the labeled data set with the labeled images and continue labeling the images in the data set to be labeled until all images to be labeled have been labeled.
The present application also provides a further application scenario to which the above data labeling method is applied. Specifically, the data labeling method is applied in this scenario as follows:
When building text classification or text intent recognition tasks, a large labeled data set is often required for model learning in order to achieve the expected model performance. In the financial field, because of its professional and specialized nature, there are very few mature general-purpose data sets, and a corresponding personalized data set must be constructed for each type of task. Current industry solutions label data either manually or by designing a set of keyword rules. However, for problems with a large number of categories or with a strongly long-tailed category distribution, data for the tail categories is hard to find, and constructing a data set with enough data in every category incurs a large labeling cost. Manual labeling also requires annotators to have a certain amount of financial domain knowledge, so it is difficult, costly, and inefficient. The keyword approach, in turn, requires professionals to design the keyword rules; its fixed labeling pattern can bias the distribution of the data set; and the limited accuracy of keywords means professionals must still audit the data.
Based on this, when labeling text, the data labeling method provided in this embodiment can fully automatically expand a short-text labeled data set in the financial field through text matching, active learning, and a large language model (LLM). The knowledge understanding capability, domain expertise, and output capability of the large language model replace the previous manual or rule-based labeling methods, and text matching together with active learning meets the distribution balance requirement of the labeled data set. Specifically, a BERT (Bidirectional Encoder Representations from Transformers) model is used to vectorize the data to be labeled and the categories; a matching algorithm selects the top-x (i.e., the x best-matching) results, which are passed to the large model for labeling. After labeling, the new data set is used to adjust the parameters of the BERT model, and the matching between the data to be labeled and the categories is repeated to generate a new batch of matching results. Text data whose matching result falls in the bottom-n categories (i.e., the n categories with the least data in the labeled data set) is then selected and passed to the large model for labeling. At the same time, for categories with a sufficient amount of data, data with low matching scores is selected to form a new data set to be labeled, which is passed to the large model for labeling. These steps are repeated multiple times. The whole process can produce a complete and balanced labeled data set without human intervention.
The data labeling method provided by the embodiment can be applied to the early data construction of various text classification tasks, and the classification recognition effect is improved by constructing a complete personalized labeling data set, and finally the experience of various application scenes of a platform is improved, for example, the method can be applied to classification scenes of customer complaints, recognition scenes of user intentions in customer service and the like.
Specifically, the input of the data labeling method provided in this embodiment is unlabeled text and a category system designed for a specific task. First, the text and the categories are vectorized through Embedding and then matched. An active learning algorithm then selects the text data that is lacking in the labeled data set, for example where a category has little data, or where a certain type of data within a category is scarce. The selected unlabeled data, together with the x most likely categories obtained by vector matching, is passed to the large model for labeling, and the labeling result is added to the labeled data set. The newly added labeled data is then used to tune the Embedding algorithm used in text matching, strengthening its matching capability; the previous steps are repeated, unlabeled data continues to be matched, and the process iterates. The final result is a labeled data set with complete data and a uniform distribution. Here, Embedding is the collective term for language model and feature learning techniques in natural language processing; conceptually, it means embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector over the real numbers. Text matching is used to judge whether two pieces of text express the same semantics. Active learning is an algorithm that uses a machine learning method to obtain sample data that is difficult to classify, has that data labeled, and then retrains a supervised or semi-supervised learning model on the labeled data, gradually improving the model. A large language model (LLM) is an artificial intelligence model intended to understand and generate human language.
As shown in FIG. 5, the data to be labeled is unlabeled text, and the input consists of the unlabeled text and a category system, where the category system may include at least two candidate categories. The server performs text matching between the unlabeled text and each candidate category in the category system, and samples the unlabeled text based on active learning to screen out the unlabeled text that needs to be labeled in the current round. Specifically, the preliminary matching categories of the unlabeled text to be labeled in the current round include the reference category. The server labels this unlabeled text through the large model and adds the resulting labeled data to the labeled data set. Based on the labeled data set, the server may provide the bottom-n categories, i.e., those with the least labeled data, in order to determine the reference category for sample screening. The server may also fine-tune the word embeddings used for text matching based on the labeled data set; specifically, text matching may be fine-tuned according to a Fine Tune Embedding (embedding fine-tuning) algorithm.
Further, the text matching process has three inputs: the unlabeled text, the category system, and the labeled data set for which labeling has been completed. The labeled data set may be empty in the first round of data set construction. The labeled data set is used to fine-tune the BERT model so that its text embeddings are better adapted to the task. The BERT model is a pre-trained language representation model that can generate deep bidirectional language representations. Model tuning refers to the process of optimizing model performance in machine learning and deep learning by adjusting the parameters and hyperparameters of the model. This embodiment adopts a BERT-based single-tower model and improves the text matching performance of the BERT model by constructing a tuning data set with a pairwise method. Specifically, as shown in FIG. 6, the server may splice each unlabeled sample with each category in the category system; given n unlabeled samples and m categories, n×m pieces of text data are constructed in the format "[CLS] text [SEP] category". A matching model, specifically the BERT model, scores the matching degree of these pieces of data, and the top-x scores are retained as the x categories that best match the text. Because the LLM has an input length limit, its input cannot be too long; retaining only x categories relieves the comprehension burden on the LLM when the category system is large, thereby safeguarding the labeling quality.
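The n×m pairing and top-x selection described above can be sketched as follows. This is a minimal illustration rather than the matching model of the embodiment: the scoring function is passed in as a parameter, and `overlap_score` is a hypothetical word-overlap stand-in for the fine-tuned BERT scorer, used only so the example runs without any model. The separator tokens are written in BERT's usual uppercase form.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_pairs(texts: Sequence[str], categories: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Splice every text with every category: n texts and m categories give n*m pairs."""
    return [(t, c, f"[CLS] {t} [SEP] {c}") for t in texts for c in categories]


def top_x_matches(
    texts: Sequence[str],
    categories: Sequence[str],
    score: Callable[[str], float],
    x: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """Score every spliced pair and keep the x best-scoring categories for each text."""
    scored: Dict[str, List[Tuple[str, float]]] = {t: [] for t in texts}
    for text, category, pair in build_pairs(texts, categories):
        scored[text].append((category, score(pair)))
    return {
        text: sorted(items, key=lambda item: item[1], reverse=True)[:x]
        for text, items in scored.items()
    }


def overlap_score(pair: str) -> float:
    """Hypothetical stand-in scorer: share of category words that also occur in the text."""
    text, category = pair.replace("[CLS]", "").split("[SEP]")
    text_words = set(text.lower().split())
    category_words = set(category.lower().split())
    return len(text_words & category_words) / len(category_words) if category_words else 0.0


texts = ["my card was lost", "refund the fee"]
categories = ["card lost", "fee refund", "loan rate"]
matches = top_x_matches(texts, categories, overlap_score, x=2)
```

Only the x retained categories per text would then be placed in the prompt, which is what keeps the LLM input short when the category system is large.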
Further, for the sampling process based on active learning, the inputs are the distribution of the labeled data set together with each text and its top-x matched categories and scores, i.e., the output of the text matching module. The sampling process first obtains, from the distribution of the labeled data set, the n categories with the least labeled data, i.e., bottom-n. It then checks each text against its matching result. If the matching result belongs to the bottom-n categories, the text undergoes a threshold evaluation against a threshold A: only a text whose matching score is higher than threshold A proceeds to large-model labeling, which yields more reliable text data for the low-volume categories. If the matching result does not belong to the bottom-n categories, the text undergoes a separate threshold evaluation against a threshold B: only a text whose matching score is lower than threshold B proceeds to large-model labeling, so that categories with sufficient data gain data of different types, increasing data richness. The output of the sampling process is the data set to be labeled, together with the top-x category matching results that the text matching module produced for that data.
As shown in FIG. 7, given the labeled data set and each text with its top-x matched categories, the server may determine whether the top-x categories of the text belong to the bottom-n categories with the smallest labeled amount. If so, the server further determines whether the matching score of the top-x categories of the text is greater than or equal to threshold A; if it is, the text is determined to be a sample to be labeled and added to the sample set to be labeled, and its top-x categories are recorded as part A for subsequent labeling. If the matching score of the top-x categories of the text is smaller than threshold A, the next text is acquired and judged. On the other hand, when the top-x categories of the text do not belong to the bottom-n categories, the server may determine whether the matching score of the top-x categories of the text is less than or equal to threshold B; if it is, the text is determined to be a sample to be labeled and added to the sample set to be labeled, and its top-x categories are recorded as part B for subsequent labeling. If the matching score of the top-x categories of the text is greater than threshold B, the next text is acquired and judged. After every text has been traversed, the sample set to be labeled and the corresponding top-x categories are obtained.
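The two-branch sampling rule described above might be sketched as follows. Several details are assumptions made for this example because the text does not pin them down: a text is treated as belonging to the bottom-n branch if any of its top-x categories is a bottom-n category, the score compared with threshold A is the best score among those rare categories, and the score compared with threshold B is the best score overall. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

Matches = Dict[str, List[Tuple[str, float]]]  # text -> [(category, score), ...]


def bottom_n_categories(categories: Sequence[str], labeled_counts: Counter, n: int) -> Set[str]:
    """The n categories with the fewest items in the labeled data set."""
    return set(sorted(categories, key=lambda c: labeled_counts.get(c, 0))[:n])


def select_samples(
    matches: Matches, rare: Set[str], threshold_a: float, threshold_b: float
) -> Dict[str, List[str]]:
    """Pick texts to send for labeling, together with their top-x categories."""
    selected: Dict[str, List[str]] = {}
    for text, top_x in matches.items():
        if not top_x:
            continue
        rare_scores = [score for category, score in top_x if category in rare]
        if rare_scores:
            # Branch A: a confident match to a low-volume category.
            keep = max(rare_scores) >= threshold_a
        else:
            # Branch B: a weak match to well-covered categories adds variety.
            keep = max(score for _, score in top_x) <= threshold_b
        if keep:
            selected[text] = [category for category, _ in top_x]
    return selected


counts = Counter({"a": 50, "b": 2, "c": 40})
rare = bottom_n_categories(["a", "b", "c"], counts, n=1)
matches = {
    "t1": [("b", 0.9), ("a", 0.3)],   # rare category, confident
    "t2": [("b", 0.4), ("a", 0.2)],   # rare category, not confident
    "t3": [("a", 0.3), ("c", 0.2)],   # common categories, weak match
    "t4": [("a", 0.95), ("c", 0.1)],  # common categories, strong match
}
selected = select_samples(matches, rare, threshold_a=0.8, threshold_b=0.5)
```

With these toy inputs, `t1` is kept through branch A and `t3` through branch B, while `t2` and `t4` are skipped, which matches the intent of favoring reliable rare-category data and atypical common-category data.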
Further, for large-model labeling, as shown in FIG. 8, given the sample set to be labeled and the top-x categories, the server may carry out prompt engineering to determine the prompt used for labeling. The server can then perform question-and-answer labeling through the selected LLM, apply text template normalization to the labeling output, and correct the large model's output based on a template filtering mechanism, thereby obtaining a completed labeled data set. In a specific implementation, three pieces of work are completed at the start of the flow; once they are done, the server can run automatically without manual intervention.
The first aspect is prompt engineering. The LLM is essentially a generative model, and carefully designed prompts can guide the model to better understand the task requirements and thus produce more efficient and accurate output. A prompt can be understood as a way of invoking a machine learning model: it is a piece of text or a sentence that directs the model to generate output of a particular type, topic, or format. Suitable prompts can be designed using prompting methods such as direct questioning, chain of thought, clues, and reasons. Because the large model has an input limit, i.e., the length of the prompt is limited, the text matching module selects the top-x matching intents to shorten the prompt and reduce the comprehension burden on the large model.
The second aspect is the selection of the large model. Different models differ in capability and emphasis, so a suitable large model must be selected for the labeling task at hand. Specifically, as shown in FIG. 9, when determining the large model and the prompt, a small amount of data is sampled first. Two annotators (annotator A and annotator B) each label the same batch of sampled data, producing two sets of labeling results for the same data, and a difference evaluation between the two annotators' results yields a classification F1 value, $F_1^{A\&B}$. The selected large model, with the configured prompt, labels the same batch of sampled data, and difference evaluations between the large model and each of the two annotators yield $F_1^{M\&A}$ and $F_1^{M\&B}$, which represent the difference between the model and annotator A and the difference between the model and annotator B, respectively. When $F_1^{A\&B} \approx F_1^{M\&A}$ and $F_1^{A\&B} \approx F_1^{M\&B}$, the gaps between $F_1^{M\&A}$, $F_1^{M\&B}$ and $F_1^{A\&B}$ are small, and the labeling capability of the large model can be considered close to that of a professional annotator, so a large model and prompt with adequate labeling capability have been obtained. Otherwise, the currently selected large model and prompt engineering are insufficient for the labeling task, and the large model or the prompt needs to be replaced.
The third aspect is output filtering. Because of the generative nature of the large model, its output still suffers from a certain degree of hallucination, which can produce results that do not match the standard categories. In this case the output needs to undergo templated filtering and is normalized by a text regularization method to determine the final labeling answer.
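The F1-based comparison between the model and the two annotators can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not say which F1 variant is used or how close "approximately equal" must be, so the macro-averaged F1 and the `tolerance` parameter are assumptions, and the function names are hypothetical. Macro F1 is symmetric in its two arguments, which makes it usable as an agreement measure between any two label sequences.

```python
from typing import Sequence


def macro_f1(ref: Sequence[str], pred: Sequence[str]) -> float:
    """Macro-averaged F1 between two label sequences over the same items."""
    labels = sorted(set(ref) | set(pred))
    total = 0.0
    for label in labels:
        tp = sum(1 for r, p in zip(ref, pred) if r == label and p == label)
        fp = sum(1 for r, p in zip(ref, pred) if r != label and p == label)
        fn = sum(1 for r, p in zip(ref, pred) if r == label and p != label)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)


def model_is_acceptable(
    annotator_a: Sequence[str],
    annotator_b: Sequence[str],
    model: Sequence[str],
    tolerance: float = 0.05,  # assumed meaning of "approximately equal"
) -> bool:
    """True if model-vs-annotator agreement is close to annotator-vs-annotator agreement."""
    f_ab = macro_f1(annotator_a, annotator_b)
    f_ma = macro_f1(annotator_a, model)
    f_mb = macro_f1(annotator_b, model)
    return abs(f_ab - f_ma) <= tolerance and abs(f_ab - f_mb) <= tolerance


a = ["x", "x", "y", "y", "x", "y"]
b = ["x", "x", "y", "x", "x", "y"]
m = ["x", "y", "y", "y", "x", "y"]
strict = model_is_acceptable(a, b, m, tolerance=0.05)
loose = model_is_acceptable(a, b, m, tolerance=0.2)
```

In this toy data the model agrees with annotator A about as well as the annotators agree with each other, but less well with annotator B, so the outcome depends on how much tolerance is allowed; in practice the tolerance would be chosen to suit the task.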
Based on the three aspects, the large model can realize the labeling capability of the professional labeling personnel, label the data and output a labeling data set. Further, the output labeling data set can be used for depth model tuning in the text matching module, so that the model has better effect on the matching of texts and the Embedding of classification categories, and then the steps of text matching, active learning-based sampling and large model labeling are repeated until the categories in the labeling data set have sufficient data quantity meeting the requirements.
The data labeling method provided by the embodiment can realize the construction processing of a text data set aiming at the financial field, and can solve the problems of high data labeling cost, high professional knowledge barrier, easy distribution of long tails, bias and the like in such tasks. Specifically, for the long-tail problem of multiple categories, text matching between financial short text and classification category is performed by a text Embedding method, and the classification option of top-x is selected for selection by a large model, so that the top-x category is used instead of the whole category library because the large model has input limit and illusion problems, and the input of overlong text needs to be avoided as much as possible, so that the most probable category is selected by using a text matching mode. After the large model is marked, the word Embedding Fine tuning (Fine-tuning) algorithm of marked data is adopted, so that the Fine tuning algorithm is more suitable for the current task, then text matching is carried out on a new text of unmarked data, the data with the matching result top-n being an unusual class is selected, the data with the lowest matching similarity is selected for the data with the top-n being a common class, the large model is marked, fine tuning is carried out based on the word Embedding Fine tuning algorithm, multiple rounds are circulated, and full data marking of all the classes is finally achieved.
Further, for the manual labeling problem, the threshold problem of labeling personnel can be solved through professional knowledge of a large model in the financial field and text understanding capability, and meanwhile, the labeling efficiency is far more than manual through deployment of the large model. The data labeling method provided in this embodiment deploys a suitable large model, and then constructs a Prompt project by a COT (Chain-of-thoughts) method or a CART (Classification And Regression Tree, classification regression tree) algorithm, so that the large model judges the classification to which the text belongs. The method for selecting the large model is to perform cross verification with the sampled artificial data, and the large model is considered to be suitable if the annotation gap between the large model and the person is close to the annotation gap between the person. In addition, for the problem of bias of a keyword mode, the data bias nature caused by the keyword mode is also a long tail problem, an Embedding algorithm is optimized through an active learning method, repeated labeling of common types of data is avoided, for example, top-n is selected as data of common categories after the second round of active learning, but the data are data with lowest matching similarity, so that the data richness of categories of more data is increased, and bias is avoided.
In a specific application, the data labeling method provided by the embodiment is applied to a financial platform work order analysis system, and a complete and rich data set is built for user intention classification tasks in work order analysis at low cost. In the data set construction of the task, 800 pieces of data can be annotated every day by manual annotation. By the data labeling method provided by the embodiment, 13w pieces of data can be labeled daily, and the efficiency is improved by 163 times. In addition, the machine cost of the model marking single data is 0.01 yuan, the cost of the manpower marking single data is 0.28 yuan, the cost is reduced by 96.5%, and the optimization effect is obvious.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a data marking device for realizing the data marking method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data labeling device provided below may refer to the limitation of the data labeling method hereinabove, and will not be repeated herein.
In one exemplary embodiment, as shown in fig. 10, there is provided a data labeling device 1000 comprising: a data acquisition module 1002, a preliminary matching module 1004, a labeling triggering module 1006, and a data set updating module 1008, wherein:
the data obtaining module 1002 is configured to obtain at least one data to be annotated from the data set to be annotated, and determine at least two candidate categories;
the preliminary matching module 1004 is configured to determine, for each data to be annotated in the at least one data to be annotated, at least two preliminary matching categories that match the data to be annotated from at least two candidate categories;
the labeling triggering module 1006 is configured to label the target data to be labeled based on at least two kinds of preliminary matching categories when the at least two kinds of preliminary matching categories include a reference category, so as to obtain labeled data of the target data to be labeled; the reference category is determined by counting the annotated data in the annotated data set;
the data set updating module 1008 is configured to update the marked data set by the marked data of each of the at least one data to be marked, and continue marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
In one embodiment, the labeling triggering module 1006 is further configured to obtain a reference category corresponding to the labeled dataset, and determine a labeling triggering condition of the reference category; and when the at least two preliminary matching categories comprise the reference category and the marking triggering condition is met, marking the targeted data to be marked based on the at least two preliminary matching categories, and obtaining marked data of the targeted data to be marked.
In one embodiment, the statistical parameters of the marked data in the marked data set marked as the reference category satisfy the minority category determination condition; the annotation triggering module 1006 is further configured to, when at least two types of preliminary matching categories include a reference category, obtain a matching result corresponding to the reference category; the matching result corresponding to the reference category is obtained by matching the data to be marked with the reference category; and when the matching result corresponding to the reference category meets the marking triggering condition, marking the data to be marked according to at least two preliminary matching categories, and obtaining marked data of the data to be marked.
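How a reference category might be derived by counting the marked data can be sketched as follows. This is an illustration under stated assumptions: the statistical parameter is taken to be each category's share of the marked data set, and the minority category determination condition is taken to be "share below `max_share`". Both the function name and the threshold are hypothetical; the application does not fix them here.

```python
from collections import Counter


def reference_categories(labels: list[str], all_categories: list[str],
                         max_share: float = 0.1) -> set[str]:
    """Return the categories whose share of the marked data is below max_share."""
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        # Nothing is labeled yet, so every category is under-represented.
        return set(all_categories)
    # Counter returns 0 for categories that have not been seen at all.
    return {c for c in all_categories if counts[c] / total < max_share}
```

Recomputing this set after each update of the marked data set lets the reference categories change as the category distribution evens out.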
In one embodiment, the device further comprises a non-minority category labeling module, wherein the non-minority category labeling module is used for acquiring a non-minority category labeling condition when the at least two preliminary matching categories do not comprise the reference category; when the matching results respectively corresponding to the at least two preliminary matching categories meet the non-minority category labeling condition, marking the targeted data to be marked based on the at least two preliminary matching categories, and obtaining marked data of the targeted data to be marked; the matching results of the at least two preliminary matching categories are obtained by matching the data to be marked with the at least two preliminary matching categories respectively.
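The two branches described above (reference category present, or absent) can be sketched as one decision function. This is a minimal sketch: the labeling triggering condition and the non-minority category labeling condition are passed in as predicates on a matching score, because their concrete form is not specified here. The name `should_label` and the example thresholds in the usage note are hypothetical.

```python
from typing import Callable


def should_label(match_scores: dict[str, float],
                 reference: set[str],
                 trigger_condition: Callable[[float], bool],
                 non_minority_condition: Callable[[float], bool]) -> bool:
    """Decide whether a piece of data goes on to labeling.

    match_scores maps each preliminary matching category to its matching result.
    """
    hits = [c for c in match_scores if c in reference]
    if hits:
        # A reference (minority) category is present: only its matching
        # result is checked against the labeling triggering condition.
        return any(trigger_condition(match_scores[c]) for c in hits)
    # No reference category: every preliminary matching category must
    # satisfy the non-minority category labeling condition.
    return all(non_minority_condition(s) for s in match_scores.values())
```

For example, `should_label(scores, {"rare"}, lambda s: s >= 0.5, lambda s: s <= 0.7)` would label data that matches the rare category reasonably well, and otherwise only data whose matches are all relatively uncertain; both thresholds are illustrative only.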
In one embodiment, the preliminary matching module 1004 is further configured to match the data to be marked with at least two candidate categories, so as to obtain matching results corresponding to the at least two candidate categories respectively; and screening the at least two candidate categories according to the matching results respectively corresponding to the at least two candidate categories to obtain at least two preliminary matching categories matched with the data to be marked.
In one embodiment, the targeted data to be annotated includes text to be annotated; the preliminary matching module 1004 is further configured to splice the text to be annotated with at least two candidate categories, so as to obtain spliced texts corresponding to the at least two candidate categories; acquiring a text matching model; the text matching model is obtained by updating marked data in the marked data set; and respectively carrying out text matching on the spliced texts respectively corresponding to the at least two candidate categories through a text matching model to obtain matching results respectively corresponding to the at least two candidate categories.
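The splicing and matching steps can be sketched as follows. The separator string, the function names, and in particular `toy_score` are assumptions for illustration: `toy_score` is a simple word-overlap stand-in so that the sketch runs on its own, and it is not the text matching model of this application, which is obtained by updating on the marked data set.

```python
from typing import Callable

SEP = " [SEP] "  # hypothetical separator between text and category


def splice(text: str, category: str) -> str:
    """Concatenate the text to be annotated with one candidate category."""
    return f"{text}{SEP}{category}"


def match_all(text: str, categories: list[str],
              score: Callable[[str], float]) -> dict[str, float]:
    """Score one spliced text per candidate category."""
    return {c: score(splice(text, c)) for c in categories}


def toy_score(spliced: str) -> float:
    """Stand-in matcher: Jaccard overlap of the words on each side of SEP."""
    text, category = spliced.split(SEP)
    a, b = set(text.lower().split()), set(category.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In practice `score` would wrap a call to the trained text matching model; keeping it as a parameter leaves the splicing logic independent of the model used.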
In one embodiment, the preliminary matching module 1004 is further configured to obtain a text matching model to be updated and a labeled dataset; the marked data in the marked data set comprises marked texts and marking categories aiming at the marked texts; performing text matching on the marked text and the marking category corresponding to the marked text through a text matching model to be updated to obtain a matching result of the marked text; and updating the model parameters of the text matching model to be updated according to the matching result of the marked text to obtain the text matching model.
In one embodiment, the preliminary matching module 1004 is further configured to obtain a matching result retention condition; screening at least two reserved categories meeting the reserved conditions of the matched result from the at least two candidate categories according to the matched results respectively corresponding to the at least two candidate categories; and obtaining at least two preliminary matching categories matched with the data to be marked based on the at least two reserved categories.
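The screening by a matching result retention condition can be sketched as follows, assuming the condition is a minimum score. The threshold `min_score`, the fallback to the top-ranked categories when fewer than two pass, and the function name `screen` are assumptions made so that at least two retained categories always result.

```python
def screen(scores: dict[str, float], min_score: float = 0.3,
           min_keep: int = 2) -> list[str]:
    """Keep categories meeting the retention condition, best first."""
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    kept = [c for c in ranked if scores[c] >= min_score]
    if len(kept) < min_keep:
        # Assumed fallback: keep the top-ranked categories so that at
        # least two preliminary matching categories are always produced.
        kept = ranked[:min_keep]
    return kept
```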
In one embodiment, the annotation trigger module 1006 is further configured to obtain a data annotation model and annotation hint words; labeling the data to be labeled according to the labeling prompt words and at least two preliminary matching categories through a data labeling model, and obtaining labeling results of the data to be labeled; and obtaining marked data aiming at the data to be marked based on the marking result.
In one embodiment, the annotation trigger module 1006 is further configured to obtain an annotation data sample, and determine a candidate annotation model and a candidate hint word; labeling according to the candidate prompt words and the labeling data samples through the candidate labeling model to obtain sample labeling results of the labeling data samples; and when the sample labeling result is matched with the category label of the labeling data sample, obtaining a data labeling model according to the candidate labeling model, and obtaining labeling prompting words according to the candidate prompting words.
In one embodiment, the labeling trigger module 1006 is further configured to obtain a category label of the labeling data sample; the category labels comprise at least two labeling labels which are obtained by labeling the labeling data samples based on different labeling modes; determining a first labeling difference between the sample labeling result and at least two labeling labels and a second labeling difference between the at least two labeling labels respectively; and when the first annotation difference is matched with the second annotation difference, determining the candidate annotation model as a data annotation model, and determining the candidate prompt word as an annotation prompt word.
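The comparison of the first and second labeling differences can be sketched as follows, assuming two human annotators and a labeling difference measured as a disagreement rate. The `tolerance` value and the one-sided test (the model may disagree with the humans no more than they disagree with each other, plus a margin) are assumptions about what "matched" means; the function names are hypothetical.

```python
def disagreement(xs: list[str], ys: list[str]) -> float:
    """Fraction of samples on which two label sequences differ."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(xs, ys)) / len(xs)


def model_is_suitable(model_labels: list[str], human_a: list[str],
                      human_b: list[str], tolerance: float = 0.05) -> bool:
    """Accept the candidate model and prompt if the model-vs-human gap is
    close to the human-vs-human gap."""
    # First labeling difference: model against each human, averaged.
    first = (disagreement(model_labels, human_a)
             + disagreement(model_labels, human_b)) / 2
    # Second labeling difference: the humans against each other.
    second = disagreement(human_a, human_b)
    return first <= second + tolerance
```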
In one embodiment, the labeling trigger module 1006 is further configured to obtain a labeling result constraint condition; updating the labeling result based on the labeling result constraint condition to obtain an updated labeling result; and obtaining marked data of the data to be marked according to the updated marking result.
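One possible labeling result constraint can be sketched as follows, assuming the constraint is that the output of the data annotation model must be one of the preliminary matching categories. That choice of constraint, the normalization applied, and the name `apply_constraint` are assumptions for illustration; the constraint condition is not fixed in this passage.

```python
from typing import Optional


def apply_constraint(raw_result: str, allowed: list[str],
                     fallback: Optional[str] = None) -> Optional[str]:
    """Map a raw model output onto an allowed category, or return fallback."""
    # Normalize whitespace, surrounding double quotes and letter case.
    cleaned = raw_result.strip().strip('"').lower()
    for category in allowed:
        if cleaned == category.lower():
            return category
    return fallback
```

Returning `None` by default lets the caller decide whether an out-of-range result is discarded or sent back for another labeling attempt.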
In one embodiment, the data set update module 1008 is further configured to add the respective marked data of the at least one data to be marked to the marked data set, and return to perform the step of obtaining the at least one data to be marked from the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
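The update-and-return loop can be sketched as follows. The callable `label_one` stands for the whole matching, triggering and labeling path for one piece of data and returns `None` when labeling is not triggered; that interface, the batch size, and the `max_rounds` safety limit are assumptions added so that the sketch terminates.

```python
from typing import Callable, Optional

Labeled = list[tuple[str, str]]  # (data, category) pairs


def label_until_done(unlabeled: list[str], labeled: Labeled,
                     label_one: Callable[[str, Labeled], Optional[str]],
                     batch_size: int = 2,
                     max_rounds: int = 100) -> tuple[Labeled, list[str]]:
    """Repeatedly take a batch, label it, and add the results to `labeled`."""
    pending = list(unlabeled)
    for _ in range(max_rounds):
        if not pending:
            break
        batch, pending = pending[:batch_size], pending[batch_size:]
        deferred = []
        for item in batch:
            # label_one sees the current marked data set, so later items
            # benefit from the labels added earlier.
            category = label_one(item, labeled)
            if category is None:
                deferred.append(item)  # not triggered this round; retry later
            else:
                labeled.append((item, category))
        pending.extend(deferred)
    return labeled, pending
```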
The modules in the data labeling device can be realized in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one exemplary embodiment, a computer device, which may be a terminal or a server, is provided, and an internal structure diagram thereof may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the data processed by the data labeling method. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a data labeling method.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use, and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and the computer program, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, and the like, without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of the technical features, they should be considered to be within the scope of this description.
The above examples represent only a few embodiments of the present application. They are described in specific detail but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (23)

1. A method of labeling data, the method comprising:
acquiring at least one data to be marked from a data set to be marked, and determining at least two candidate categories;
for each datum to be marked in the at least one datum to be marked, determining at least two preliminary matching categories matched with the aimed datum to be marked from the at least two candidate categories;
acquiring a reference category corresponding to the marked data set, and determining a marking triggering condition of the reference category; the reference category is determined by counting the marked data in the marked data set; a statistical parameter of the marked data labeled as the reference category in the marked data set meets a minority category determination condition;
when the at least two preliminary matching categories comprise the reference category, obtaining a matching result corresponding to the reference category; the matching result corresponding to the reference category is obtained by matching the data to be marked with the reference category; when the matching result corresponding to the reference category meets the marking triggering condition, marking the data to be marked according to the at least two preliminary matching categories, and obtaining marked data of the data to be marked;
when the reference category is not included in the at least two preliminary matching categories, acquiring non-minority category labeling conditions; when the matching results corresponding to the at least two preliminary matching categories respectively meet the non-minority category marking conditions, marking the targeted data to be marked based on the at least two preliminary matching categories to obtain marked data of the targeted data to be marked; the matching results corresponding to the at least two preliminary matching categories are obtained by matching the data to be marked with the at least two preliminary matching categories respectively;
Updating the marked data set through the marked data of each of the at least one piece of data to be marked, and continuing marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
2. The method of claim 1, wherein the determining at least two preliminary matching categories from the at least two candidate categories that match the data to be annotated comprises:
matching the data to be marked with the at least two candidate categories respectively to obtain matching results corresponding to the at least two candidate categories respectively;
and screening the at least two candidate categories according to the matching results respectively corresponding to the at least two candidate categories to obtain at least two preliminary matching categories matched with the data to be marked.
3. The method according to claim 2, wherein the targeted data to be annotated comprises text to be annotated; the step of respectively matching the data to be marked with the at least two candidate categories to obtain matching results respectively corresponding to the at least two candidate categories comprises the following steps:
Splicing the text to be marked with the at least two candidate categories respectively to obtain spliced texts corresponding to the at least two candidate categories respectively;
acquiring a text matching model; the text matching model is obtained by updating marked data in the marked data set;
and respectively performing text matching on the spliced texts respectively corresponding to the at least two candidate categories through the text matching model to obtain matching results respectively corresponding to the at least two candidate categories.
4. The method of claim 3, wherein the obtaining a text matching model comprises:
acquiring a text matching model to be updated and a marked data set; the marked data in the marked data set comprises marked texts and marking categories aiming at the marked texts;
performing text matching on the marked text and the marking category corresponding to the marked text through the text matching model to be updated to obtain a matching result of the marked text;
and updating the model parameters of the text matching model to be updated according to the matching result of the marked text to obtain the text matching model.
5. The method according to claim 2, wherein the screening the at least two candidate categories according to the matching results corresponding to the at least two candidate categories respectively to obtain at least two preliminary matching categories that match the data to be annotated includes:
obtaining a matching result reservation condition;
screening at least two kinds of reserved categories meeting the reserved conditions of the matched results from the at least two kinds of candidate categories according to the matched results respectively corresponding to the at least two kinds of candidate categories;
and obtaining at least two preliminary matching categories matched with the data to be marked based on the at least two reserved categories.
6. The method according to claim 1, wherein labeling the targeted data to be labeled based on the at least two preliminary matching categories to obtain labeled data of the targeted data to be labeled, comprises:
acquiring a data annotation model and annotation prompt words;
marking the data to be marked according to the marking prompt word and the at least two preliminary matching categories through the data marking model, and obtaining a marking result of the data to be marked;
And obtaining marked data of the data to be marked based on the marking result.
7. The method of claim 6, wherein the obtaining the data annotation model and the annotation cue comprises:
acquiring a labeling data sample, and determining a candidate labeling model and a candidate prompt word;
labeling according to the candidate prompt words and the labeling data samples through the candidate labeling model to obtain sample labeling results of the labeling data samples;
and when the sample labeling result is matched with the category label of the labeling data sample, obtaining a data labeling model according to the candidate labeling model, and obtaining labeling prompt words according to the candidate prompt words.
8. The method of claim 7, wherein when the sample labeling result matches with the category label of the labeled data sample, obtaining a data labeling model according to the candidate labeling model, and obtaining a labeling prompt according to the candidate prompt, comprises:
obtaining a category label of the labeling data sample; the category labels comprise at least two labeling labels which are obtained by labeling the labeling data samples based on different labeling modes;
Determining a first labeling difference between the sample labeling result and the at least two labeling labels and a second labeling difference between the at least two labeling labels respectively;
and when the first annotation difference matches the second annotation difference, determining the candidate annotation model as a data annotation model, and determining the candidate prompt word as an annotation prompt word.
9. The method of claim 6, wherein the obtaining the annotated data for the data to be annotated based on the annotation result comprises:
obtaining constraint conditions of labeling results;
updating the labeling result based on the labeling result constraint condition to obtain an updated labeling result;
and obtaining marked data of the data to be marked according to the updated marking result.
10. The method according to any one of claims 1 to 9, wherein the updating the marked data set by the marked data of each of the at least one data to be marked and continuing to mark the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked, includes:
And adding the marked data of each at least one piece of data to be marked into the marked data set, and returning to execute the step of acquiring the at least one piece of data to be marked from the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
11. A data tagging device, the device comprising:
the data acquisition module is used for acquiring at least one datum to be marked from the datum set to be marked and determining at least two candidate categories;
the preliminary matching module is used for determining at least two preliminary matching categories matched with the data to be marked from the at least two candidate categories aiming at each data to be marked in the at least one data to be marked;
the marking triggering module is used for acquiring the reference category corresponding to the marked data set and determining a marking triggering condition of the reference category; the reference category is determined by counting the marked data in the marked data set; a statistical parameter of the marked data labeled as the reference category in the marked data set meets a minority category determination condition; when the at least two preliminary matching categories comprise the reference category, obtaining a matching result corresponding to the reference category; the matching result corresponding to the reference category is obtained by matching the data to be marked with the reference category; when the matching result corresponding to the reference category meets the marking triggering condition, marking the data to be marked according to the at least two preliminary matching categories, and obtaining marked data of the data to be marked; when the reference category is not included in the at least two preliminary matching categories, acquiring non-minority category labeling conditions; when the matching results corresponding to the at least two preliminary matching categories respectively meet the non-minority category marking conditions, marking the targeted data to be marked based on the at least two preliminary matching categories to obtain marked data of the targeted data to be marked; the matching results corresponding to the at least two preliminary matching categories are obtained by matching the data to be marked with the at least two preliminary matching categories respectively;
And the data set updating module is used for updating the marked data set through the marked data of each of the at least one piece of data to be marked, and continuously marking the data to be marked in the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
12. The apparatus of claim 11, wherein:
the preliminary matching module is further configured to match the data to be marked with the at least two candidate categories, so as to obtain matching results corresponding to the at least two candidate categories; and screening the at least two candidate categories according to the matching results respectively corresponding to the at least two candidate categories to obtain at least two preliminary matching categories matched with the data to be marked.
13. The apparatus of claim 12, wherein the targeted data to be annotated comprises text to be annotated;
the preliminary matching module is further configured to splice the text to be annotated with the at least two candidate categories, respectively, to obtain spliced texts corresponding to the at least two candidate categories, respectively; acquiring a text matching model; the text matching model is obtained by updating marked data in the marked data set; and respectively performing text matching on the spliced texts respectively corresponding to the at least two candidate categories through the text matching model to obtain matching results respectively corresponding to the at least two candidate categories.
14. The apparatus of claim 13, wherein:
the primary matching module is also used for acquiring a text matching model to be updated and a marked data set; the marked data in the marked data set comprises marked texts and marking categories aiming at the marked texts; performing text matching on the marked text and the marking category corresponding to the marked text through the text matching model to be updated to obtain a matching result of the marked text; and updating the model parameters of the text matching model to be updated according to the matching result of the marked text to obtain the text matching model.
15. The apparatus of claim 12, wherein:
the preliminary matching module is also used for acquiring a matching result reservation condition; screening at least two kinds of reserved categories meeting the reserved conditions of the matched results from the at least two kinds of candidate categories according to the matched results respectively corresponding to the at least two kinds of candidate categories; and obtaining at least two preliminary matching categories matched with the data to be marked based on the at least two reserved categories.
16. The apparatus of claim 11, wherein:
the annotation triggering module is also used for acquiring a data annotation model and annotation prompt words; marking the data to be marked according to the marking prompt word and the at least two preliminary matching categories through the data marking model, and obtaining a marking result of the data to be marked; and obtaining marked data of the data to be marked based on the marking result.
17. The apparatus of claim 16, wherein:
the annotation triggering module is also used for acquiring an annotation data sample and determining a candidate annotation model and a candidate prompt word; labeling according to the candidate prompt words and the labeling data samples through the candidate labeling model to obtain sample labeling results of the labeling data samples; and when the sample labeling result is matched with the category label of the labeling data sample, obtaining a data labeling model according to the candidate labeling model, and obtaining labeling prompt words according to the candidate prompt words.
18. The apparatus of claim 17, wherein:
the annotation triggering module is further used for acquiring category labels of the annotation data samples; the category labels comprise at least two labeling labels which are obtained by labeling the labeling data samples based on different labeling modes; determining a first labeling difference between the sample labeling result and the at least two labeling labels and a second labeling difference between the at least two labeling labels respectively; and when the first annotation difference matches the second annotation difference, determining the candidate annotation model as a data annotation model, and determining the candidate prompt word as an annotation prompt word.
19. The apparatus of claim 16, wherein:
the marking triggering module is also used for acquiring marking result constraint conditions; updating the labeling result based on the labeling result constraint condition to obtain an updated labeling result; and obtaining marked data of the data to be marked according to the updated marking result.
20. The device according to any one of claims 11 to 19, wherein,
the data set updating module is further configured to add the marked data of each of the at least one data to be marked to the marked data set, and return to perform the step of obtaining the at least one data to be marked from the data set to be marked until the data marking is completed for the data to be marked in the data set to be marked.
21. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when the computer program is executed.
22. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.
23. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 10.
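The iterative flow recited in claim 20 (label a batch, add the results to the marked data set, then return to acquire the next batch until nothing remains) can be illustrated with a minimal Python sketch. The sketch is not taken from the patent and makes several assumptions: the reference category is taken to be the candidate category with the fewest items in the marked data set so far; the hypothetical callback `top_matches` stands in for whatever mechanism produces the preliminary matching categories; and the fallback used when the reference category is not among the preliminary matches is invented purely so that the loop terminates. All function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def reference_category(labeled: Sequence[Tuple[str, str]],
                       candidates: Sequence[str]) -> str:
    """Assumed rule: the candidate with the fewest labeled items so far.

    Ties go to the candidate that appears first in `candidates`.
    """
    counts = Counter(label for _, label in labeled)
    return min(candidates, key=lambda c: counts[c])


def label_all(unlabeled: Sequence[str],
              candidates: Sequence[str],
              top_matches: Callable[[str, Sequence[str]], List[str]],
              batch_size: int = 2) -> List[Tuple[str, str]]:
    """Label every item, updating the labeled set after each batch."""
    pending = list(unlabeled)
    labeled: List[Tuple[str, str]] = []
    while pending:
        # Acquire at least one item to be labeled from the pending set.
        batch, pending = pending[:batch_size], pending[batch_size:]
        # Statistics are recomputed from the labeled set before each batch.
        ref = reference_category(labeled, candidates)
        newly_labeled = []
        for item in batch:
            # Hypothetical callback: preliminary matches, best first.
            matches = top_matches(item, candidates)
            # Assumption: pick the reference category when it is among the
            # preliminary matches, otherwise fall back to the best match.
            label = ref if ref in matches else matches[0]
            newly_labeled.append((item, label))
        # Update the labeled set, then return to the acquisition step.
        labeled.extend(newly_labeled)
    return labeled


def toy_matches(item: str, candidates: Sequence[str]) -> List[str]:
    """Toy stand-in: every item matches the first two candidates."""
    return list(candidates[:2])


result = label_all(["a", "b", "c", "d"], ["x", "y"], toy_matches)
```

In this toy run the first batch is assigned to one category and the second batch to the other, because the per-category counts are recomputed after each update of the labeled set. That recomputation is the part of the loop that nudges the labeled data toward a balanced category distribution under the stated assumptions.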
CN202410124620.XA 2024-01-30 2024-01-30 Data labeling method, device, computer equipment and storage medium Active CN117649567B (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN202410124620.XA CN117649567B (en) 2024-01-30 2024-01-30 Data labeling method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Publication Number Priority Date Filing Date Title
CN202410124620.XA CN117649567B (en) 2024-01-30 2024-01-30 Data labeling method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117649567A (en) 2024-03-05
CN117649567B (en) 2024-04-09

Family

ID=90046418

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202410124620.XA Active CN117649567B (en) 2024-01-30 2024-01-30 Data labeling method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117649567B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687621A (en) * 2022-11-07 2023-02-03 Agricultural Bank of China Limited Short text label labeling method and device
CN116958626A (en) * 2023-01-31 2023-10-27 Tencent Technology (Shenzhen) Co., Ltd. Image classification model training, image classification method and device and electronic equipment
WO2023221634A1 (en) * 2022-05-19 2023-11-23 Tencent Technology (Shenzhen) Co., Ltd. Video detection method and apparatus, and device, storage medium and program product
CN117194966A (en) * 2022-05-27 2023-12-08 Tencent Technology (Shenzhen) Co., Ltd. Training method and related device for object classification model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023221634A1 (en) * 2022-05-19 2023-11-23 Tencent Technology (Shenzhen) Co., Ltd. Video detection method and apparatus, and device, storage medium and program product
CN117194966A (en) * 2022-05-27 2023-12-08 Tencent Technology (Shenzhen) Co., Ltd. Training method and related device for object classification model
CN115687621A (en) * 2022-11-07 2023-02-03 Agricultural Bank of China Limited Short text label labeling method and device
CN116958626A (en) * 2023-01-31 2023-10-27 Tencent Technology (Shenzhen) Co., Ltd. Image classification model training, image classification method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Discussion on improving methods for automatic classification of query intent; He Guoxiu et al.; Digital Library Forum; 2018-12-31; pp. 1-8 *

Also Published As

Publication number Publication date
CN117649567A (en) 2024-03-05

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN109783666B (en) Image scene graph generation method based on iterative refinement
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN111598153B (en) Data clustering processing method and device, computer equipment and storage medium
CN110750998B (en) Text output method, device, computer equipment and storage medium
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
CN110889505A (en) Cross-media comprehensive reasoning method and system for matching image-text sequences
CN110867225A (en) Character-level clinical concept extraction named entity recognition method and system
CN114372532A (en) Method, device, equipment, medium and product for determining label marking quality
CN116821307B (en) Content interaction method, device, electronic equipment and storage medium
Ávila et al. A gene expression programming algorithm for multi-label classification
CN117649567B (en) Data labeling method, device, computer equipment and storage medium
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN114898426A (en) Synonym label aggregation method, device, equipment and storage medium
CN112861474A (en) Information labeling method, device, equipment and computer readable storage medium
CN115658964B (en) Training method and device for pre-training model and somatosensory wind identification model
CN112131883B (en) Language model training method, device, computer equipment and storage medium
CN115269851B (en) Article classification method, apparatus, electronic device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant