CN114372532A - Method, device, equipment, medium and product for determining label marking quality - Google Patents

Publication number
CN114372532A
Authority
CN
China
Prior art keywords
sample data
sample
data
label
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210027995.5A
Other languages
Chinese (zh)
Inventor
刘伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210027995.5A
Publication of CN114372532A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application discloses a method, a device, equipment, a medium and a product for determining label marking quality, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring target sample data from the sample data set, wherein the target sample data is correspondingly marked with a first sample label; based on the data content similarity between the target sample data and the candidate sample data, determining similar sample data meeting the similarity requirement with the target sample data from the candidate sample data, wherein the similar sample data is correspondingly marked with a second sample label; determining label matching information between the first sample label and the second sample label based on the label similarity between the first sample label and the second sample label; and determining the label labeling quality of the target sample data based on the label matching information. The evaluation process of the label quality is automatically realized, and the quality evaluation efficiency is improved.

Description

Method, device, equipment, medium and product for determining label marking quality
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, a medium, and a product for determining label quality.
Background
In the field of Artificial Intelligence (AI), machine learning models are trained by supervised or unsupervised training, and supervised training requires training samples whose labels have been annotated in advance. The labeling quality of the labels in the training samples often affects the training effect of the downstream machine learning model.
In the related art, training sample labels are generally verified by manual sampling inspection: a certain number of samples are randomly drawn from the labeled sample data and manually rechecked, that is, the labeling quality of the labels is ensured through manual recheck.
However, rechecking the labeling quality of training sample labels in this manner depends on manual work, consumes considerable human resources, and has low evaluation efficiency.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment, a medium and a product for determining label marking quality, and the evaluation efficiency of the label marking quality is improved. The technical scheme is as follows:
In one aspect, a method for determining label labeling quality is provided, and the method includes:
acquiring target sample data from a sample data set, wherein the sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label;
determining similar sample data meeting the similarity requirement with the target sample data from the candidate sample data based on the data content similarity between the target sample data and the candidate sample data, wherein the similar sample data is correspondingly marked with a second sample label, and the candidate sample data is sample data different from the target sample data in the sample data set;
determining label matching information between the first sample label and the second sample label based on label similarity between the first sample label and the second sample label;
and determining the label labeling quality of the target sample data based on the label matching information.
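The four steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation described in the application: the `Sample` class, the word-overlap (Jaccard) measure used as the data content similarity, the exact-match comparison used as the label matching information, the default threshold of 0.5, and the averaging used as the quality score are all hypothetical choices made for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical container: one piece of sample data with a single text label."""
    sample_id: str
    content: str
    label: str


def content_similarity(a: str, b: str) -> float:
    """Assumed content similarity: Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def label_quality(target: Sample, dataset: List[Sample],
                  threshold: float = 0.5) -> Optional[float]:
    """Return an assumed quality score in [0, 1], or None if no similar sample exists."""
    # Step 2: candidates are all samples other than the target; keep those
    # whose content similarity meets the (assumed) threshold.
    similar = [
        s for s in dataset
        if s.sample_id != target.sample_id
        and content_similarity(target.content, s.content) >= threshold
    ]
    if not similar:
        return None
    # Step 3: label matching information, here 1.0 for identical labels, else 0.0.
    matches = [1.0 if s.label == target.label else 0.0 for s in similar]
    # Step 4: the share of similar samples that agree with the target's label.
    return sum(matches) / len(matches)


# Step 1: pick a target from a toy sample data set.
data = [
    Sample("a", "how to reset my password", "account"),
    Sample("b", "how to reset my account password", "account"),
    Sample("c", "reset my password how", "billing"),
    Sample("d", "weather forecast for tomorrow", "weather"),
]
quality_a = label_quality(data[0], data)
quality_c = label_quality(data[2], data)
quality_d = label_quality(data[3], data)
```

In this toy data set, sample "c" has two similar samples whose labels both disagree with its own, so it receives the lowest score and would be a natural candidate for recheck; sample "d" has no similar sample, so no score is produced for it.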
In another aspect, a method for determining label labeling quality is provided, where the method includes:
displaying an interactive interface, wherein the interactive interface is provided with a label quality determining function, and the label quality determining function is used for determining the label labeling quality of sample data in the sample data set;
receiving a data uploading operation in the interactive interface, wherein the data uploading operation is used for uploading the sample data set comprising target sample data to a server, sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label;
displaying a quality analysis result, wherein the quality analysis result is used for indicating the label labeling quality of the target sample data; the label labeling quality is determined by the server by first determining, based on the data content similarity between the target sample data and candidate sample data, similar sample data meeting the similarity requirement, and then obtaining label matching information between the first sample label and a second sample label; the similar sample data is correspondingly labeled with the second sample label, and the candidate sample data is sample data in the sample data set that is different from the target sample data.
In another aspect, an apparatus for determining label quality is provided, the apparatus comprising:
the acquisition module is used for acquiring target sample data from a sample data set, wherein sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label;
a determining module, configured to determine, based on a data content similarity between the target sample data and candidate sample data, similar sample data that meets a similarity requirement with the target sample data from the candidate sample data, where the similar sample data is correspondingly labeled with a second sample tag, and the candidate sample data is sample data in the sample data set that is different from the target sample data;
the acquisition module is further configured to determine, based on the label similarity between the first sample label and the second sample label, label matching information between the first sample label and the second sample label;
the determining module is further configured to determine the label labeling quality of the target sample data based on the label matching information.
In another aspect, an apparatus for determining label quality is provided, the apparatus comprising:
the display module is used for displaying an interactive interface, and the interactive interface is provided with a function for determining the label marking quality;
the receiving module is used for receiving data uploading operation in the interactive interface, the data uploading operation is used for uploading the sample data set comprising target sample data to a server, sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label;
the display module is further configured to display a quality analysis result, where the quality analysis result is used to indicate the label labeling quality of the target sample data; the label labeling quality is determined by the server by first determining, based on the data content similarity between the target sample data and candidate sample data, similar sample data meeting the similarity requirement, and then obtaining label matching information between the first sample label and a second sample label; the similar sample data is correspondingly labeled with the second sample label, and the candidate sample data is sample data in the sample data set that is different from the target sample data.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for determining label labeling quality according to any one of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the method for determining label quality in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the method for determining the label labeling quality in any of the above embodiments.
The technical scheme provided by the application at least comprises the following beneficial effects:
after the label marking of the sample data is finished, when the quality of the marked label needs to be evaluated, similar sample data with a similar relation with the target sample data is obtained, and the label marking quality corresponding to the target sample data is determined according to the label similarity between the sample label of the target sample data and the sample label of the similar sample data, so that the evaluation process of the label quality of the sample data is automatically realized, and the quality evaluation efficiency is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of an architecture between a terminal device and a server according to an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a method for determining label quality according to an exemplary embodiment of the present application;
FIG. 4 is a schematic illustration of a screening of similar sample data provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a tag annotation quality determination process for target sample data according to an exemplary embodiment of the present application;
FIG. 6 is a flowchart of a method for determining label quality according to another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a front-end interactive interface provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a front-end interactive interface provided by another exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for determining label quality according to an exemplary embodiment of the present application;
FIG. 10 is a schematic illustration of a text similarity calculation provided by an exemplary embodiment of the present application;
FIG. 11 is a flow chart of label labeling quality assessment provided by an exemplary embodiment of the present application;
FIG. 12 is a flowchart of a method for determining similar sample data provided by an exemplary embodiment of the present application;
FIG. 13 is a schematic diagram illustrating a sample data clustering process provided by an exemplary embodiment of the present application;
FIG. 14 is a schematic diagram of clustering indexes to determine similar sample data as provided by an exemplary embodiment of the present application;
FIG. 15 is a flowchart of a method for generating a contradictory sample library provided by an exemplary embodiment of the present application;
FIG. 16 is a flowchart of contradictory sample re-labeling provided by an exemplary embodiment of the present application;
FIG. 17 is a block diagram of a device for determining quality of label labeling according to an exemplary embodiment of the present application;
FIG. 18 is a block diagram of a device for determining quality of label labeling according to another exemplary embodiment of the present application;
FIG. 19 is a block diagram of a device for determining quality of label labeling according to another exemplary embodiment of the present application;
FIG. 20 is a schematic structural diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application are briefly described:
Artificial intelligence is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
Computer Vision (CV) is a science that studies how to make machines "see". More specifically, it refers to using cameras and computers in place of human eyes to perform machine vision tasks such as identifying, tracking, and measuring targets, and to further perform image processing so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, and other technologies, and also include common feature recognition technologies such as face recognition.
The method for determining label labeling quality provided in the embodiments of the present application can be used to evaluate the quality of the labels annotated on training samples in machine learning or deep learning, and can improve the efficiency of that evaluation. The evaluation involves natural language processing technology and/or computer vision technology, which are used to determine the similarity between sample contents and the matching between sample labels, thereby completing the evaluation of labeling quality. Evaluating the label quality of training samples can also improve the training quality of machine learning or deep learning models; such models may belong to various fields such as natural language processing, computer vision, and speech processing, which is not limited herein.
Next, an application scenario of the embodiment of the present application will be schematically described.
Firstly, the method is applied to a training data preparation process of machine learning or deep learning model training, namely after the training data is labeled manually or by a machine, the label labeling quality of the training data is determined by the method for determining the label labeling quality provided by the embodiment of the application, so that the sample label with low labeling quality is conveniently rechecked and relabeled. Meanwhile, the method provided by the embodiment of the application can automatically realize the quality evaluation of the sample label, can cover all sample data, finds out the sample data with labeling quality not meeting the requirement, improves the efficiency of the overall quality evaluation, improves the accuracy of the overall quality evaluation, and can improve the accuracy of the downstream model training.
Secondly, the method is applied to the construction of a label system for internet content, where the internet content includes video content, image content, text content, voice content, and the like published on the internet. Taking a video platform as an example, each video in the video platform is correspondingly labeled with a label, which may be labeled by the uploader of the video or automatically labeled by the video platform. To prevent users from labeling arbitrarily and undermining the consistency of the label system, or to prevent automatically labeled labels from being wrong, quality evaluation may be performed on the videos and their corresponding labels. Illustratively, videos are obtained from the video platform to generate a sample data set; videos with similar content are screened out according to the content similarity between videos (for example, similarity detection is performed on video titles or video introductions); the labels of the similar videos are compared to obtain corresponding label matching data; and the labeling quality of the video labels is determined according to that label matching data. This makes it convenient for reviewers to audit the labels and take corresponding actions, helps unify the labels of videos with similar content, and makes it easier for users to retrieve videos with similar content on the video platform.
The implementation environment of the embodiments of the present application is described with reference to the above explanations of terms and descriptions of application scenarios. As shown in FIG. 1, the computer system of the embodiment includes: terminal device 110, server 120, and communication network 130.
The terminal device 110 includes various types of devices such as a mobile phone, a tablet computer, a desktop computer, and a laptop computer. Illustratively, the terminal device 110 runs a target application, and the target application is used for providing a function of evaluating the label labeling quality. Optionally, the target application includes various forms of applications such as a standalone application, a web application, an applet in a host application, and the like, which is not limited herein.
The server 120 is configured to provide back-end support for the target application, that is, provide logical operation support for the tag tagging quality evaluation function.
Illustratively, a user may upload a sample data set to be subjected to quality evaluation to the server 120 through an interactive interface provided by the target application in the terminal device 110, and indicate a corresponding sample similarity threshold. After receiving the sample data set, the server 120 processes the sample data and the corresponding labels according to the provided sample similarity threshold to obtain sample data having a similarity relationship, determines the label labeling quality of the sample data in the sample data set based on the label similarity of the sample labels among the sample data having the similarity relationship, and returns the label labeling quality to the terminal device 110. After receiving the label labeling quality, the terminal device 110 may display it in the interactive interface of the target application, where the displayed content may include the sample data identifier corresponding to the target sample data and its label labeling quality. In one example, the label labeling quality is a value in the range [0,1]; the closer the value is to 1, the higher the label labeling quality.
In some embodiments, the label quality evaluation function may also be implemented by the terminal device 110 alone, provided that the computing power of the terminal device 110 is sufficient to run the complete logic of the label quality evaluation function.
It should be noted that the server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Cloud technology is a hosting technology that unifies a series of resources such as hardware, software, and networks within a wide area network or a local area network to implement the computation, storage, processing, and sharing of data. Cloud technology is also a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model; these can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites, and web portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels is processed separately, and all kinds of industry data require strong system background support, which can only be achieved through cloud computing.
In some embodiments, the server 120 described above may also be implemented as a node in a blockchain system. The Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The block chain, which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform may include processing modules such as a management module, a basic service module, a smart contract module, and an operation module. The management module is responsible for managing the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and the like. The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and to record valid requests to storage after consensus is reached; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits the encrypted information completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic in a programming language and publish it to the blockchain (contract registration), the contract is triggered and executed by key invocation or other events according to the logic of the contract terms, and the module also provides functions for upgrading and cancelling contracts. The operation module is mainly responsible for deployment, configuration modification, contract setting, and cloud adaptation during product release, and for visualized output of real-time status during product operation.
Illustratively, the terminal device 110 and the server 120 are connected through a communication network 130, where the communication network 130 may be a wired network or a wireless network, and is not limited herein.
In addition, please refer to FIG. 2, which illustrates an architecture between a terminal device and a server according to an exemplary embodiment of the present application. The target application 211 in the terminal device 210 is a web application, that is, the target application 211 runs and displays a front-end page through a web browser. A connection is established between the terminal device 210 and the server 220 through the Hypertext Transfer Protocol (HTTP) or the Hypertext Transfer Protocol Secure (HTTPS); the terminal device 210 transmits a data file 201 including a sample data set over the connection, and the server 220 returns a result file 202 storing the label labeling quality data over the same connection. The Browser/Server architecture (B/S architecture) is used here only as an example; a Client/Server architecture (C/S architecture) may also be used between the two, which is not limited herein.
Based on the foregoing implementation environment and architecture, please refer to FIG. 3, which shows a flowchart of a method for determining label labeling quality according to an embodiment of the present application. In this embodiment, the method is described as applied to the server shown in FIG. 1; the method may also be applied to a terminal device, and the executing entity is not limited herein. The method includes the following steps:
301: Acquiring target sample data from the sample data set.
Sample data in the sample data set is correspondingly marked with a sample label, and target sample data is correspondingly marked with a first sample label. In some embodiments, the sample label corresponding to the sample data may be manually labeled or machine labeled, for example, the label labeling process of the sample data is completed through a label labeling model. Optionally, in the process of labeling the sample data, the label corresponding to the sample data may be a designated label, that is, the labeling personnel or the labeling machine selects the label corresponding to the sample data from designated label options; alternatively, the label corresponding to the sample data may be a non-specified label, for example, the annotating personnel annotate the sample data with subjective judgment, in which case, the sample labels corresponding to similar sample data may not be identical.
Optionally, the sample data set may be read from a database of the server, for example, after the server completes automatic tagging on the sample data in the sample data set, the sample data set is stored in the database, and when a tagging quality evaluation request for the sample data set is received, the server reads the sample data set from the database; alternatively, the sample data set may be received from a terminal device, for example, a user uploads the sample data set to be subjected to quality evaluation to a server through the terminal device.
The target sample data is the single sample data whose label labeling quality is currently being determined by the server, and may be any sample data in the sample data set. Optionally, the target sample data may be sample data specified by the terminal device, that is, when the terminal device instructs the server to perform labeling quality evaluation, it specifies that the server evaluates the label labeling quality of the target sample data in the sample data set; for example, the labeling quality evaluation request sent to the server carries a sample data identifier (ID) corresponding to the target sample data. Alternatively, the server determines the label labeling quality of all sample data in the sample data set, and the target sample data is the sample data currently traversed by the server. Alternatively, the target sample data is a certain amount of sample data obtained by the server through random sampling in the sample data set, for example, 30% of the sample data in the sample data set is randomly extracted as the target sample data.
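The three ways of choosing target sample data described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `select_targets`, its parameters, and the fixed random seed are all assumptions introduced here.

```python
import random

def select_targets(sample_ids, requested_id=None, fraction=None, seed=0):
    """Return the sample IDs to evaluate (hypothetical helper)."""
    sample_ids = list(sample_ids)
    if requested_id is not None:
        # Mode 1: the terminal device names one sample by its ID.
        if requested_id not in sample_ids:
            raise KeyError(requested_id)
        return [requested_id]
    if fraction is not None:
        # Mode 3: random sampling of a share of the set, e.g. 0.3 for 30%.
        count = min(len(sample_ids), max(1, round(len(sample_ids) * fraction)))
        return random.Random(seed).sample(sample_ids, count)
    # Mode 2: traverse every sample in the set.
    return sample_ids
```

Calling `select_targets(ids, fraction=0.3)` on a set of ten IDs would return three distinct IDs; which three depends on the seed.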
Optionally, the data format of the sample data in the sample data set may be any format, such as a text format, an image format, or a voice format, and the data format of the corresponding sample label may also be any format, such as a text format, an image format, or a voice format, which is not limited herein. Optionally, the number of sample labels corresponding to one sample data may be one, that is, the training task using the sample data is a single classification task; alternatively, the number of sample labels corresponding to one sample data may be multiple, that is, the training task using the sample data is a multi-classification task.
302: Determine, from the candidate sample data, similar sample data that meets the similarity requirement with respect to the target sample data, based on the data content similarity between the target sample data and the candidate sample data.
The similar sample data is correspondingly marked with a second sample label.
The candidate sample data is sample data different from the target sample data in the sample data set. Alternatively, the number of candidate sample data may be one or more, and is not limited herein.
Alternatively, the candidate sample data may be all sample data in the set of sample data except the target sample data. Or, the candidate sample data may be candidate sample data obtained by screening sample data in the sample data set in a target screening manner. In an example, the target screening method may be to cluster sample data in the sample data set in advance to obtain a plurality of sample clusters, and use all sample data or part of sample data except the target sample data in the sample cluster to which the target sample data belongs as the candidate sample data.
In some embodiments, the data content similarity between the target sample data and the candidate sample data is represented by calculating similarity data between the sample data. Illustratively, when the similarity of the data content of the sample data is determined, the sample data in different data forms may determine the candidate sample data through different similarity data, where the similarity data is used to indicate the similarity of the data content between the target sample data and the candidate sample data.
In one example, when the data form of the sample data is a text form, the similarity data may be obtained from at least one of sentence semantic similarity, word semantic similarity, character similarity, and the like between the text contents of the sample data. The sentence semantic similarity is data determined by dividing the text contents into sentences and comparing the semantic similarity between the sentences of two sample data; the word semantic similarity is data determined by dividing the text contents into words and comparing the semantic similarity between the segmented words of two sample data; the character similarity is data determined by comparing the similarity between the constituent characters of the text contents of two sample data.
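As a rough illustration of text similarity data, the sketch below computes a character similarity with Python's standard `difflib.SequenceMatcher` and a word-overlap score. Both choices are assumptions made here; the application does not name a specific measure, and the word-overlap score is only a crude lexical stand-in for word semantic similarity, which would normally require an embedding model.

```python
from difflib import SequenceMatcher

def character_similarity(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, text_a, text_b).ratio()

def word_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap of whitespace-separated words (lexical, not semantic)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)
```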
In another example, when the data form of the sample data is an image form, the similarity data may be obtained from at least one of a histogram distribution similarity, an image feature similarity, a pixel array similarity, and the like between image contents of the sample data. The histogram distribution similarity is determined by comparing the distribution similarity of histograms between image contents corresponding to the sample data; the image feature similarity is determined by calculating feature angle data or feature distance data between image features corresponding to different image contents after feature extraction is performed on the image contents; the pixel array similarity is data determined by calculating the similarity between gray value arrays corresponding to different image contents after converting pixels corresponding to the image contents into the gray value arrays.
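The histogram distribution similarity can be illustrated with a small sketch over gray value arrays. The bin count, the histogram-intersection measure, and the function names are assumptions chosen for brevity; the application does not fix how the histograms are compared.

```python
def gray_histogram(pixels, bins=4, max_value=256):
    """Normalized histogram of gray values in the range 0..max_value-1."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    return [count / len(pixels) for count in counts]

def histogram_intersection(hist_a, hist_b):
    """Overlap of two normalized histograms; 1.0 means identical distributions."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))
```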
In another example, when the data form of the sample data is a voice form, the similarity data may be obtained from at least one of an acoustic similarity, a transcribed-text semantic similarity, an audio feature similarity, and the like between the voice contents of the sample data. The acoustic similarity is data determined by dividing the voice contents into phonemes and comparing the similarity between the phonemes corresponding to different voice contents; the transcribed-text semantic similarity is data determined by converting the voice contents into text contents and comparing the semantic similarity between the different text contents; the audio feature similarity is data determined by converting the voice contents into audio features and calculating feature angle data or feature distance data between the different audio features.
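The feature angle data and feature distance data mentioned for image features and audio features can be computed as cosine similarity and Euclidean distance between feature vectors. The sketch below assumes the features have already been extracted into equal-length lists of numbers; the extraction step itself is outside its scope.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A larger cosine similarity or a smaller distance indicates more similar content; a distance would need to be mapped to a similarity score before being compared with a similarity threshold.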
In some embodiments, the similarity requirement may indicate that the similarity between the target sample data and the candidate sample data reaches a sample similarity threshold. Illustratively, the similarity data between the target sample data and the candidate sample data is determined from the data content similarity between them, and the similar sample data is determined from the candidate sample data by comparing the similarity data with the sample similarity threshold. The similarity requirement is used for screening, from the candidate sample data, sample data that has a data content similarity relationship with the target sample data.
Optionally, the sample similarity threshold may be preset by the system, or may be indicated by the user through the terminal device. In some embodiments, the similarity requirement may further indicate that all candidate sample data are sorted according to the determined similarity data, and the candidate sample data of the target proportion in the obtained sample sequence is determined as the similar sample data, for example, the candidate sample data is sorted in a descending order according to the similarity, and the top 20% of the candidate sample data in the obtained sample sequence is used as the similar sample data.
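The two forms of the similarity requirement, a threshold and a top proportion of the ranked candidates, can be sketched as below. The function names and the dictionary input (candidate ID mapped to similarity data) are assumptions for illustration; the default values 0.75 and 0.2 mirror the examples given in the text.

```python
def filter_by_threshold(similarities, threshold=0.75):
    """Keep candidates whose similarity reaches the sample similarity threshold."""
    return [cid for cid, sim in similarities.items() if sim >= threshold]

def filter_top_fraction(similarities, fraction=0.2):
    """Keep the top share of candidates after sorting by similarity, descending."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```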
Schematically, as shown in fig. 4, a screening diagram of similar sample data provided by an exemplary embodiment of the present application is shown. Similarity calculation is performed on the target sample data 410 and the candidate sample data 420 to obtain similarity data 430 corresponding to each candidate sample data 420 and the target sample data 410, and similar sample data 440 is obtained by screening from the candidate sample data 420 according to the similarity data 430.
303: Determine label matching information between the first sample label and the second sample label based on the label similarity between the first sample label and the second sample label.
The tag matching information is used to indicate a matching degree between the first sample tag and the second sample tag, and illustratively, the tag matching information is determined by a tag similarity between the first sample tag and the second sample tag.
Alternatively, the tag matching information between the first sample tag and the second sample tag may be determined by calculating tag similarity between the first sample tag and the second sample tag, for example, by performing feature extraction on the first sample tag to obtain a first tag feature vector, performing feature extraction on the second sample tag to obtain a second tag feature vector, determining the tag similarity between the first sample tag and the second sample tag by calculating a feature distance and/or a feature angle between the first tag feature vector and the second tag feature vector, and determining the tag matching information by the tag similarity. The method for determining the label matching information through the label similarity can be suitable for the condition that the label is a non-specified label in the labeling process of the sample data.
Optionally, the tag matching information between the first sample tag and the second sample tag may also be determined by comparing the first sample tag with the second sample tag to determine whether the two tags are consistent (i.e., the tag similarity between the sample tags needs to reach 100%), that is, by comparing the first sample tag with the second sample tag, the tag consistency between the first sample tag and the second sample tag is determined, and the tag matching information is determined by the tag consistency.
In some embodiments, the tag matching information is obtained by jointly calculating similarity data between sample data and matching degree data between sample tags. Illustratively, similarity data between target sample data and similar sample data is acquired, matching data between a first sample label and a second sample label is acquired, the matching data is mapped to matching weight data corresponding to the similar sample data, and label matching information is determined based on the matching weight data and the similarity data. The similarity data is determined by the similarity of data contents between the target sample data and the similar sample data, the matching degree data is determined by the similarity of labels between the first sample label and the second sample label, and the matching weight data is used for indicating the contribution condition of the similarity data to the label matching information under the matching degree data. Alternatively, the matching degree data between the sample tags may be the tag similarity or the tag identity (in the case where the similarity reaches 100%).
Because the label matching information corresponding to the sample data is determined from the similarity between the sample data and the matching degree of the corresponding labels, no additional standard data needs to be acquired to evaluate the label quality. The label quality is evaluated directly from within the sample data, which improves the efficiency of evaluating the label labeling quality.
In one example, when the matching degree data between the sample labels is the label similarity, the label similarity is normalized to a value between 0 and 1 and used as the matching weight data, and the obtained matching weight data is multiplied by the similarity data to determine the corresponding label matching information. Alternatively, the label similarity is mapped to matching weight data according to the interval to which the normalized label similarity belongs; for example, as shown in Table 1, each label similarity interval corresponds to one matching weight value, and the matching weight data is determined according to the interval into which the normalized label similarity falls.
Table 1
Label similarity interval    Matching weight data
[0, 0.25]                    0
(0.25, 0.75)                 0.5
[0.75, 1]                    1
In another example, when the matching degree data between the sample labels is the above-described label consistency, the matching weight data is mapped according to whether the sample labels are consistent with each other: when the first sample label is the same as the second sample label, the matching weight data is "1", and when the first sample label is different from the second sample label, the matching weight data is "0". The determined matching weight data and the similarity data are then multiplied to determine the corresponding label matching information.
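The two mappings from matching degree data to matching weight data, and the multiplication that yields label matching information, can be sketched as follows. The interval boundaries follow the table above; the function names are hypothetical.

```python
def weight_from_interval(label_similarity):
    """Map a normalized label similarity to a weight using the three intervals."""
    if not 0.0 <= label_similarity <= 1.0:
        raise ValueError("label similarity must be normalized to [0, 1]")
    if label_similarity <= 0.25:
        return 0.0   # interval [0, 0.25]
    if label_similarity < 0.75:
        return 0.5   # interval (0.25, 0.75)
    return 1.0       # interval [0.75, 1]

def weight_from_consistency(first_label, second_label):
    """Weight 1 when the two labels are identical, otherwise 0."""
    return 1.0 if first_label == second_label else 0.0

def label_matching_info(similarity, weight):
    """Label matching information: matching weight times similarity data."""
    return weight * similarity
```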
304: Determine the label labeling quality of the target sample data based on the label matching information.
In some embodiments, when the target sample data corresponds to a plurality of similar sample data, the label tagging quality corresponding to the target sample data is determined by the label matching information corresponding to each of the plurality of similar sample data.
In one example, the label labeling quality s_i corresponding to the target sample data is calculated through Formula I, where the target sample data is the i-th sample data in the sample data set, M is the number of similar sample data having a similarity relationship with the target sample data, sim(i, j) represents the similarity data between the target sample data and the j-th similar sample data, and w_ij represents the matching degree data between the target sample data and the j-th similar sample data. For example, when the matching degree data is the label consistency between the sample labels, w_ij = 1 when the first sample label and the second sample label are the same, and w_ij = 0 when the first sample label and the second sample label are different. That is, the label matching information corresponding to each similar sample data is summed and divided by the sum of the similarity data corresponding to all similar sample data, which finally determines the label labeling quality of the target sample data.
Formula I is as follows:
$$ s_i = \frac{\sum_{j=1}^{M} w_{ij}\,\mathrm{sim}(i,j)}{\sum_{j=1}^{M} \mathrm{sim}(i,j)} $$
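The calculation described for Formula I, the sum of the label matching information divided by the sum of the similarity data, can be sketched as below. The function name and the choice to return `None` when there are no similar samples (or all similarities are zero) are assumptions; the text does not say how that case is handled.

```python
def label_quality(neighbors):
    """Label labeling quality from (similarity, weight) pairs of the similar samples."""
    total_similarity = sum(sim for sim, _ in neighbors)
    if total_similarity == 0:
        return None  # assumption: quality is undefined without similar samples
    matched = sum(sim * weight for sim, weight in neighbors)
    return matched / total_similarity
```

For instance, with three similar samples of similarity 0.9, 0.8 and 0.6, where only the first two carry the same label as the target, the quality is (0.9 + 0.8) / (0.9 + 0.8 + 0.6), roughly 0.74.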
In some embodiments, when each sample data corresponds to only one sample label, that is, when the sample data belongs to a single classification task, the label labeling quality is determined by the label matching information corresponding to the unique label of the sample data. When the sample data corresponds to a plurality of sample labels, that is, when the sample data belongs to a multi-classification task, the label labeling quality may be determined by treating each classification as a single classification task to determine the label labeling quality corresponding to that classification, and then performing a weighted average of the label labeling qualities corresponding to the classifications according to the classification weight corresponding to each classification, so as to obtain the comprehensive label labeling quality.
In some embodiments, when the target sample data is labeled with at least two first sample labels, that is, in the case of a multi-classification task, the comprehensive label labeling quality corresponding to the target sample data is determined from the label labeling quality corresponding to each of the sample labels. Schematically, a label weight relationship between the at least two first sample labels is acquired, where the label weight relationship indicates the weight relationship between the classification tasks respectively corresponding to the at least two first sample labels; the label labeling qualities corresponding to the at least two first sample labels are then weighted and summed based on the label weight relationship to obtain the comprehensive label labeling quality of the target sample data. That is, for a multi-classification task, the comprehensive label labeling quality of the sample data over all labels can be determined according to the weight relationship among the different classification tasks, which improves the accuracy of the label labeling quality determination.
In one example, for the above classification weights, when the classifications are at the same level, that is, they classify the sample data from different angles, the weights corresponding to the different classifications are the same. In another example, when the classifications are hierarchical, that is, the label of each classification is in turn a refinement of the label of the previous classification (for example, the first classification label is "animal", the second is "mammal", the third is "feline", and the fourth is "cat"), the weight of each classification increases from top to bottom.
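The weighted combination of per-classification qualities can be sketched as follows. Dividing by the total weight (a weighted average) is an assumption made here so that the result stays in the same range as the individual qualities; the concrete weights in the usage note are illustrative only.

```python
def comprehensive_quality(qualities, weights):
    """Weighted average of the label labeling qualities of several classifications."""
    if not qualities or len(qualities) != len(weights):
        raise ValueError("need one weight per classification quality")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(q * w for q, w in zip(qualities, weights))
    return weighted / total_weight
```

With same-level classifications the weights could be `[1, 1]`; for the hierarchical "animal" to "cat" example, increasing weights such as `[1, 2, 3, 4]` would give the finest label the largest influence.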
Schematically, as shown in fig. 5, a schematic diagram of a label labeling quality determination process of target sample data provided in an exemplary embodiment of the present application is shown. Matching degree data A531 between the labels is determined from the first sample label of the target sample data 510 and the second sample label A of the similar sample data A521; the matching degree data A531 is then mapped to matching weight data A541, and label matching information A561 between the target sample data 510 and the similar sample data A521 is determined from the similarity data A551 between the target sample data 510 and the similar sample data A521 and the matching weight data A541. Similarly, matching degree data C532 between the labels is determined from the first sample label of the target sample data 510 and the second sample label C of the similar sample data C522; the matching degree data C532 is then mapped to matching weight data C542, and label matching information C562 between the target sample data 510 and the similar sample data C522 is determined from the similarity data C552 between the target sample data 510 and the similar sample data C522 and the matching weight data C542. Finally, the label labeling quality 570 corresponding to the target sample data is obtained by jointly calculating the similarity data A551, the similarity data C552, the label matching information A561, and the label matching information C562.
In some embodiments, after determining the tagging quality of the target sample data, it may be determined whether the sample data in the sample data set needs to be tagged again through set quality data corresponding to the sample data set.
Illustratively, set quality data of a sample data set is obtained based on label labeling quality of target sample data, in response to the fact that the set quality data meets set quality conditions, the sample data in the sample data set is input to a model to be trained, a training prediction result is output, iterative training is conducted on model parameters of the model to be trained based on a loss value between the training prediction result and a sample label of the sample data, a target model is obtained, and the target model is used for completing a classification task or a regression analysis task, namely the target model is obtained through supervised training.
Optionally, the set quality data may be obtained from the label labeling quality corresponding to all sample data in the sample data set; alternatively, the sample data set may be sampled to obtain a target quantity of target sample data, and the set quality data may be determined from the label labeling quality corresponding to that target quantity of target sample data. Optionally, the set quality condition may be a set quality threshold defined by the user.
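A minimal sketch of deriving set quality data and checking the set quality condition is given below. Averaging the per-sample qualities is an assumption, since the text does not state how the individual values are combined, and the threshold value 0.8 is purely illustrative.

```python
from statistics import mean

def set_quality(per_sample_quality):
    """Set quality data as the mean of the available per-sample qualities."""
    scores = [s for s in per_sample_quality.values() if s is not None]
    return mean(scores) if scores else None

def ready_for_training(per_sample_quality, set_quality_threshold=0.8):
    """True when the set quality data reaches the user-defined threshold."""
    quality = set_quality(per_sample_quality)
    return quality is not None and quality >= set_quality_threshold
```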
In one example, taking the target model as a text emotion classification model as an example, the sample data in the sample data set is sample text data, and the sample label of the sample data is used for indicating the emotion category of the text content of the sample text data. That is, the target model can classify the emotion category of the input target text content and output the text emotion corresponding to the target text content; for example, if the input target text content is "Today is really a happy day!" and text emotion recognition is performed by the target model, the output emotion category is "positive".
Illustratively, in the process of training a model to be trained through sample data of a sample data set, feature extraction is performed on sample text data through the model to be trained to obtain emotional feature representation, a training prediction result is determined from candidate emotional categories based on a matching relation between the emotional feature representation and the candidate emotional categories, iterative training is performed on model parameters of the model to be trained based on a loss value between the training prediction result and sample labels of the sample text data to obtain a text emotion classification model, and the text emotion classification model is used for identifying input target text content and determining the emotional category of the target text content.
Specifically, a target loss value between a training prediction result and a sample label of sample text data is obtained, iterative training is performed on model parameters of a model to be trained in response to failure of matching of the target loss value and a preset loss range, or a text emotion classification model is obtained in response to the fact that the target loss value meets the preset loss range, and the text emotion classification model is used for recognizing input target text content and determining emotion types of the target text content.
The target loss value is calculated by a preset loss function, and the preset loss function may be at least one of a perceptual loss function, an exponential loss function, a cross entropy loss function, and the like.
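As a worked illustration of one of the listed loss functions, the sketch below computes the cross entropy loss for a single sample and checks it against a preset loss range. Treating the preset loss range as an upper bound, and the bound 0.1 itself, are assumptions made for this example.

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Cross entropy loss of one sample given predicted class probabilities."""
    # Clamp to avoid log(0) when the model assigns zero probability.
    return -math.log(max(predicted_probs[true_index], 1e-12))

def should_stop(loss, max_loss=0.1):
    """True when the target loss value falls within the preset loss range."""
    return loss <= max_loss
```

A prediction of 0.99 for the labeled emotion category gives a loss of about 0.01, which would end training under this bound, while an uninformative 0.5 gives about 0.69 and training would continue.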
In another example, taking the target model as a road image recognition model as an example, the sample data in the sample data set is sample image data, and the sample label of the sample data is used for indicating the road condition category of the image content of the sample image data. That is, the target model can perform road condition recognition on the input target image content and output the road condition category corresponding to the target image content, and may be applied to fields such as automatic driving and intelligent transportation.
Illustratively, in the process of training a model to be trained through sample data of a sample data set, feature extraction is performed on sample image data through the model to be trained to obtain a road feature representation, a training prediction result is determined from candidate road condition categories based on a matching relation between the road feature representation and the candidate road condition categories, iterative training is performed on model parameters of the model to be trained based on a loss value between the training prediction result and sample labels of the sample image data to obtain a road image recognition model, and the road image recognition model is used for recognizing input target image content and determining road condition categories corresponding to the target image content.
Specifically, a target loss value between a training prediction result and a sample label of sample image data is obtained, iterative training is performed on model parameters of a model to be trained in response to failure of matching between the target loss value and a preset loss range, or a road image recognition model is obtained in response to the condition that the target loss value meets the preset loss range, and the road image recognition model is used for recognizing input target image content and determining a road condition category corresponding to the target image content.
It should be noted that, the above description only takes the target model as the text emotion classification model and the road image recognition model as examples, and the target model may be any model that can be obtained through supervised training, such as a facial recognition model, a medical image recognition model, a speech-to-text recognition model, and the like, and is not limited herein.
To sum up, according to the method for determining label labeling quality provided by the embodiment of the application, after the label labeling of the sample data is completed, when the labeled label needs to be subjected to quality evaluation, similar sample data having a similar relationship with the target sample data is obtained, and the label labeling quality corresponding to the target sample data is determined according to the label similarity between the sample label of the target sample data and the sample label of the similar sample data, so that the evaluation process of the label quality of the sample data is automatically realized, and the quality evaluation efficiency is improved.
Referring to fig. 6, a flowchart of a method for determining label labeling quality according to an exemplary embodiment of the present application is shown, in which an implementation process of a client side in a terminal device is schematically illustrated. The method comprises the following steps:
601: Display an interactive interface, where the interactive interface provides a label quality determination function.
The label quality determination function is used for determining the label labeling quality of the sample data in the sample data set.
In some embodiments, the interactive interface is an interface in the target application, and optionally, the interactive interface may be an interface in an independent application program, or may be a web interface displayed by a browser.
In the embodiment of the application, in the process of determining the label labeling quality, the terminal device acts as the front end, which uploads the sample data and displays the quality analysis result, and the server acts as the back end, which evaluates the label labeling quality of the sample data.
Illustratively, an upload control of a data file is provided in the interactive interface, and a user may determine a sample data set to be uploaded from a storage area of the terminal device through the upload control. Optionally, the user may select a data file path corresponding to the sample data set by triggering a pop-up window displayed by the upload control, or may input the data file path corresponding to the sample data set into the upload control.
In some embodiments, the data file may be a file in Tab-Separated Values (TSV) format, Comma-Separated Values (CSV) format, or another supported format.
In one example, taking the data file in the TSV format as an example, a sample of the data file is shown in Table 2, where each row is one sample data and is divided into three columns in total: the first column is the sample data ID, which may be any non-repeating character or character string; the second column is the label corresponding to the sample data, which is the labeled character string; the third column is the specific text content corresponding to the sample data, which is a character string of any length. The three columns are separated by tab (Tab) characters.
Table 2
Figure BDA0003465151050000181
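Reading a data file with the three-column TSV layout described above can be sketched with Python's standard `csv` module. The function name, the dictionary keys, and the sample rows in the usage note are hypothetical and are not taken from the table in the original publication.

```python
import csv
import io

def parse_data_file(tsv_text):
    """Parse rows of 'ID<Tab>label<Tab>text' into a list of dictionaries."""
    samples, seen_ids = [], set()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}")
        sample_id, label, text = row
        if sample_id in seen_ids:
            raise ValueError(f"duplicate sample ID: {sample_id}")
        seen_ids.add(sample_id)
        samples.append({"id": sample_id, "label": label, "text": text})
    return samples
```

For example, `parse_data_file("1\tpositive\tgreat film\n")` returns one record with ID "1", label "positive", and text "great film".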
Illustratively, the user may also set a sample similarity threshold through the interactive interface, where the sample similarity threshold is used to filter sample data having a similarity relationship from the sample data set. For example, if the threshold of the sample similarity uploaded by the terminal device is 0.75, the server determines, as similar sample data, sample data whose similarity between the sample data reaches 0.75 when performing processing.
602: Receive a data uploading operation in the interactive interface.
The data uploading operation is used for uploading a sample data set comprising target sample data to a server. In an example, the interactive interface further includes an evaluation control, when the evaluation control receives a trigger signal, it is determined that the data uploading operation is received, the terminal device generates a quality evaluation request according to the data file determined by the uploading control, and sends the quality evaluation request to the server, where the quality evaluation request includes the data file and the sample similarity threshold.
The sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label. Optionally, the architecture adopted between the terminal device and the server may be a C/S architecture or a B/S architecture, which is not limited herein.
603: Display the quality analysis result.
The quality analysis result is used for indicating the label labeling quality of the target sample data. The label labeling quality is determined by the server as follows: similar sample data meeting the similarity requirement is determined based on the data content similarity between the target sample data and the candidate sample data, and label matching information between the first sample label and the second sample label is then obtained. The similar sample data is correspondingly labeled with the second sample label, and the candidate sample data is sample data different from the target sample data in the sample data set. The process of determining the label labeling quality is the same as steps 301 to 304 and is not described herein again.
In the embodiment of the application, the terminal device displays the quality analysis result after receiving the quality analysis result.
Alternatively, the quality analysis result may be an analysis result of target sample data (an analysis result for a single sample data) or an analysis result of a sample data set (an analysis result for the entire sample data set).
Illustratively, taking the quality analysis result as an analysis result of target sample data as an example, the quality analysis result sent by the server to the terminal device includes a sample data identifier corresponding to the target sample data and a corresponding label marking quality; taking the quality analysis result as the analysis result of the sample data set as an example, the quality analysis result sent by the server to the terminal includes set quality data obtained by traversing the sample data in the sample data set as target sample data to respectively determine the label labeling quality and then performing comprehensive calculation.
In one example, the server sends the set quality data to the terminal device as a quality analysis result, and after the terminal device displays the set quality data in the interactive interface, the user can determine whether to further acquire the label marking quality corresponding to each sample data in the sample data set from the server according to the set quality data.
Schematically, fig. 7 shows a schematic diagram of a front-end interactive interface provided by an exemplary embodiment of the present application. The interactive interface 700 of the target application includes an upload control 711 for the data file; after the upload control 711 receives a click operation, the user may select the path of the data file 701 through a displayed popup window 720 to upload the file. The interactive interface 700 also includes an input control 712 corresponding to the similarity threshold, in which a default similarity threshold (0.75) is displayed; the user may indicate a different similarity threshold by modifying the data in the input control 712. The interactive interface 700 further includes an evaluation control 713. When the evaluation control 713 receives a click operation, the terminal device generates a corresponding quality evaluation request according to the uploaded data file and the similarity threshold and sends the quality evaluation request to the server. After the server completes the evaluation of the label labeling quality of the sample data set in the data file, the server returns the set quality data corresponding to the sample data set to the terminal device, and the terminal device displays the set quality data in a result display area 714 of the interactive interface 700.
When a user needs to download the specific result corresponding to each sample data in the sample data set, the user can request the complete quality analysis result from the server. The server generates a result file according to the label labeling quality corresponding to each sample data in the sample data set, and sends the result file to the terminal device.
In an example, the result file and the data file are files in the same format. Specifically, table three shows a result file example provided by an exemplary embodiment, where each row corresponds to the analysis result of one sample data and is divided into three columns: the first column is the sample data ID corresponding to the sample data in the data file, the second column is the similar sample data ID corresponding to the similar sample data, and the third column is the label labeling quality corresponding to the label of the sample data.
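Table three
[Table three is provided as an image in the original publication (Figure BDA0003465151050000201); its three columns are the sample data ID, the similar sample data ID, and the label labeling quality.]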
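As an illustration only, the three-column result file described above could be written and read back as in the following sketch. The tab-separated layout, the comma-joined list of similar IDs, and the function names `write_result_rows` and `read_result_rows` are assumptions made for this example; the application does not specify the delimiter or file encoding.

```python
import csv
import io


def write_result_rows(rows, stream):
    """Write (sample_id, similar_ids, quality) tuples as tab-separated rows.

    The tab delimiter and comma-joined similar IDs are assumed, not taken
    from the application.
    """
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for sample_id, similar_ids, quality in rows:
        writer.writerow([sample_id, ",".join(similar_ids), f"{quality:.2f}"])


def read_result_rows(stream):
    """Parse rows written by write_result_rows back into tuples."""
    parsed = []
    for sample_id, similar_ids, quality in csv.reader(stream, delimiter="\t"):
        ids = similar_ids.split(",") if similar_ids else []
        parsed.append((sample_id, ids, float(quality)))
    return parsed


if __name__ == "__main__":
    buffer = io.StringIO()
    write_result_rows([("1", ["2", "5"], 0.5), ("3", [], 1.0)], buffer)
    buffer.seek(0)
    print(read_result_rows(buffer))
```

Keeping one row per sample data mirrors the description of table three, so the terminal device can display the file content line by line.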
In an example, as shown in fig. 8, which illustrates a schematic diagram of a front-end interactive interface provided in another exemplary embodiment of the present application, a result downloading control 815 is included in the interactive interface 800, and after the interactive interface 800 displays the set quality data according to a user operation, a user may click the downloading control 815 to complete downloading of a result file, and store the result file in a specified path, or display file content 816 corresponding to the result file.
To sum up, the method for determining label labeling quality provided in the embodiment of the present application sends the sample data set to the server through the interactive interface after the sample data completes label labeling and when the labeled label needs to be quality-evaluated, and after the server completes determination of the label labeling quality of the sample data, returns the corresponding quality analysis result to the terminal device, and displays the result by the terminal device, thereby implementing visualization of the label labeling quality, enabling a user to conveniently and quickly obtain the label labeling quality corresponding to the sample data, and completing rechecking work of the label labeling.
Referring to fig. 9, a flowchart of a method for determining label quality according to an exemplary embodiment of the present application is shown, in which a data format of sample data is taken as a text format as an example, and a process of determining label quality is schematically illustrated in the embodiment of the present application. The method comprises the following steps:
901: receiving a quality evaluation request sent by terminal equipment, wherein the quality evaluation request comprises a data file and a sample similarity threshold, and the data file is used for storing a sample data set.
The quality evaluation request is used for requesting the server to evaluate according to the corresponding data file and the sample similarity threshold value.
In the embodiment of the application, the sample data set is provided by the terminal device. In some embodiments, the user saves the sample data set in a data file and uploads the data file to the server through the terminal device. Optionally, the architecture adopted between the terminal device and the server may be a C/S architecture or a B/S architecture, which is not limited herein.
In this embodiment, the terminal device further provides a sample similarity threshold to the server, where the sample similarity threshold is used to filter sample data having a similarity relationship from a sample data set. In one example, when the similarity data s between the target sample data and the candidate sample data is greater than or equal to the sample similarity threshold t, the candidate sample data is determined to be similar sample data.
902: in response to traversing the sample data set, acquiring target sample data from the sample data set.
Illustratively, the server parses the received data file to obtain a sample data set, where the sample data set includes sample data IDs and the corresponding sample content. In some embodiments, after the server parses out the sample data set, all data in the sample data set are added to a preset text library, where the preset text library is used to store the sample data in the sample data set so that the server can conveniently traverse the sample data in the sample data set.
In the embodiment of the application, the server traverses the sample data in the sample data set as target sample data to determine the label labeling quality corresponding to each sample data. When the server parses the data file, the arrangement order of the sample data in the data file is preserved, and the traversal is performed according to that arrangement order.
903: acquiring similarity data between the target sample data and the candidate sample data.
The similarity data is used for indicating the similarity of data contents between the target sample data and candidate sample data, and the candidate sample data is sample data different from the target sample data in the sample data set.
In the embodiment of the application, because the sample data in the sample data set is data in a text form, the semantic similarity between the target sample data and the candidate sample data can be calculated to serve as the similarity data, so that the screening process of the candidate sample data is realized.
Illustratively, a first feature representation corresponding to target sample data and a second feature representation corresponding to candidate sample data are obtained, and the similarity data are determined based on the difference situation between the first feature representation and the second feature representation.
The first feature representation and the second feature representation are obtained by inputting sample data to an encoder model and performing feature extraction, and the form of the feature representation may be a vector form or a matrix form, which is not limited herein. Optionally, the encoder model may be a pre-trained language model in the field of natural language processing, such as Bidirectional Encoder Representations from Transformers (BERT), a Generative Pre-Training model (GPT), a Robustly Optimized BERT Pretraining Approach (RoBERTa), or any other model capable of performing text semantic feature extraction.
In the embodiment of the present application, only text sample content is taken as an example for explanation. When the sample content is image content, the encoder model for feature extraction may be a pre-trained image model in the field of computer vision, for example, a Visual Geometry Group network (VGG) model, a Densely Connected Network (DenseNet) model, a SlowFast network (SlowFast) model, a Long Short-Term Memory network (LSTM) model, and the like. When the sample content is speech content, the encoder model for feature extraction may be a pre-trained speech model in the audio processing field, for example, a network model, an acoustic model, a language model, and the like obtained based on a Multi-Head Attention mechanism (MHA).
Optionally, after the first feature representation and the second feature representation are extracted, similarity data may be determined based on angle data between the first feature representation and the second feature representation; and/or determining similarity data based on distance data between the first feature representation and the second feature representation.
The angle data is used to indicate the included angle formed by the first feature representation and the second feature representation in the feature space. The similarity between the features can be determined from the angle data by calculating the cosine value (cosine similarity) of the included angle between the first feature representation (vector) and the second feature representation (vector), and the cosine value can be replaced by a sine value, a tangent value, and the like. The distance data is used to indicate the distance between the first feature representation and the second feature representation in the feature space. The similarity between the features can be determined from the distance data by calculating at least one feature distance among the Euclidean distance, Manhattan distance, Chebyshev distance, Hamming distance, Mahalanobis distance, and the like between the first feature representation (vector) and the second feature representation (vector).
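The angle-based and distance-based measures described above can be sketched as follows for vector-form feature representations. This is a minimal illustration, not the implementation of the application: the function names are hypothetical, and the mapping `1 / (1 + d)` from a distance to a similarity score is an assumption introduced here, since the application does not state how a distance is converted into similarity data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (angle data)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors (distance data)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_to_similarity(distance):
    """Assumed mapping: smaller distance gives similarity closer to 1."""
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    v_i, v_j = [1.0, 0.0], [0.6, 0.8]
    print(cosine_similarity(v_i, v_j))
    print(distance_to_similarity(euclidean_distance(v_i, v_j)))
```

Cosine similarity ignores vector magnitude, which is often convenient for encoder outputs, while a distance-based score is sensitive to magnitude.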
In one example, fig. 10 shows a schematic diagram of text similarity calculation provided by an exemplary embodiment of the present application. The text content t_i 1011 corresponding to the target sample data is input into the text encoder model 1001, which outputs the corresponding first feature representation v_i 1012; the text content t_j 1021 corresponding to the candidate sample data is input into the text encoder model 1001, which outputs the corresponding second feature representation v_j 1022; the first feature representation v_i 1012 and the second feature representation v_j 1022 are input into the similarity calculation module 1002, which finally outputs the similarity data s 1013 between the target sample data and the candidate sample data.
904: in response to the similarity data reaching a sample similarity threshold, the candidate sample data is determined to be similar sample data.
The server compares the similarity data between the target sample data and the candidate sample data with the sample similarity threshold. If the similarity data reaches the sample similarity threshold, the server determines that the target sample data and the candidate sample data have a similarity relation, and determines the candidate sample data as similar sample data.
The target sample data is correspondingly labeled with a first sample label, and the similar sample data is correspondingly labeled with a second sample label.
905: acquiring label matching information between the first sample label and the second sample label.
The label matching information is used to indicate the degree of matching between the first sample label and the second sample label.
In the embodiment of the application, the label matching information between the first sample label and the second sample label is determined by comparing the consistency between the first sample label and the second sample label. Illustratively, the consistency comparison result is mapped to matching weight data corresponding to the similar sample data, and the label matching information is determined from the matching weight data and the similarity data between the target sample data and the similar sample data. For example, when the first sample label is the same as the second sample label, the mapped matching weight data is "1", and the matching weight data is multiplied by the obtained similarity data, thereby determining the label matching information.
906: determining the label labeling quality of the target sample data based on the label matching information.
In the embodiment of the present application, the label labeling quality is determined by integrating the label matching information corresponding to all similar sample data having a similarity relation with the target sample data. Optionally, the label labeling quality may be the average value of the label matching information corresponding to all similar sample data, or may be obtained by summing the label matching information corresponding to each similar sample data and dividing the sum by the sum of the similarity data corresponding to all similar sample data.
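The second option above (sum of label matching information divided by the sum of similarity data) can be sketched as follows. The function names are hypothetical. The matching weight of 0 for differing labels is an assumption: the application only states that identical labels map to a weight of "1". The score of 1.0 when no similar sample data exists follows step 1106 of the fig. 11 flow.

```python
def label_matching_info(first_label, second_label, similarity):
    """Matching weight (1 if labels agree, assumed 0 otherwise) times similarity."""
    weight = 1.0 if first_label == second_label else 0.0
    return weight * similarity


def label_labeling_quality(target_label, similar_samples):
    """Similarity-weighted agreement between a target label and its neighbours.

    similar_samples is a list of (second_sample_label, similarity_data) pairs.
    """
    total_similarity = sum(s for _, s in similar_samples)
    if not similar_samples or total_similarity == 0.0:
        return 1.0  # no similar sample data: nothing contradicts the label
    total_match = sum(
        label_matching_info(target_label, label, s) for label, s in similar_samples
    )
    return total_match / total_similarity


if __name__ == "__main__":
    neighbours = [("positive", 0.9), ("negative", 0.8)]
    print(label_labeling_quality("positive", neighbours))
```

Under these assumptions the score lies between 0 and 1, and more similar neighbours contribute more to the result than less similar ones.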
907: in response to completion of the traversal of the sample data set, acquiring the label labeling quality corresponding to each sample data in the sample data set.
After the server finishes traversing all the sample data in the sample data set and obtains the corresponding label labeling quality, the server collates the label labeling quality corresponding to all the sample data, generates a result file, and stores the result file in the database in association with the set identifier corresponding to the sample data set.
908: taking the mean value of the label labeling quality corresponding to the sample data in the sample data set as the set quality data of the sample data set.
After the server completes the result file, set quality data corresponding to the sample data set needs to be generated, and the set quality data is used for evaluating, as a whole, the label labeling quality of the sample data in the sample data set.
In the embodiment of the application, the server takes the mean value of the label labeling quality corresponding to the sample data in the sample data set as the set quality data.
909: returning the set quality data to the terminal device.
The server returns the set quality data to the terminal device through the connection between the server and the terminal device, and the terminal device displays the set quality data in the target application after receiving it.
910: receiving a result file download request sent by the terminal device.
When a user needs to download the specific result corresponding to each sample data in the sample data set, the result file can be downloaded through the result file download request. The result file download request includes the set identifier corresponding to the currently specified sample data set, and the server queries the database with the set identifier to obtain the corresponding result file.
911: sending the result file to the terminal device.
The server transmits the result file to the terminal device through the connection between the server and the terminal device.
Referring to fig. 11, a flow chart of label labeling quality evaluation provided by an exemplary embodiment of the present application is shown, where the flow includes the following steps: loading a data file and a similarity threshold t (1101); adding all sample data to a text library (1102); setting i = 1 (1103); matching the target sample data of the ith row in the text library, and selecting sample data with similarity greater than t (1104); judging whether similar sample data exists (1105), and if not, executing (1106), and if yes, executing (1107); setting the labeling quality score s of the sample data label of the ith row to 1.0 (1106); calculating the labeling quality score s of the sample data of the ith row from all its similar sample data (1107); setting i = i + 1 (1108); judging whether i is greater than the number of rows of the data file (1109), and if so, executing (1110), and otherwise, executing (1104); calculating an overall labeling quality score (1110); outputting the overall labeling quality score (1111).
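The flow of fig. 11 can be sketched as a single loop. This is an illustrative sketch only: `jaccard_similarity` is a toy token-overlap measure standing in for the encoder-based semantic similarity, and all function names are hypothetical. The sketch uses the strict comparison "greater than t" of step 1104, although the earlier example under step 901 uses "greater than or equal to"; the weight of 0 for differing labels is likewise an assumption.

```python
def jaccard_similarity(text_a, text_b):
    """Toy stand-in for encoder-based similarity: token-set overlap."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def evaluate_labeling_quality(rows, threshold, similarity=jaccard_similarity):
    """rows is a list of (text, label); returns (per-row scores, overall score)."""
    scores = []
    for i, (text_i, label_i) in enumerate(rows):           # 1103, 1108, 1109
        similar = []
        for j, (text_j, label_j) in enumerate(rows):        # 1104
            if j == i:
                continue  # candidates differ from the target sample data
            s = similarity(text_i, text_j)
            if s > threshold:
                similar.append((label_j, s))
        if not similar:                                     # 1105 -> 1106
            scores.append(1.0)
        else:                                               # 1105 -> 1107
            matched = sum(s for label, s in similar if label == label_i)
            scores.append(matched / sum(s for _, s in similar))
    overall = sum(scores) / len(scores) if scores else 0.0  # 1110
    return scores, overall


if __name__ == "__main__":
    data = [
        ("great movie loved it", "positive"),
        ("great movie loved it too", "negative"),
        ("terrible weather today", "negative"),
    ]
    print(evaluate_labeling_quality(data, threshold=0.75))
```

The pairwise matching here costs on the order of the square of the number of rows, which is the cost that the clustering-based recall described later is intended to reduce.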
To sum up, according to the method for determining label labeling quality provided by the embodiment of the application, after the label labeling of the sample data is completed, when the labeled label needs to be subjected to quality evaluation, similar sample data having a similar relation with the target sample data is obtained, and the label labeling quality corresponding to the target sample data is determined according to the matching degree between the sample label of the target sample data and the sample label of the similar sample data, so that the evaluation process of the label quality of the sample data is automatically realized, and the quality evaluation efficiency is improved.
In the embodiment of the application, similar sample data corresponding to target sample data is determined by detecting the similarity between the sample data, then the label labeling quality is determined according to the consistency of the sample labels between the sample data and the similarity data, and the label labeling quality corresponding to each sample data in the sample data set is integrated to feed back set quality data corresponding to the sample data set to the terminal device, so that a user can determine whether to continue to obtain detailed data according to the set quality data. For example, when the set quality data meets the labeling quality requirement, the user does not need to acquire detailed data, and the amount of data transmitted between the server and the terminal device is reduced. In addition, after the server determines the label labeling quality corresponding to each sample data, a corresponding result file is generated, so that when the user needs a detailed evaluation result, the server can respond quickly, and the completeness of the data available to the user is improved.
Referring to fig. 12, a flowchart of a method for determining similar sample data according to an embodiment of the present application is shown. In the embodiment of the present application, the sample data in the sample data set is clustered in advance, which improves the retrieval efficiency when determining the similar sample data corresponding to target sample data. The method comprises the following steps:
1201: acquiring target sample data from the sample data set.
Optionally, the target sample data may be partial sample data in the sample data set, for example, sample data specified by the terminal device whose label labeling quality needs to be evaluated; or, the target sample data may be the sample data for which the server needs to perform label labeling quality evaluation at the current time while the server traverses the sample data in the sample data set.
1202: determining the recall sample data corresponding to the target sample data from the candidate sample data based on the clustering of the sample data in the sample data set.
In the embodiment of the application, before label labeling quality evaluation is performed on target sample data, clustering processing is performed on the sample data in the sample data set, and the sample data in the sample data set is segmented according to a certain sample similarity to obtain a plurality of sample clusters, so that the sample similarity between the sample data in the same sample cluster is as large as possible, the sample similarity corresponding to the sample data in different sample clusters is as small as possible, the similar sample data is aggregated as much as possible, and the different sample data is discrete as much as possible.
Optionally, the clustering method selected when clustering the sample data set includes at least one of a partition-based clustering method (Partition-Based Method), a density-based clustering method (Density-Based Method), a hierarchical clustering method (Hierarchical Method), and the like. Specifically, partition-based clustering methods include the k-means clustering algorithm (K-Means), k-means++, the bisecting k-means clustering algorithm (Bisecting K-Means), the kernel k-means clustering algorithm (Kernel K-Means), and the like; density-based clustering methods include Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS), and the like; hierarchical clustering methods include bottom-up clustering (Agglomerative Clustering), top-down clustering (Divisive Clustering), and the like.
Schematically, taking the implementation of clustering by using a k-means algorithm as an example for explanation, the process of clustering sample data in a sample data set includes:
(1) performing feature extraction with an encoder model to obtain a sample feature vector corresponding to each sample data in the sample data set, and storing the sample feature vectors in memory; (2) randomly creating k points as initial cluster center points, where k is a positive integer; (3) calculating the distance from each sample data to the k cluster center points, assigning the sample data to the cluster whose center point is nearest, and iterating n times; (4) in each iteration, updating the cluster center point of each cluster, for example by taking the mean value of the cluster's members; (5) after the k cluster center points have been iteratively updated through steps (3) and (4), if the change in their positions is very small (which can be judged against a preset position threshold), a stable state is considered to have been reached and the iteration ends. The value of k may be indicated by the terminal device or preset by the system; alternatively, if the number of categories to which the sample data in the sample data set belongs is known, the number of categories is used as the value of k.
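Steps (2) to (5) above can be sketched in plain Python as follows, assuming the feature vectors from step (1) are already available as lists of floats. This is a minimal sketch rather than the implementation of the application: the function name and parameters are hypothetical, and the initial center points are drawn from the sample vectors themselves, which is one common way to realize "randomly creating k points".

```python
import math
import random


def kmeans(vectors, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster feature vectors; returns (center points, cluster index per vector)."""
    rng = random.Random(seed)
    # (2) pick k distinct sample vectors as the initial cluster center points
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(max_iter):
        # (3) assign every vector to the cluster with the nearest center point
        for idx, vec in enumerate(vectors):
            assignments[idx] = min(
                range(k), key=lambda c: math.dist(vec, centers[c])
            )
        # (4) move each center point to the mean of its members
        largest_shift = 0.0
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if not members:
                continue  # keep the old center point for an empty cluster
            new_center = [sum(col) / len(members) for col in zip(*members)]
            largest_shift = max(largest_shift, math.dist(new_center, centers[c]))
            centers[c] = new_center
        # (5) stop once the center points have effectively stopped moving
        if largest_shift < tol:
            break
    return centers, assignments


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    print(kmeans(points, k=2))
```

Here `tol` plays the role of the preset position threshold and `max_iter` bounds the number of iterations; `math.dist` requires Python 3.8 or later.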
Referring to fig. 13, which shows a schematic diagram of a sample data clustering process provided in an exemplary embodiment of the present application, a sample data set 1300 includes feature vectors corresponding to m sample data (m is a positive integer). Four initial cluster center points are randomly set in the feature space corresponding to the sample data set, including cluster center point A 1301, cluster center point B 1302, cluster center point C 1303, and cluster center point D 1304. After iterative clustering, the positions of the four initial cluster center points change, and the feature vectors included in sample cluster A, sample cluster B 1320, sample cluster C 1330, and sample cluster D 1340, which correspond to the respective cluster center points, also change. After multiple rounds of iterative clustering are completed, four sample clusters are finally obtained.
And explaining the process of screening the candidate sample data to obtain the recalled sample data by combining the clustering process. Illustratively, sample data in the sample data set is clustered to obtain a sample cluster set, the sample cluster set includes a target sample cluster, and cluster center similarity between the target sample data and the target sample cluster is obtained, wherein the cluster center similarity is determined by data content similarity between the target sample and the cluster center sample data, the cluster center sample data is used for indicating a cluster center point of the target sample cluster, and the sample data in the target sample cluster is determined as recall sample data in response to the cluster center similarity corresponding to the target sample cluster meeting a cluster similarity condition. In an example, the clustering similarity condition may be used to indicate that when the cluster center similarity corresponding to the target sample cluster is the highest among the cluster center similarities corresponding to all the sample clusters, the target sample cluster may be determined as the sample cluster to which the target sample data belongs, and the sample data in the target sample cluster is used as the recall sample data corresponding to the target sample data.
1203: determining similar sample data based on the data content similarity between the target sample data and the recall sample data.
After the recall sample data is determined, similar sample data corresponding to the target sample data is determined according to the data content similarity between the target sample data and the recall sample data. Illustratively, similarity data between the target sample data and the recall sample data is calculated, and when the similarity data reaches a sample similarity threshold (for example, the sample similarity threshold t uploaded by the terminal device), the recall sample data is determined to be similar sample data.
Referring to fig. 14, which shows a schematic diagram of determining similar sample data through a cluster index provided in an exemplary embodiment of the present application, a sample data set 1400 includes feature vectors corresponding to m sample data (m is a positive integer), which are divided into k sample clusters (k is a positive integer), for example, sample cluster A 1410, sample cluster B 1420, sample cluster C 1430, sample cluster D 1440, and so on. Similarity calculation is performed between the target feature vector corresponding to the target sample data 1401 and the feature vector corresponding to the cluster center sample data of each sample cluster, and the target cluster center sample data 1411 with the highest similarity is determined. The target cluster center sample data 1411 is the cluster center sample data of sample cluster A 1410; accordingly, the sample data in sample cluster A 1410 is used as recall sample data, and the similarity between the feature vector corresponding to the target sample data 1401 and the feature vector corresponding to each sample data in sample cluster A 1410 is calculated respectively. Thus, the n similar sample data 1402 (n is a positive integer) in sample cluster A 1410 that reach the sample similarity threshold are screened out.
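The two-stage lookup of fig. 14 can be sketched as follows, assuming the clustering has already produced a center vector and a member list for every sample cluster. The data layout, the function names, and the use of cosine similarity for both stages are assumptions made for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_similar_by_cluster(target_id, target_vec, clusters, threshold):
    """Recall one cluster by its center, then filter its members by threshold.

    clusters is a list of (center_vector, [(sample_id, vector), ...]).
    Returns [(sample_id, similarity), ...] for the similar sample data.
    """
    # Stage 1: the cluster whose center is most similar supplies the recall set
    _, recalled = max(clusters, key=lambda c: cosine_similarity(target_vec, c[0]))
    # Stage 2: pairwise comparison only against the recalled sample data
    similar = []
    for sample_id, vec in recalled:
        if sample_id == target_id:
            continue  # candidate sample data must differ from the target
        s = cosine_similarity(target_vec, vec)
        if s >= threshold:
            similar.append((sample_id, s))
    return similar


if __name__ == "__main__":
    clusters = [
        ([1.0, 0.0], [("a", [1.0, 0.1]), ("b", [0.6, 0.8])]),
        ([0.0, 1.0], [("c", [0.1, 1.0])]),
    ]
    print(find_similar_by_cluster("t", [1.0, 0.0], clusters, threshold=0.75))
```

Only the k cluster centers and the members of one cluster are compared with the target, instead of all m sample data; the trade-off is that a similar sample assigned to a neighbouring cluster would be missed.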
In summary, the method for determining similar sample data provided in the embodiment of the present application performs clustering on the sample data in the sample data set in advance to determine the recall sample data belonging to the same sample cluster as the target sample data in advance from the candidate sample data, and then performs pairwise matching on the target sample data and the recall sample data to perform similar sample data screening, so as to improve the efficiency of similar sample screening, reduce the data processing amount, and have good applicability in a scene with a high sample data amount.
Referring to fig. 15, a flowchart of a method for generating a contradictory sample library according to an embodiment of the present application is shown, in which after a process of determining a tag annotation quality of sample data is completed, a function of screening the contradictory samples may be further provided. The method comprises the following steps:
1501: acquiring a label quality condition.
Optionally, the label quality condition may be uploaded by the terminal device, or may be preset by the system. In some embodiments, when the label quality condition is implemented in the form of a threshold, the label quality threshold corresponding to the label quality condition and the sample similarity threshold may use the same value. For example, when the sample similarity threshold t indicated by the terminal device is 0.75, the server may determine that the label quality threshold p corresponding to the label quality condition is also 0.75.
The label quality condition is used for screening contradictory sample data from the sample data set, where contradictory sample data refers to sample data that are similar to each other but whose labeled labels differ or whose label similarity is too low.
1502: traversing the sample data in the sample data set as target sample data, and acquiring the label labeling quality corresponding to the target sample data.
In the embodiment of the application, after all sample data in the sample data set completes the process of determining the label labeling quality, the server generates a corresponding result file, and the result file comprises the sample data ID of the target sample data, the similar sample data ID and the label labeling quality. When a contradiction sample library needs to be generated, the server traverses the result file to obtain the label labeling quality corresponding to the target sample data, wherein the contradiction sample library is used for storing the sample data with similarity relation and label contradiction relation, and the label contradiction relation can be used for indicating that the sample labels of the sample data are different or the label similarity between the sample labels is lower than a certain threshold value.
In some embodiments, the generation of the contradictory sample library may also be performed in parallel with the process of determining the label marking quality by the sample data in the sample data set, for example, after the target sample data determines the similar sample data and the corresponding label marking quality, it may be determined whether to add the target sample data and the corresponding similar sample data to the contradictory sample library according to a comparison condition of the label marking quality and the label quality condition.
1503: in response to the successful matching of the label matching information of the target sample data and the label quality condition, determining the target sample data and the similar sample data as matching sample data.
The matching sample data is used to indicate that the sample labels among similar sample data meet a consistency requirement, where the consistency requirement may mean that the sample labels are completely consistent, or that the similarity among the sample labels meets a certain threshold. For example, the sample labels of a plurality of sample data expressing positive emotions are all "positive".
The server matches the label matching information of the target sample data against the label quality condition. If the matching succeeds, the target sample data and the corresponding similar sample data do not belong to the contradictory sample data; if the matching fails, the target sample data and the corresponding similar sample data belong to the contradictory sample data.
In one example, the label quality threshold p corresponding to the label quality condition is 0.75, when the label labeling quality corresponding to the target sample data is greater than or equal to 0.75, the target sample data is skipped, and when the label labeling quality corresponding to the target sample data is less than 0.75, the target sample data and the similar sample data corresponding to the target sample data are placed in the contradictory sample library.
1504: in response to the failure of matching of the label matching information of the target sample data and the label quality condition, generating a contradictory sample library based on the target sample data and the similar sample data.
In some embodiments, the contradictory sample library includes at least one contradictory sample set, sample data placed in the contradictory sample library needs to satisfy a duplicate removal screening condition, illustratively, a target sample set is generated according to target sample data and similar sample data, and in response to that the target sample set satisfies the duplicate removal screening condition in the contradictory sample library, the target sample set is saved in the contradictory sample library as the contradictory sample set, and the duplicate removal screening condition is used for removing duplicates of the sample data placed in the contradictory sample library.
Optionally, the duplication elimination screening condition may indicate that, if a second sample label corresponding to similar sample data in the target sample set is the same as the first sample label (or the similarity between the labels is higher than a preset threshold), the similar sample data is removed from the target sample set, and the target sample set after duplication elimination is added to the contradictory sample library.
Optionally, the duplication elimination screening condition may indicate that, if a sample set that coincides with the target sample set is not included in the contradictory sample libraries, the target sample set is added to the contradictory sample libraries as the contradictory sample set, where the coincidence may be partial coincidence or full coincidence.
In one example, contradictory sample data in the contradictory sample library is stored in the form of contradictory sample sets: when the label labeling quality corresponding to the target sample data does not meet the label quality condition, the target sample data and the similar sample data corresponding to the target sample data are added to the contradictory sample library as a contradictory sample set. That is, the contradictory sample data is displayed to the user in units of contradictory sample sets, so that the user can perform label relabeling within a small range (taking the contradictory sample set as a unit) and can label sample data having a similarity relation more accurately.
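The screening and duplicate removal described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and data layout are hypothetical, the label quality condition is taken to be a threshold p, and both optional duplicate-removal rules are applied, with "coincidence" interpreted as full coincidence of the sample sets.

```python
def build_contradictory_library(results, labels, quality_threshold):
    """Collect contradictory sample sets from per-sample quality results.

    results: list of (sample_id, [similar_ids], label_labeling_quality)
    labels:  dict mapping sample_id to its sample label
    """
    library = []
    seen = set()
    for sample_id, similar_ids, quality in results:
        if quality >= quality_threshold:
            continue  # label quality condition met: skip this target
        # Rule 1 (assumed applied): drop similar samples whose label agrees
        conflicting = [
            sid for sid in similar_ids if labels[sid] != labels[sample_id]
        ]
        sample_set = frozenset([sample_id, *conflicting])
        # Rule 2 (assumed applied): skip a set that fully coincides with one
        # already stored in the library
        if sample_set in seen:
            continue
        seen.add(sample_set)
        library.append(sorted(sample_set))
    return library


if __name__ == "__main__":
    labels = {"1": "M", "2": "N", "4": "M"}
    results = [("1", ["2"], 0.0), ("2", ["1"], 0.0), ("4", [], 1.0)]
    print(build_contradictory_library(results, labels, quality_threshold=0.75))
```

In the usage example, samples "1" and "2" each flag the other, so without the second rule the same pair would be stored twice.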
In some embodiments, the server may further return contradictory sample proportion data to the terminal device. The contradictory sample proportion data also reflects the labeling quality of the sample data set and may be defined by formula two. The lower the contradictory sample proportion, the higher the labeling quality of the sample data set, and vice versa.
Formula two: contradictory sample proportion = (number of contradictory samples / total number of samples) × 100%
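Formula two is a plain ratio expressed as a percentage. A minimal sketch, in which the function and parameter names are illustrative only:

```python
def contradictory_sample_proportion(num_contradictory: int, num_total: int) -> float:
    """Return the contradictory sample proportion as a percentage (formula two)."""
    if num_total <= 0:
        raise ValueError("total number of samples must be positive")
    if not 0 <= num_contradictory <= num_total:
        raise ValueError("number of contradictory samples must lie in [0, total]")
    return 100.0 * num_contradictory / num_total
```

For instance, 5 contradictory samples in a set of 200 give a proportion of 2.5%.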
Please refer to fig. 16, which illustrates a flowchart of contradictory sample re-calibration provided by an exemplary embodiment of the present application. After the server 1610 generates the contradictory sample library 1601 corresponding to the sample data set, it returns the contradictory sample library 1601 to the terminal device 1620, and the terminal device 1620 displays the contradictory sample sets in the contradictory sample library 1601. When the user selects the target contradictory sample set 1602 in the contradictory sample library 1601, the terminal device 1620 displays the contradictory sample data A 1603, the contradictory sample data B 1604, the contradictory sample data C 1605, and the contradictory sample data D 1606 in the target contradictory sample set 1602, together with their corresponding sample labels. Here, the sample labels of the contradictory sample data A 1603 and B 1604 are both label M, the sample label of the contradictory sample data C 1605 is label N, and the sample label of the contradictory sample data D 1606 is label O. The user can recheck the labels according to the displayed contradictory sample data and modify them according to the recheck result, for example, by modifying the sample labels of the contradictory sample data C 1605 and D 1606 to label M.
In summary, in the method for generating the contradictory sample library provided by the embodiments of the present application, generating the contradictory sample library provides a further evaluation standard for the label labeling quality of the sample data set. The generated contradictory sample library can be provided to the terminal device so that the user can review sample labels according to it, which narrows the manual review range, reduces the review workload, avoids the waste of human resources caused by full re-labeling, and further improves the efficiency of optimizing the labeled labels.
It is understood that, in the embodiments of the present application, if the sample data involves user-related data, user permission or consent needs to be obtained when the above embodiments are applied to specific products or technologies, and the collection, use, and processing of the related sample data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Referring to fig. 17, a block diagram of a device for determining label quality according to an exemplary embodiment of the present application is shown, where the device includes the following modules:
an obtaining module 1710, configured to obtain target sample data from a sample data set, where the sample data in the sample data set is correspondingly labeled with a sample tag, and the target sample data is correspondingly labeled with a first sample tag;
a determining module 1720, configured to determine, based on data content similarity between the target sample data and candidate sample data, similar sample data that meets a similarity requirement with the target sample data from the candidate sample data, where the similar sample data is correspondingly labeled with a second sample tag, and the candidate sample data is sample data in the sample data set that is different from the target sample data;
the obtaining module 1710, further configured to determine, based on the label similarity between the first sample label and the second sample label, label matching information between the first sample label and the second sample label;
the determining module 1720 is further configured to determine, based on the tag matching information, a tag tagging quality of the target sample data.
In some optional embodiments, as shown in fig. 18, the determining module 1720 further includes:
a first obtaining unit 1721, configured to obtain similarity data between the target sample data and the candidate sample data, where the similarity data is used to indicate a similarity of data contents between the target sample data and the candidate sample data;
a first determining unit 1722, configured to determine, in response to that the similarity data reaches a sample similarity threshold, the candidate sample data as the similar sample data, where the sample similarity threshold is used to screen out, from the candidate sample data, sample data having the similarity relationship with the target sample data.
In some optional embodiments, the first obtaining unit 1721 is further configured to obtain a first feature representation corresponding to the target sample data and a second feature representation corresponding to the candidate sample data;
the first determining unit 1722 is further configured to determine the similarity data based on angle data between the first feature representation and the second feature representation, where the angle data is used to indicate an included angle condition formed by the first feature representation and the second feature representation in a feature space; alternatively, the similarity data is determined based on distance data between the first feature representation and the second feature representation, the distance data being indicative of a distance situation of the first feature representation and the second feature representation in the feature space.
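The two alternatives above (angle data and distance data) can be sketched as follows. Cosine similarity is used here as one common way to turn the included angle into a score, and the mapping 1 / (1 + d) from Euclidean distance to a similarity is an assumption made for illustration; the text does not fix either choice. Feature representations are assumed to be plain lists of floats of equal length.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Similarity from the included angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def distance_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Similarity from Euclidean distance; 1 / (1 + d) is an assumed mapping."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)
```

Either score can then be compared against the sample similarity threshold to decide whether a candidate counts as similar sample data.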
In some optional embodiments, the obtaining module 1710 further includes:
a second obtaining unit 1711, configured to obtain similarity data between the target sample data and the similar sample data, where the similarity data is determined by data content similarity between the target sample data and the similar sample data;
the second obtaining unit 1711 is further configured to obtain matching degree data between the first sample label and the second sample label, where the matching degree data is determined by label similarity between the first sample label and the second sample label;
a mapping unit 1712, configured to map the matching degree data to matching weight data corresponding to the similar sample data, where the matching weight data is used to indicate a contribution condition of the similarity data to the tag matching information under the matching degree data;
a second determining unit 1713 configured to determine the tag matching information based on the matching weight data and the similarity data.
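One possible reading of the units above is a similarity-weighted vote over the similar sample data. The sketch below is an assumption, not the formula of the present application: it maps a matching degree at or above a hypothetical `match_threshold` to a weight of +1 and anything lower to -1, then normalizes by the total similarity so that the result lies in [-1, 1].

```python
from typing import List, Tuple


def label_matching_info(neighbors: List[Tuple[float, float]],
                        match_threshold: float = 0.5) -> float:
    """Aggregate label agreement over similar samples.

    neighbors: one (similarity, matching_degree) pair per similar sample.
    Assumed mapping: matching_degree >= match_threshold -> weight +1, else -1.
    Returns a value in [-1, 1]; higher means the labels agree more.
    """
    total_similarity = sum(sim for sim, _ in neighbors)
    if total_similarity == 0.0:
        return 0.0  # assumption: no usable neighbors yields a neutral score
    weighted = 0.0
    for sim, degree in neighbors:
        weight = 1.0 if degree >= match_threshold else -1.0
        weighted += weight * sim
    return weighted / total_similarity
```

With this mapping, a highly similar sample carrying a conflicting label pulls the score down more than a weakly similar one, which matches the stated role of the matching weight data.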
In some optional embodiments, the determining module 1720 further comprises:
a screening unit 1723, configured to determine, based on a clustering condition between sample data in the sample data set, recall sample data corresponding to the target sample data from the candidate sample data, where the recall sample data and the target sample data are sample data that belong to the same sample cluster after being clustered;
the first determining unit 1722 is further configured to determine the similar sample data based on a data content similarity between the target sample data and the recall sample data.
In some optional embodiments, the determining module 1720 further comprises:
a clustering unit 1724, configured to cluster sample data in the sample data set to obtain a sample cluster set, where the sample cluster set includes a target sample cluster;
the first obtaining unit 1721 is further configured to obtain a cluster center similarity between the target sample data and the target sample cluster, where the cluster center similarity is determined by a data content similarity between the target sample data and cluster center sample data, and the cluster center sample data is used to indicate a cluster center point of the target sample cluster;
the screening unit 1723 is further configured to determine, in response to that the cluster center similarity corresponding to the target sample cluster meets a cluster similarity condition, sample data in the target sample cluster as the recall sample data.
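The cluster-based recall described above can be sketched as follows, assuming clustering has already been performed and cluster center vectors are available. The cosine comparison, the threshold form of the cluster similarity condition, and all names (`recall_candidates`, `min_center_similarity`) are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_candidates(target_id: str,
                      features: Dict[str, Sequence[float]],
                      clusters: Dict[str, List[str]],
                      centers: Dict[str, Sequence[float]],
                      min_center_similarity: float) -> List[str]:
    """Return IDs of samples in clusters whose center is close to the target.

    clusters maps a cluster ID to its member sample IDs; centers maps a
    cluster ID to its center vector. The target itself is never recalled.
    """
    target_vec = features[target_id]
    recalled: List[str] = []
    for cluster_id, member_ids in clusters.items():
        if _cosine(target_vec, centers[cluster_id]) >= min_center_similarity:
            recalled.extend(m for m in member_ids if m != target_id)
    return recalled
```

Only the recalled samples then need a pairwise similarity comparison with the target, which avoids comparing the target against every candidate in the sample data set.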
In some optional embodiments, the obtaining module 1710 is further configured to traverse sample data in the sample data set as the target sample data, and obtain label labeling qualities corresponding to the sample data in the sample data set, respectively;
the determining module 1720 is further configured to use the mean value of the label labeling qualities respectively corresponding to the sample data in the sample data set as the set quality data of the sample data set.
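The traversal and averaging can be sketched as follows. The callback `quality_of`, which returns the label labeling quality of one sample, is a hypothetical stand-in for the per-sample evaluation described earlier.

```python
from typing import Callable, Iterable


def set_quality(sample_ids: Iterable[str],
                quality_of: Callable[[str], float]) -> float:
    """Treat each sample in turn as the target and average the qualities."""
    scores = [quality_of(sample_id) for sample_id in sample_ids]
    if not scores:
        raise ValueError("sample data set is empty")
    return sum(scores) / len(scores)
```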
In some optional embodiments, the apparatus further comprises:
a generating module 1730, configured to, in response to the label labeling quality failing to match the label quality condition, generate a contradictory sample library based on the target sample data and the similar sample data, where the contradictory sample library is used to store sample data that has the similar relationship and a labeling contradiction relationship.
In some optional embodiments, at least one contradictory sample set is included in the contradictory sample library;
the generating module 1730 further includes:
a generating unit 1731, configured to generate a target sample set according to the target sample data and the similar sample data;
a duplicate removal unit 1732, configured to, in response to that the target sample set meets a duplicate removal screening condition, save the target sample set as the contradictory sample set in the contradictory sample library, where the duplicate removal screening condition is used to remove duplicate samples put in the contradictory sample library.
In some optional embodiments, when the target sample data is labeled with at least two first sample tags, the first obtaining unit 1721 is further configured to obtain a tag weight relationship between the at least two first sample tags;
the first determining unit 1722 is further configured to perform weighted summation on the label labeling quality corresponding to the at least two first sample labels based on the label weight relationship, so as to obtain a comprehensive label labeling quality of the target sample data.
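For sample data carrying several first sample labels, the weighted summation can be sketched as follows. Normalizing by the total weight is an assumption that keeps the comprehensive quality on the same scale as the per-label qualities; the label names in the usage note are invented for illustration.

```python
from typing import Dict


def combined_label_quality(quality_by_label: Dict[str, float],
                           weight_by_label: Dict[str, float]) -> float:
    """Weighted sum of per-label qualities, normalized by the total weight."""
    total_weight = sum(weight_by_label[label] for label in quality_by_label)
    if total_weight <= 0.0:
        raise ValueError("label weights must sum to a positive value")
    weighted = sum(quality_by_label[label] * weight_by_label[label]
                   for label in quality_by_label)
    return weighted / total_weight
```

For example, per-label qualities of 0.8 ("topic") and 0.4 ("tone") with weights 3 and 1 give a comprehensive quality of 0.7.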
In some optional embodiments, the determining module 1720 is further configured to obtain set quality data of the sample data set based on a label labeling quality of the target sample data;
the device further comprises:
a training module 1740, configured to, in response to that the set quality data meets a set quality condition, input sample data in the sample data set to a model to be trained, and output a training prediction result;
the training module 1740 is further configured to perform iterative training on the model parameters of the model to be trained based on the loss value between the training prediction result and the sample label of the sample data to obtain a target model, where the target model is used to complete a classification task or a regression analysis task.
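The quality gate followed by iterative training can be sketched with a deliberately tiny model. The sketch assumes a one-dimensional linear regression trained by gradient descent on mean squared error, a threshold form of the set quality condition (`min_set_quality`), and a stop rule of the loss falling at or below `max_loss`; none of these specifics come from the present application, and all names are hypothetical.

```python
from typing import List, Optional, Tuple


def train_if_quality_ok(samples: List[Tuple[float, float]],
                        set_quality: float,
                        min_set_quality: float = 0.8,
                        max_loss: float = 1e-4,
                        learning_rate: float = 0.05,
                        max_steps: int = 10_000) -> Optional[Tuple[float, float]]:
    """Train y = w * x + b only if the set quality condition is met.

    samples: (input, label) pairs. Returns (w, b), or None when the
    sample data set fails the quality gate and training is skipped.
    """
    if set_quality < min_set_quality:
        return None
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(max_steps):
        preds = [w * x + b for x, _ in samples]
        # Loss value between the training prediction result and the labels.
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, samples)) / n
        if loss <= max_loss:
            break
        grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, samples)) / n
        grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, samples)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

Gating on set quality first means a poorly labeled sample data set is sent back for review rather than being used to fit a model to contradictory labels.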
In some optional embodiments, when the target model is a road image recognition model, sample data in the sample data set is sample image data, and a sample label of the sample data is used for indicating a road condition category of image content of the sample image data;
the training module 1740 is further configured to perform iterative training on the model parameters of the model to be trained based on a loss value between the training prediction result and the sample label of the sample image data to obtain the road image recognition model, where the road image recognition model is configured to recognize input target image content and determine a road condition category corresponding to the target image content.
In some optional embodiments, when the target model is a text emotion classification model, sample data in the sample data set is sample text data, and a sample label of the sample data is used for indicating an emotion category of text content of the sample text data;
the training module 1740, further comprising:
a third obtaining unit 1741, configured to obtain a target loss value between the training prediction result and a sample label of the sample text data;
a training unit 1742, configured to perform iterative training on the model parameters of the model to be trained in response to the target loss value failing to match a preset loss range; or, in response to the target loss value falling within the preset loss range, obtain the text emotion classification model, where the text emotion classification model is used for identifying input target text content and determining the emotion category of the target text content.
To sum up, after label labeling of the sample data is completed and the labeled labels need quality evaluation, the device for determining label labeling quality provided in the embodiments of the present application obtains similar sample data having a similar relationship with the target sample data, and determines the label labeling quality corresponding to the target sample data according to the label similarity between the sample label of the target sample data and the sample label of the similar sample data. The label quality of the sample data is thereby evaluated automatically, which improves quality evaluation efficiency.
Referring to fig. 19, a block diagram of a device for determining label quality according to an exemplary embodiment of the present application is shown, where the device includes the following modules:
a display module 1910, configured to display an interactive interface, where the interactive interface is provided with a function for determining label labeling quality;
a receiving module 1920, configured to receive a data uploading operation in the interactive interface, where the data uploading operation is used to upload the sample data set including target sample data to a server, where sample data in the sample data set is correspondingly labeled with a sample tag, and the target sample data is correspondingly labeled with a first sample tag;
the display module 1910 is further configured to display a quality analysis result, where the quality analysis result is used to indicate a label tagging quality of the target sample data, where the label tagging quality is determined by the server based on a data content similarity between the target sample data and candidate sample data, and is determined by obtaining label matching information between the first sample label and a second sample label after determining similar sample data meeting a similarity requirement, where the similar sample data is correspondingly tagged with the second sample label, and the candidate sample data is sample data in the sample data set that is different from the target sample data.
To sum up, the device for determining label labeling quality provided in the embodiment of the present application sends the sample data set to the server through the interactive interface after the sample data completes label labeling and when the labeled label needs to be subjected to quality evaluation, and after the server completes determination of the label labeling quality of the sample data, returns the corresponding quality analysis result to the terminal device, and displays the result by the terminal device, thereby realizing visualization of the label labeling quality, enabling a user to conveniently and quickly obtain the label labeling quality corresponding to the sample data, and completing rechecking work of the label labeling.
It should be noted that the apparatus for determining label quality provided in the foregoing embodiments is illustrated only by the above division of functional modules. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for determining label quality and the method for determining label quality provided in the foregoing embodiments belong to the same concept; the specific implementation processes are described in detail in the method embodiments and are not repeated here.
Fig. 20 is a schematic structural diagram of a server according to an exemplary embodiment of the present application. Specifically, the structure includes the following.
The server 2000 includes a Central Processing Unit (CPU) 2001, a system Memory 2004 including a Random Access Memory (RAM) 2002 and a Read Only Memory (ROM) 2003, and a system bus 2005 connecting the system Memory 2004 and the CPU 2001. The server 2000 also includes a mass storage device 2006 for storing an operating system 2013, application programs 2014, and other program modules 2015.
The mass storage device 2006 is connected to the central processing unit 2001 through a mass storage controller (not shown) connected to the system bus 2005. The mass storage device 2006 and its associated computer-readable media provide non-volatile storage for the server 2000. That is, the mass storage device 2006 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 2004 and mass storage 2006 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 2000 may also run via a remote computer connected through a network such as the Internet. That is, the server 2000 may be connected to the network 2012 through a network interface unit 2011 that is coupled to the system bus 2005, or the network interface unit 2011 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application further provide a computer device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for determining the quality of tag labeling provided by the foregoing method embodiments. Alternatively, the computer device may be a terminal or a server.
Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored on the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the method for determining the label labeling quality provided by the foregoing method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the method for determining the label labeling quality in any of the above embodiments.
Optionally, the computer-readable storage medium may include: read only memory, random access memory, Solid State Drive (SSD), or optical disc, etc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting it; any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (19)

1. A method for determining label labeling quality is characterized by comprising the following steps:
acquiring target sample data from a sample data set, wherein the sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label;
determining similar sample data meeting the similarity requirement with the target sample data from the candidate sample data based on the data content similarity between the target sample data and the candidate sample data, wherein the similar sample data is correspondingly marked with a second sample label, and the candidate sample data is sample data different from the target sample data in the sample data set;
determining label matching information between the first sample label and the second sample label based on label similarity between the first sample label and the second sample label;
and determining the label labeling quality of the target sample data based on the label matching information.
2. The method according to claim 1, wherein determining, from the candidate sample data, similar sample data that meets a similarity requirement with the target sample data based on a data content similarity between the target sample data and the candidate sample data comprises:
obtaining similarity data between the target sample data and the candidate sample data, wherein the similarity data is used for indicating the similarity of data contents between the target sample data and the candidate sample data;
determining the candidate sample data as the similar sample data in response to the similarity data reaching a sample similarity threshold.
3. The method of claim 2, wherein said obtaining similarity data between said target sample data and said candidate sample data comprises:
acquiring a first characteristic representation corresponding to the target sample data and a second characteristic representation corresponding to the candidate sample data;
determining the similarity data based on angle data between the first feature representation and the second feature representation, the angle data being indicative of an included angle condition formed by the first feature representation and the second feature representation in a feature space; alternatively, the similarity data is determined based on distance data between the first feature representation and the second feature representation, the distance data being indicative of a distance situation of the first feature representation and the second feature representation in the feature space.
4. The method of any one of claims 1 to 3, wherein the determining label matching information between the first sample label and the second sample label based on the label similarity between the first sample label and the second sample label comprises:
obtaining similarity data between the target sample data and the similar sample data, wherein the similarity data is determined by the similarity of data contents between the target sample data and the similar sample data;
acquiring matching degree data between the first sample label and the second sample label, wherein the matching degree data is determined by label similarity between the first sample label and the second sample label;
mapping the matching degree data to matching weight data corresponding to the similar sample data, wherein the matching weight data is used for indicating the contribution condition of the similarity data to the label matching information under the matching degree data;
determining the tag matching information based on the matching weight data and the similarity data.
5. The method according to any one of claims 1 to 3, wherein said determining similar sample data meeting a similarity requirement with said target sample data from said candidate sample data based on a data content similarity between said target sample data and said candidate sample data comprises:
determining recall sample data corresponding to the target sample data from the candidate sample data based on the clustering condition among the sample data in the sample data set, wherein the recall sample data and the target sample data are the sample data which belong to the same sample cluster after clustering;
and determining the similar sample data based on the data content similarity between the target sample data and the recalled sample data.
6. The method according to claim 5, wherein said determining, from the candidate sample data, the recalled sample data corresponding to the target sample data based on a clustering condition between sample data in the sample data set comprises:
clustering sample data in the sample data set to obtain a sample cluster set, wherein the sample cluster set comprises a target sample cluster;
obtaining cluster center similarity between the target sample data and the target sample cluster, wherein the cluster center similarity is determined by data content similarity between the target sample data and the cluster center sample data, and the cluster center sample data is used for indicating a cluster center point of the target sample cluster;
and determining sample data in the target sample cluster as the recall sample data in response to that the cluster center similarity corresponding to the target sample cluster meets a cluster similarity condition.
7. The method of any of claims 1 to 3, further comprising:
traversing the sample data in the sample data set as the target sample data to acquire label labeling quality respectively corresponding to the sample data in the sample data set;
and taking the mean value of the label labeling qualities respectively corresponding to the sample data in the sample data set as the set quality data of the sample data set.
8. The method of any of claims 1 to 3, further comprising:
and responding to the failure of matching of the label labeling quality and the label quality condition, and generating a contradiction sample library based on the target sample data and the similar sample data, wherein the contradiction sample library is used for storing the sample data with the similar relation and the labeling contradiction relation.
9. The method of claim 8, wherein the contradictory sample library includes at least one set of contradictory samples;
the generating a contradictory sample library based on the target sample data and the similar sample data comprises:
generating a target sample set according to the target sample data and the similar sample data;
and in response to the target sample set meeting a duplication elimination screening condition, saving the target sample set as the contradictory sample set in the contradictory sample library, wherein the duplication elimination screening condition is used for carrying out duplication elimination on the sample data put in the contradictory sample library.
10. The method according to any one of claims 1 to 3, wherein when said target sample data is labeled with at least two first sample labels, said determining the label labeling quality of said target sample data based on said label matching information comprises:
acquiring a label weight relation between the at least two first sample labels;
and based on the label weight relationship, performing weighted summation on the label labeling quality corresponding to the at least two first sample labels to obtain the comprehensive label labeling quality of the target sample data.
11. The method of any of claims 1 to 3, further comprising:
acquiring set quality data of the sample data set based on the label labeling quality of the target sample data;
responding to the set quality data meeting a set quality condition, inputting sample data in the sample data set to a model to be trained, and outputting a training prediction result;
and performing iterative training on the model parameters of the model to be trained based on the loss value between the training prediction result and the sample label of the sample data to obtain a target model, wherein the target model is used for completing a classification task or a regression analysis task.
12. The method according to claim 11, wherein when the target model is a road image recognition model, sample data in the sample data set is sample image data, and a sample label of the sample data is used for indicating a road condition category of image content of the sample image data;
performing iterative training on the model parameters of the model to be trained based on the loss value between the training prediction result and the sample label of the sample data to obtain a target model, wherein the iterative training comprises the following steps:
and performing iterative training on model parameters of the model to be trained based on the loss value between the training prediction result and the sample label of the sample image data to obtain the road image recognition model, wherein the road image recognition model is used for recognizing input target image content and determining the road condition category corresponding to the target image content.
13. The method of claim 11, wherein when the target model is a text sentiment classification model, sample data in the sample data set is sample text data, and a sample label of the sample data is used for indicating a sentiment category of text content of the sample text data;
performing iterative training on the model parameters of the model to be trained based on the loss value between the training prediction result and the sample label of the sample data to obtain a target model, wherein the iterative training comprises the following steps:
obtaining a target loss value between the training prediction result and a sample label of the sample text data;
performing iterative training on the model parameters of the model to be trained in response to failure in matching the target loss value with a preset loss range; or responding to the target loss value meeting the preset loss range, and obtaining the text emotion classification model, wherein the text emotion classification model is used for identifying the input target text content and determining the emotion type of the target text content.
14. A method for determining label labeling quality, comprising:
displaying an interactive interface, wherein the interactive interface is provided with a label quality determining function, and the label quality determining function is used for determining the label labeling quality of sample data in a sample data set;
receiving a data uploading operation in the interactive interface, wherein the data uploading operation is used for uploading the sample data set comprising target sample data to a server, sample data in the sample data set is correspondingly marked with a sample label, and the target sample data is correspondingly marked with a first sample label;
displaying a quality analysis result, wherein the quality analysis result is used for indicating the label labeling quality of the target sample data; the label labeling quality is determined by the server by first determining, based on the data content similarity between the target sample data and candidate sample data, similar sample data that meets a similarity requirement, and then obtaining label matching information between the first sample label and a second sample label; the similar sample data is correspondingly labeled with the second sample label, and the candidate sample data is sample data in the sample data set that is different from the target sample data.
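The exchange described in claim 14 (upload a labeled sample data set, let the server analyze it, display the result) can be sketched as follows. Everything here is hypothetical: the names `LabeledSample`, `QualityResult`, `stub_server`, and `upload_and_render` are illustrative only, and the stub server treats samples with identical content as "similar", which is a deliberately crude stand-in for the similarity requirement.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LabeledSample:
    content: str  # data content of the sample
    label: str    # sample label attached by an annotator


@dataclass
class QualityResult:
    sample_index: int
    label: str
    quality: Optional[float]  # None when no similar sample exists


def stub_server(samples: List[LabeledSample]) -> List[QualityResult]:
    """Hypothetical server: identical content counts as similar data."""
    results = []
    for i, target in enumerate(samples):
        peer_labels = [s.label for j, s in enumerate(samples)
                       if j != i and s.content == target.content]
        quality = (sum(lbl == target.label for lbl in peer_labels)
                   / len(peer_labels)) if peer_labels else None
        results.append(QualityResult(i, target.label, quality))
    return results


def upload_and_render(samples: List[LabeledSample],
                      analyze: Callable[[List[LabeledSample]],
                                        List[QualityResult]]) -> List[str]:
    """Send the data set to the analysis function and format each
    quality analysis result as one display line."""
    lines = []
    for r in analyze(samples):
        score = "n/a" if r.quality is None else f"{r.quality:.2f}"
        lines.append(f"sample {r.sample_index} label={r.label} quality={score}")
    return lines


uploaded = [
    LabeledSample("great movie", "positive"),
    LabeledSample("great movie", "negative"),
    LabeledSample("great movie", "positive"),
    LabeledSample("bad plot", "negative"),
]
for line in upload_and_render(uploaded, stub_server):
    print(line)
```

Passing the analysis step in as a callable keeps the display logic independent of how the server actually measures similarity.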
15. An apparatus for determining label labeling quality, the apparatus comprising:
an acquisition module, configured to acquire target sample data from a sample data set, wherein sample data in the sample data set is correspondingly labeled with a sample label, and the target sample data is correspondingly labeled with a first sample label;
a determining module, configured to determine, based on a data content similarity between the target sample data and candidate sample data, similar sample data that meets a similarity requirement with the target sample data from the candidate sample data, wherein the similar sample data is correspondingly labeled with a second sample label, and the candidate sample data is sample data in the sample data set that is different from the target sample data;
the acquisition module is further configured to determine, based on the label similarity between the first sample label and the second sample label, label matching information between the first sample label and the second sample label;
the determining module is further configured to determine the label labeling quality of the target sample data based on the label matching information.
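The module pipeline of claim 15 can be sketched in a few lines. This is an illustration under stated assumptions, not the claimed implementation: samples are represented by hypothetical feature vectors, data content similarity is taken to be cosine similarity, the similarity requirement is a threshold of 0.9, label similarity is reduced to exact equality, and the label labeling quality is the fraction of similar samples whose label matches.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def label_quality(dataset, target_index, min_similarity=0.9):
    """Return the share of similar samples whose label agrees with the
    target's label, or None if no candidate is similar enough."""
    target_vec, first_label = dataset[target_index]
    # Candidate sample data: every sample other than the target.
    second_labels = [label
                     for i, (vec, label) in enumerate(dataset)
                     if i != target_index
                     and cosine(target_vec, vec) >= min_similarity]
    if not second_labels:
        return None
    # Label matching information: here, simple equality of labels.
    matches = sum(1 for second in second_labels if second == first_label)
    return matches / len(second_labels)


# Toy data set of (feature vector, sample label) pairs; index 3 sits
# among the "cat" samples but carries the label "dog".
dataset = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([1.0, 0.1], "cat"),
    ([0.95, 0.05], "dog"),
    ([0.0, 1.0], "dog"),
]
for index in range(len(dataset)):
    print(index, dataset[index][1], label_quality(dataset, index))
```

In this toy data the sample at index 3 gets a quality of 0.0 because all three of its similar neighbours are labeled differently, while the isolated sample at index 4 gets no score at all.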
16. An apparatus for determining label labeling quality, the apparatus comprising:
a display module, configured to display an interactive interface, wherein the interactive interface is provided with a function for determining label labeling quality;
a receiving module, configured to receive a data uploading operation in the interactive interface, wherein the data uploading operation is used for uploading a sample data set comprising target sample data to a server, sample data in the sample data set is correspondingly labeled with a sample label, and the target sample data is correspondingly labeled with a first sample label;
the display module is further configured to display a quality analysis result, wherein the quality analysis result is used for indicating the label labeling quality of the target sample data; the label labeling quality is determined by the server by first determining, based on the data content similarity between the target sample data and candidate sample data, similar sample data that meets a similarity requirement, and then obtaining label matching information between the first sample label and a second sample label; the similar sample data is correspondingly labeled with the second sample label, and the candidate sample data is sample data in the sample data set that is different from the target sample data.
17. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of determining label labeling quality as claimed in any one of claims 1 to 14.
18. A computer-readable storage medium, having at least one program code stored therein, the program code being loaded and executed by a processor to implement the method of determining label labeling quality according to any one of claims 1 to 14.
19. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of determining label labeling quality according to any one of claims 1 to 14.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210027995.5A CN114372532A (en) 2022-01-11 2022-01-11 Method, device, equipment, medium and product for determining label marking quality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210027995.5A CN114372532A (en) 2022-01-11 2022-01-11 Method, device, equipment, medium and product for determining label marking quality

Publications (1)

Publication Number Publication Date
CN114372532A (en) 2022-04-19

Family

ID=81143375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210027995.5A Pending CN114372532A (en) 2022-01-11 2022-01-11 Method, device, equipment, medium and product for determining label marking quality

Country Status (1)

Country Link
CN (1) CN114372532A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187956A (en) * 2023-04-25 2023-05-30 上海百通项目管理咨询有限公司 Method and system for generating bidding documents
CN116187956B (en) * 2023-04-25 2023-07-18 上海百通项目管理咨询有限公司 Method and system for generating bidding documents
CN117689998A (en) * 2024-01-31 2024-03-12 数据空间研究院 Nonparametric adaptive emotion recognition model, method, system and storage medium
CN117689998B (en) * 2024-01-31 2024-05-03 数据空间研究院 Nonparametric adaptive emotion recognition model, method, system and storage medium

Similar Documents

Publication Publication Date Title
EP3985578A1 (en) Method and system for automatically training machine learning model
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN110765301B (en) Picture processing method, device, equipment and storage medium
CN113469214A (en) False news detection method and device, electronic equipment and storage medium
CN115687647A (en) Notarization document generation method and device, electronic equipment and storage medium
Wu et al. A multi-level descriptor using ultra-deep feature for image retrieval
CN113656561A (en) Entity word recognition method, apparatus, device, storage medium and program product
Jimenez et al. An empirical study on identifying sentences with salient factual statements
US11176311B1 (en) Enhanced section detection using a combination of object detection with heuristics
CN114372532A (en) Method, device, equipment, medium and product for determining label marking quality
CN117312562A (en) Training method, device, equipment and storage medium of content auditing model
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN116976321A (en) Text processing method, apparatus, computer device, storage medium, and program product
US20220012424A1 (en) Word and image relationships in combined vector space
CN117058432B (en) Image duplicate checking method and device, electronic equipment and readable storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
Berg et al. Do you see what I see? Measuring the semantic differences in image‐recognition services' outputs
CN116796723B (en) Text set matching method and device, electronic equipment and storage medium
US20230419044A1 (en) Tagging for subject matter or learning schema
CN113821498A (en) Data screening method, device, equipment and medium
Jony et al. Domain specific fine tuning of pre-trained language model in NLP
CN114332526A (en) Pathological image classification method, pathological image classification device, pathological image classification equipment, storage medium and program product
Akram et al. From Data Quality to Model Performance: Navigating the Landscape of Deep Learning Model Evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination