CN116226557B - Method and device for picking up data to be marked, electronic equipment and storage medium

Info

Publication number: CN116226557B
Application number: CN202211726541.3A
Authority: CN
Inventors: 柳厅文; 谢明轩; 王玉斌; 谭斌; 刘庆云
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2024-04-19
Anticipated expiration: 2042-12-29
Also published as: CN116226557A

Abstract

The invention discloses a method and a device for picking up data to be marked, an electronic device and a storage medium, and relates to the field of data labeling. The method comprises the following steps: acquiring the labeling task in which a user participates; providing the user, based on the labeling task, with an input format that fixes the form of the data; highlighting the data already collected under the labeling task on the page the user is browsing, to obtain a data capture area; determining, through user behavior, the web page area and content captured by the user in the data capture area; and transmitting the web page area and the content to a data labeling system based on the input format. The method can accurately complete data collection and format verification and synchronously submit the data to the labeling system, thereby greatly improving labeling efficiency.

Description

Method and device for picking up data to be marked, electronic equipment and storage medium

Technical Field

The invention belongs to the field of data annotation, and relates to a method and a device for picking up data to be annotated, electronic equipment and a storage medium.

Background

With the rapid development of deep learning, artificial intelligence applications have spread into many industries. Computer vision and natural language processing in particular have broad application scenarios, such as autonomous driving, face recognition, image search, object detection and intelligent question answering. These are currently the two most popular research areas in deep learning, and training neural networks in them requires large amounts of labeled data, so the demand for high-quality labeled data will persist as the technology develops. Obtaining large amounts of high-quality labeled data efficiently therefore does much to bring deep learning models into practical use. However, different application scenarios often require different labeled data, and such data is not always readily available, so users must label it manually. Many labeling systems and methods can carry out the labeling process itself, but the prior art does not take into account how cumbersome it is for the user to obtain the unlabeled data in the first place: the data usually has to be collected by a crawler and then cleaned before it can be used. For some scenarios the data is not widely distributed on the Internet; because the data is scarce, batch collection by crawler is very inefficient, and the time the user spends screening for valid data increases, so the whole data labeling process takes too much time.

Taking the data labeling system of the granted invention patent (grant number CN 113407980B) as an example, current data labeling methods do not address how unlabeled data is obtained. The data can be used only after a crawler has collected related data from the Internet and the data has been screened and cleaned. Such methods therefore cannot efficiently obtain data that is not widely distributed on the Internet and is difficult to collect by parsing structured web pages, and a great deal of time passes between data collection and labeling.

Disclosure of Invention

To address the above problems and to efficiently complete data collection and labeling so as to produce high-quality labeled data, the invention provides a method and a device for picking up data to be marked, an electronic device and a storage medium. The invention is applicable to text and image data. Even for sparsely distributed data that is difficult to collect with a crawler, it can accurately complete data collection and format verification and synchronously submit the data to the labeling system, thereby greatly improving labeling efficiency.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A method for picking up data to be marked comprises the following steps:

acquiring the labeling task in which a user participates;

providing the user, based on the labeling task, with an input format that fixes the form of the data;

highlighting the data already collected under the labeling task on the page being browsed by the user, to obtain a data capture area;

determining, through user behavior, the web page area and content captured by the user in the data capture area;

and transmitting the web page area and the content to a data labeling system based on the input format.

Further, before the labeling task participated by the user is obtained, the method further includes:

The identity of the user is verified.

Further, the providing an input format for fixed data for the user based on the labeling task includes:

Acquiring the task type of the labeling task; the task types include: text classification, named entity recognition, text generation, image classification, or cross-modal text generation;

in the case that the task type is text classification, the input format is a piece of text;

in the case that the task type is named entity recognition, the input format is a text;

in the case that the task type is text generation, the input format is two texts;

In the case that the task type is image classification, the input format is an image;

In the case that the task type is cross-modal text generation, the input format is two texts and one image;

Or alternatively,

Receiving input format setting for the labeling task sent by the user;

Generating an input format for fixed data based on the input format setting; wherein the input format includes: at least one text and/or at least one image.

Further, the acquired data includes: text data;

The highlighting the collected data under the labeling task on the page being browsed by the user comprises the following steps:

extracting elements from the HTML source code of the page being browsed, wherein the tags of the elements include <a>, <span> and <p>;

and, in the case that an element is a leaf node and the content of the element contains at least one piece of collected data, performing a highlighting operation at the position corresponding to the element on the page being browsed.
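The leaf-node check described above can be illustrated with a minimal sketch. It is written in Python for readability rather than as browser plug-in code, and the names `LeafTextFinder`, `find_collected_leaves` and `VOID_TAGS` are hypothetical; it only identifies which elements would be highlighted and does not modify any page style.

```python
from html.parser import HTMLParser

TARGET_TAGS = {"a", "span", "p"}
# Elements that never have an end tag; they count as children but are not pushed.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafTextFinder(HTMLParser):
    """Collects leaf <a>/<span>/<p> elements whose text contains collected data."""

    def __init__(self, collected):
        super().__init__()
        self.collected = [c for c in collected if c]
        self.stack = []    # open elements: [tag, has_child_element, text_parts]
        self.matches = []  # (tag, text) pairs that should be highlighted

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1] = True  # the parent is no longer a leaf
        if tag not in VOID_TAGS:
            self.stack.append([tag, False, []])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(entry[0] == tag for entry in self.stack):
            return
        while self.stack:
            open_tag, has_child, parts = self.stack.pop()
            if open_tag == tag:
                text = "".join(parts).strip()
                is_leaf_target = tag in TARGET_TAGS and not has_child
                if is_leaf_target and any(c in text for c in self.collected):
                    self.matches.append((tag, text))
                break


def find_collected_leaves(html, collected):
    """Return (tag, text) for every leaf target element containing collected text."""
    finder = LeafTextFinder(collected)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Restricting the match to leaf nodes means that a container paragraph is not highlighted merely because one of its child elements holds collected text.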

Further, the acquired data includes: image data;

and in the case that the acquired data is image data, highlighting the acquired data under the labeling task on the page being browsed by the user, including:

obtaining the MD5 codes of the URLs of all images in the page being browsed by the user;

and, when the MD5 code of the acquired data is the same as the MD5 code of at least one image URL, performing a highlighting operation at the position corresponding to that image on the page being browsed.

Further, the determining, by user behavior, the web page area and the content captured by the user in the data capturing area includes:

after the user selects a web page area or a piece of content with the mouse, monitoring whether the user performs a specific action, wherein the specific action includes: pressing a shortcut key or clicking a menu page;

and, in the event that the user performs the specific action, capturing the selected web page area or content.

Further, the determining, by user behavior, the web page area and the content captured by the user in the data capturing area includes:

acquiring the HTML source code of the page being browsed by the user;

parsing, based on the web page content clicked by the user and using the HTML source code, the xpath in the web page DOM of the element under the cursor;

and finding the positions with the same xpath in the page being browsed by the user, and highlighting the content of the corresponding element nodes to show the captured web page area and content to the user.

A pickup device for data to be marked, comprising:

the task selection module is used for acquiring labeling tasks participated by the user; providing an input format for fixed data for the user based on the labeling task;

The data deduplication module is used for highlighting the acquired data under the labeling task on the page browsed by the user so as to obtain a data capturing area;

the behavior monitoring module is used for determining the webpage area and the content captured by the user in the data capturing area through the user behavior;

And the data transmission module is used for transmitting the webpage area and the content to a data labeling system based on the input format.

An electronic device, the electronic device comprising: a processor and a memory storing computer program instructions; the processor implements the method for picking up the data to be marked according to any one of the above when executing the computer program instructions.

A computer readable storage medium, wherein computer program instructions are stored on the computer readable storage medium, and when executed by a processor, the computer program instructions implement the method for picking up data to be marked according to any one of the above.

Compared with the prior art, the plug-in provided by the invention for data collection has the following advantages:

1. The plug-in collects specified data from the Internet according to user behavior. It can accurately collect both data that appears in batches on structured web pages and small amounts of data, and is fast and efficient.

2. When a user browses a web page, the plug-in marks the content on the page that has already been collected, which reminds the user to avoid repeated collection and improves the effective rate of data collection.

3. The plug-in is linked with the labeling system, and collected data is automatically formatted into the data format required by the labeling system. After pushing the data to the labeling system, the user can start labeling directly in the system without data preprocessing, which reduces the user's time cost.

4. The plug-in handles text and image data, supports collecting a single text, a single image, or text together with images, and can be used for text labeling, image labeling and image-text cross-modal labeling.

Drawings

Fig. 1 is a flowchart of a method for picking up data to be marked according to the present invention.

Fig. 2 is a flowchart of marking collected data for deduplication according to the present invention.

Fig. 3 is a flowchart of single data acquisition according to the present invention.

Fig. 4 is a block diagram of a pickup device for data to be marked according to the present invention.

Fig. 5 is a block diagram of an electronic device of the present invention.

Detailed Description

To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

The method for picking up data to be marked is carried out by a browser (Chrome) plug-in that serves a data labeling system. As shown in Fig. 1, the method comprises the following steps.

Step 110: acquire the labeling task in which the user participates.

Before the labeling task is acquired, the identity of the user needs to be verified; that is, the user can pick up data to be marked only after logging in.

In one embodiment, the plug-in for picking up data to be marked and the data labeling system can use the same authentication system, share the same database and share browser login information, so that logging in once through either the labeling system or the plug-in completes single sign-on.

Step 120: based on the labeling task, provide the user with an input format that fixes the form of the data.

Different tasks require different data formats, so the amount of data needed for each item varies. For example, text classification requires one text, text generation requires two texts, and image classification requires one image. The input format of the data therefore needs to be fixed according to the selected labeling task, so that the collected data can be used directly for data labeling.

In one embodiment, the user selects the task in which he or she participates from the task options. After the plug-in obtains the task type, it defines the data form of the acquisition page according to that type. A mapping between each task type and the number of texts and images in each piece of data for that type is established in advance, as follows:

{"text classification": "text: 1, image: 0",
 "named entity recognition": "text: 1, image: 0",
 "text generation": "text: 2, image: 0",
 "image classification": "text: 0, image: 1",
 "cross-modal text generation": "text: 2, image: 1"}

When the user selects a labeling task, the module looks up in the mapping table the number of texts and images in each piece of data for that task type, and dynamically controls the number of text and image input boxes on the plug-in page, so that the data submitted through the current form meets the requirements of the task type.

In another embodiment, the user needs to create a new task. The user then enters the name of the task and the corresponding number of texts and/or images in the designated fields, and the system synchronously adds the new entry to the mapping table, so that the plug-in can adapt to data collection for different types of tasks.
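The lookup and the registration of a new task described above might be sketched as follows. This is an illustration only, written in Python rather than in plug-in code; the counts come from the mapping above, while the names `InputFormat`, `TASK_FORMATS`, `input_format_for` and `register_task` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputFormat:
    texts: int   # number of text input boxes per piece of data
    images: int  # number of image input boxes per piece of data


# Counts taken from the mapping in the description; key spellings are illustrative.
TASK_FORMATS = {
    "text classification": InputFormat(texts=1, images=0),
    "named entity recognition": InputFormat(texts=1, images=0),
    "text generation": InputFormat(texts=2, images=0),
    "image classification": InputFormat(texts=0, images=1),
    "cross-modal text generation": InputFormat(texts=2, images=1),
}


def input_format_for(task_type, formats=TASK_FORMATS):
    """Look up how many text and image boxes the plug-in page should show."""
    return formats[task_type.strip().lower()]


def register_task(name, texts, images, formats=TASK_FORMATS):
    """Add a user-defined task; it must require at least one text or image."""
    if texts < 0 or images < 0 or texts + images == 0:
        raise ValueError("a task needs at least one text or one image")
    fmt = InputFormat(texts, images)
    formats[name.strip().lower()] = fmt
    return fmt
```

For example, `input_format_for("Text generation")` yields two text boxes and no image box, and a call such as `register_task("image captioning", 1, 1)` makes a new task type available to later lookups.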

Step 130: highlight the data already collected under the labeling task on the page being browsed by the user, to obtain a data capture area.

After the user selects the labeling task, the module obtains from the labeling system all data already collected under that task and marks it on the page being browsed, to remind the user not to collect it again.

Specifically, after the module obtains all the collected data, it traverses the data piece by piece. If a piece of data is text, the module extracts all elements with the tags <a>, <span> and <p> from the HTML source code of the web page; when an element is a leaf node in the DOM tree and contains the text, the module modifies its text style to prompt the user. If a piece of data is an image, the module compares the MD5 code of the image URL with the MD5 codes of the URLs of all images in the web page; if the codes are the same, it covers the image element with a mask layer of the same size (the image is embedded in a div whose color and transparency are set) to indicate to the user that the image has already been collected. A schematic diagram of the marking and deduplication process is shown in Fig. 2.
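The image comparison described above can be sketched as follows, assuming that the labeling system returns the MD5 codes of the URLs of the collected images. The sketch is in Python for illustration; `url_md5` and `images_to_cover` are hypothetical names, and drawing the mask layer itself is left out.

```python
import hashlib


def url_md5(url):
    """Return the hexadecimal MD5 code of an image URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def images_to_cover(page_image_urls, collected_md5s):
    """Return the page image URLs whose MD5 code matches a collected image."""
    collected = set(collected_md5s)
    return [url for url in page_image_urls if url_md5(url) in collected]
```

Because the hash is computed over the URL rather than over the image bytes, the same picture served from a different URL would not be recognized as already collected.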

Step 140: determine, through user behavior, the web page area and content captured by the user in the data capture area.

After the user wakes up the plug-in, for example with a shortcut key, the plug-in determines the web page area and content to be captured from the user's behavior.

Specifically, the plug-in calls the interface chrome.

In the single acquisition mode, the user first selects a piece of content with the mouse, and the module then monitors whether a menu page function item is triggered. When the module detects that the user has pressed a shortcut key or clicked a menu page option, it automatically captures the selected content. When the content is image data (mediaType is image), the module obtains the url of the image and stores the base64 encoding of the image in local storage; the plug-in then dynamically loads the url as an image and displays it on the plug-in page. When the content contains selected text (selectionText), the module obtains the text selected by the user and stores it in local storage; the plug-in then dynamically loads the text and displays it on the plug-in page. The number of text and image input boxes on the plug-in page changes according to the type of the task selected by the user. Until all the text and image input boxes on the plug-in page are filled, the user cannot submit, which ensures that the collected data conforms to the required format. A schematic diagram of the single data acquisition flow is shown in Fig. 3.
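The routing of a captured item into the input boxes, and the rule that submission is blocked until every box is filled, might look like the following sketch. It is written in Python for illustration; the dictionary keys `mediaType`, `srcUrl` and `selectionText` are modeled on the click data of Chrome's context-menu interface, `srcUrl` is an assumption not named in the description, and the class `CaptureForm` is hypothetical. Storing the base64 encoding of the image is omitted.

```python
class CaptureForm:
    """Holds the text and image input boxes for one piece of data."""

    def __init__(self, texts, images):
        self.text_slots = [None] * texts
        self.image_slots = [None] * images

    @staticmethod
    def _fill(slots, value):
        # Put the value into the first empty box of the given kind.
        for index, current in enumerate(slots):
            if current is None:
                slots[index] = value
                return True
        return False  # no free input box of this kind

    def capture(self, info):
        """Route one capture event into a matching box; return True if stored."""
        if info.get("mediaType") == "image" and info.get("srcUrl"):
            return self._fill(self.image_slots, info["srcUrl"])
        if info.get("selectionText"):
            return self._fill(self.text_slots, info["selectionText"])
        return False

    def can_submit(self):
        """Submission is allowed only when every input box is filled."""
        return all(slot is not None for slot in self.text_slots + self.image_slots)
```

A form created as `CaptureForm(2, 1)` corresponds to the cross-modal text generation task and stays unsubmittable until two texts and one image have been captured.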

In the batch acquisition mode, the module automatically obtains the HTML source code of the current page. The user wakes up the plug-in with a shortcut key and clicks a piece of web page content; the module then automatically parses the xpath, in the web page DOM, of the element under the cursor and re-renders the page with the content of all element nodes that share that xpath highlighted, so that the batch acquisition result is shown to the user. Different fields need to be defined for the acquisition results of elements with different xpaths.
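One way to picture the batch mode is the sketch below, which assumes that "the same xpath" means the same chain of tag names from the root with no positional indices; the description does not spell this out. It is written in Python and works on well-formed markup only, and the names `tag_paths` and `batch_select` are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(root):
    """Yield (element, path) pairs in document order.

    The path is an index-free xpath such as /html/body/ul/li.
    """
    stack = [(root, "/" + root.tag)]
    while stack:
        element, path = stack.pop()
        yield element, path
        # Push children in reverse so that the first child is visited first.
        for child in reversed(list(element)):
            stack.append((child, path + "/" + child.tag))


def batch_select(root, clicked):
    """Return the text of every element that shares the clicked element's path."""
    paths = list(tag_paths(root))
    clicked_path = next(path for element, path in paths if element is clicked)
    return [(element.text or "").strip() for element, path in paths if path == clicked_path]
```

Clicking one list item of a structured page would in this way select every item at the same structural position, which is what allows a single click to collect a whole batch.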

Step 150: based on the input format, transmit the web page area and the content to a data labeling system.

The collected data is transmitted in a fixed data format to the data storage module of the data labeling system. Because the task selection module determines the input format of the data in advance according to the task type, the transmitted data meets the format requirement for a single piece of task data and can be used directly in the data labeling flow.
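A minimal sketch of the format check before transmission is given below. The description does not specify the wire format, so the use of JSON and the field names `task_type`, `texts` and `images` are assumptions made for illustration, as is the function name `build_payload`.

```python
import json


def build_payload(task_type, texts, images, expected_texts, expected_images):
    """Serialize one piece of data after checking it against the input format."""
    if len(texts) != expected_texts or len(images) != expected_images:
        raise ValueError(
            f"{task_type!r} expects {expected_texts} text(s) and {expected_images} image(s)"
        )
    # Hypothetical record layout; the labeling system's real schema is not given.
    record = {"task_type": task_type, "texts": list(texts), "images": list(images)}
    return json.dumps(record, ensure_ascii=False)
```

Rejecting a record whose counts differ from the expected format keeps malformed data out of the labeling system, so that no preprocessing is needed on the receiving side.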

In summary, the method and the device can collect the specified data from the Internet according to the user behaviors, and can accurately collect data existing in batches and a small amount of data of the structured web pages, thereby being fast and efficient; when a user browses a webpage, the collected content in the webpage can be marked, so that the repeated collection of the user is avoided, and the effective rate of data collection is improved; the system can be linked with the marking system, collected data is automatically formatted into a data format required by the marking system, a user can directly start marking in the system after pushing the data to the marking system, data preprocessing is not needed, and the time cost of the user is reduced; the method realizes the data acquisition modes of single text, single image and text-image, and can be used for text annotation, image annotation and image-text cross-mode annotation.

Fig. 4 is a block diagram illustrating a data pickup device to be marked according to an exemplary embodiment, the device including: the system comprises a task selection module, a data deduplication module, a behavior monitoring module and a data transmission module.

The data transmission module is used for transmitting the web page area and the content to a data labeling system based on the input format. This exemplary apparatus is the apparatus embodiment corresponding to the above exemplary method; the specific operations of the respective modules may be understood with reference to the description of the method embodiment and are not repeated here.

Fig. 5 is a block diagram of an electronic device, according to an example embodiment. The electronic device may be a computer device, a notebook computer, a server, or other type of electronic device.

An electronic device may include at least one processor and memory. The processor may execute instructions stored in the memory. The processor is communicatively coupled to the memory via a data bus. In addition to the memory, the processor may also be communicatively coupled to input devices, output devices, and communication devices via a data bus.

The processor may be any conventional processor. The processor may include, for example, a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA), a system on a chip (SoC), an application-specific integrated circuit (ASIC), or a combination thereof.

The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

In an embodiment of the present disclosure, the memory stores executable instructions that the processor may read from the memory and execute to implement all or part of the steps of the method for picking up data to be marked in the exemplary embodiment described above.

In addition to the methods and apparatus described above, exemplary embodiments of the present disclosure include a computer program product or a computer-readable storage medium storing the computer program product. The computer program product includes computer program instructions that are executable by a processor to implement all or part of the steps described in the above exemplary embodiments.

The computer program product may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, conventional procedural programming languages such as the "C" programming language or similar programming languages, and scripting languages (e.g., Python). The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

A computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the readable storage medium include: a Static Random Access Memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk, or any suitable combination of the foregoing having one or more electrical conductors.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and embodiments are to be regarded as exemplary only; the disclosure is not limited to the exact construction illustrated and described above, and various modifications and changes may be made without departing from its scope.

Claims

1. A method of picking up data to be annotated, comprising:

Acquiring a labeling task participated by a user;

Highlighting the acquired data under the labeling task on the page being browsed by the user to obtain a data capturing area; wherein,

And in the case that the collected data is text data, highlighting the collected data under the labeling task on the page being browsed by the user, wherein the method comprises the following steps:

Under the condition that the element is a leaf node and the content of the element comprises at least one acquired data, performing a highlighting operation on a position corresponding to the element on the page being browsed;

Obtaining md5 codes of all the image url in the page being browsed by the user;

when the md5 code of the acquired data is the same as the md5 code of at least one image url, performing a highlighting operation on the position corresponding to the image on the page being browsed;

2. The method for picking up data to be annotated according to claim 1, wherein before the annotation task participated by the user is obtained, further comprising:

The identity of the user is verified.

3. The method for picking up data to be annotated according to claim 1, wherein said providing an input format for fixed data to said user based on said annotation task comprises:

Or alternatively,

Receiving input format setting for the labeling task sent by the user;

4. The method for picking up data to be annotated according to claim 1, wherein said determining, by user behavior, web page areas and contents captured by the user in the data capturing area comprises:

5. The method for picking up data to be annotated according to claim 1, wherein said determining, by user behavior, web page areas and contents captured by the user in the data capturing area comprises:

acquiring HTML source codes of pages being browsed by the user;

6. A pickup device for data to be marked, comprising:

The data deduplication module is used for highlighting the acquired data under the labeling task on the page browsed by the user so as to obtain a data capturing area; wherein,

Obtaining md5 codes of all the image url in the page being browsed by the user;

7. An electronic device, the electronic device comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements a method for picking up data to be marked according to any one of claims 1-5.

8. A computer-readable storage medium, on which computer program instructions are stored which, when executed by a processor, implement a method of picking up data to be marked according to any one of claims 1-5.