CN116226557B - Method and device for picking up data to be marked, electronic equipment and storage medium

Info

Publication number: CN116226557B
Authority: CN (China)
Prior art keywords: data, user, page, task, browsed
Legal status: Active (The legal status is an assumption and is not a legal conclusion.)
Application number: CN202211726541.3A
Other languages: Chinese (zh)
Other versions: CN116226557A
Inventors: 柳厅文, 谢明轩, 王玉斌, 谭斌, 刘庆云
Current Assignee: Institute of Information Engineering of CAS (The listed assignees may be inaccurate.)
Original Assignee: Institute of Information Engineering of CAS

Application filed by Institute of Information Engineering of CAS
Priority to CN202211726541.3A
Publication of CN116226557A
Application granted
Publication of CN116226557B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and a device for picking up data to be labeled, an electronic device, and a storage medium, and relates to the field of data labeling. The method comprises: acquiring the labeling task in which a user participates; providing the user, based on the labeling task, with an input format that fixes the form of the data; highlighting, on the page the user is browsing, the data already collected under the labeling task, so as to obtain a data capture area; determining, from user behavior, the web page area and content that the user captures in the data capture area; and transmitting the web page area and content to a data labeling system based on the input format. The method completes data collection and format verification accurately and submits the data to the labeling system synchronously, thereby greatly improving labeling efficiency.

Description

Method and device for picking up data to be marked, electronic equipment and storage medium
Technical Field
The invention belongs to the field of data annotation, and relates to a method and a device for picking up data to be annotated, electronic equipment and a storage medium.
Background
With the rapid development of deep learning, artificial intelligence applications have spread into many industries. Computer vision and natural language processing in particular have broad application scenarios, such as autonomous driving, face recognition, image search, object detection, and intelligent question answering. Computer vision and natural language processing are currently the two most active research areas in deep learning, and training neural networks in these areas requires large amounts of labeled data, so demand for high-quality labeled data will persist as the technology develops. Obtaining large amounts of high-quality labeled data efficiently therefore does much to help deep learning models reach deployment. Different application scenarios, however, often require different labeled data, and such data is not always available ready-made, so users must label it manually. Many labeling systems and methods can carry out the labeling itself, but the prior art does not address how cumbersome it is for the user to obtain the unlabeled data in the first place. Unlabeled data usually has to be collected by a crawler and then cleaned before it can be used. For some scenarios the data is not widely distributed on the Internet; because it is scarce, batch collection by a crawler is very inefficient, and the user spends more time screening for valid data, so the overall labeling process takes too long.
Take the patented data labeling system (granted publication number CN 113407980B) as an example. Current data labeling methods do not address how unlabeled data is obtained: a crawler must first collect the relevant data from the Internet, and the data can be used only after screening and cleaning. Such methods therefore cannot efficiently obtain data that is not widely distributed on the Internet and is difficult to collect by parsing structured web pages, and a great deal of time passes between data collection and labeling.
Disclosure of Invention
To address the above problems and to complete data collection and labeling efficiently so as to produce high-quality labeled data, the invention provides a method and a device for picking up data to be labeled, an electronic device, and a storage medium. The invention applies to text and image data. Even for sparsely distributed data that is difficult to collect with a crawler, it completes data collection and format verification accurately and submits the data to the labeling system synchronously, thereby greatly improving labeling efficiency.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A method for picking up data to be marked comprises the following steps:
Acquiring a labeling task participated by a user;
Providing an input format for fixed data for the user based on the labeling task;
Highlighting the acquired data under the labeling task on the page being browsed by the user to obtain a data capturing area;
Determining a webpage area and content captured by the user in the data capture area through user behaviors;
and transmitting the webpage area and the content to a data labeling system based on the input format.
Further, before the labeling task participated by the user is obtained, the method further includes:
The identity of the user is verified.
Further, the providing an input format for fixed data for the user based on the labeling task includes:
Acquiring the task type of the labeling task; the task types include: text classification, named entity recognition, text generation, image classification, or cross-modal text generation;
in the case that the task type is text classification, the input format is a piece of text;
in the case that the task type is named entity recognition, the input format is a text;
in the case that the task type is text generation, the input format is two texts;
In the case that the task type is image classification, the input format is an image;
In the case that the task type is cross-modal text generation, the input format is two texts and one image;
Or alternatively,
Receiving input format setting for the labeling task sent by the user;
Generating an input format for fixed data based on the input format setting; wherein the input format includes: at least one text and/or at least one image.
Further, the acquired data includes: text data;
The highlighting the collected data under the labeling task on the page being browsed by the user comprises the following steps:
extracting elements from the html source codes of the page being browsed; wherein the tag of the element comprises: <a>, <span> and <p>;
And under the condition that the element is a leaf node and the content of the element comprises at least one acquired data, performing a highlighting operation on the position corresponding to the element on the page being browsed.
Further, the acquired data includes: image data;
and in the case that the acquired data is image data, highlighting the acquired data under the labeling task on the page being browsed by the user, including:
Obtaining md5 codes of all the image url in the page being browsed by the user;
When the md5 code of the acquired data is the same as the md5 code of at least one image url, performing a highlighting operation on the position corresponding to the image on the page being browsed.
Further, the determining, by user behavior, the web page area and the content captured by the user in the data capturing area includes:
After the user uses a mouse to select a section of the web page area or content, monitoring whether the user performs a specific action; wherein the specific action includes: pressing a shortcut key or clicking a menu page;
In the event that the user performs the specific action, the selected web page area or content is captured.
Further, the determining, by user behavior, the web page area and the content captured by the user in the data capturing area includes:
acquiring HTML source codes of pages being browsed by the user;
based on the webpage content clicked by the user, analyzing the xpath of the element in the area where the cursor is positioned in the webpage DOM by utilizing the HTML source code;
And acquiring the positions with the same xpath in the page being browsed by the user, and highlighting the corresponding element node content to show the captured webpage area and the content to the user.
A pickup device for data to be marked, comprising:
the task selection module is used for acquiring labeling tasks participated by the user; providing an input format for fixed data for the user based on the labeling task;
The data deduplication module is used for highlighting the acquired data under the labeling task on the page browsed by the user so as to obtain a data capturing area;
the behavior monitoring module is used for determining the webpage area and the content captured by the user in the data capturing area through the user behavior;
And the data transmission module is used for transmitting the webpage area and the content to a data labeling system based on the input format.
An electronic device, the electronic device comprising: a processor and a memory storing computer program instructions; the processor implements the method for picking up the data to be marked according to any one of the above when executing the computer program instructions.
A computer readable storage medium, wherein computer program instructions are stored on the computer readable storage medium, and when executed by a processor, the computer program instructions implement the method for picking up data to be marked according to any one of the above.
Compared with the prior art, the plug-in unit provided by the invention is used for completing data acquisition and has the following advantages:
1. The plug-in collects specified data from the Internet according to user behavior. It can accurately collect both data that exists in batches on structured web pages and data that exists only in small amounts, so collection is fast and efficient.
2. When a user browses a webpage, the plug-in marks the collected content in the webpage, can remind the user to avoid repeated collection, and improves the effective rate of data collection.
3. The plug-in is linked with the labeling system, and collected data is automatically formatted into the data format the labeling system requires. After pushing the data to the labeling system, the user can start labeling directly in the system without any data preprocessing, which reduces the user's time cost.
4. The plug-in is oriented to text and image data, realizes a data acquisition mode of single text, single image and text-image, and can be used for text annotation, image annotation and image-text cross-mode annotation.
Drawings
Fig. 1 is a flowchart of a method for picking up data to be marked according to the present invention.
Fig. 2 is a flowchart of marking collected data for deduplication according to the present invention.
Fig. 3 is a flowchart of single data acquisition according to the present invention.
Fig. 4 is a block diagram of a pickup device for data to be marked according to the present invention.
Fig. 5 is a block diagram of an electronic device of the present invention.
Detailed Description
In order to make the above features and advantages of the present invention more comprehensible, the following description refers to embodiments accompanied with the present invention.
In the method for picking up data to be marked, collection is performed by a browser (Chrome) plug-in that serves a data labeling system. As shown in Fig. 1, the method comprises the following steps.
Step 110: Acquire the labeling task in which the user participates.
Before the labeling task is acquired, the identity of the user needs to be verified; that is, the user must log in before picking up data to be labeled.
In one embodiment, the plug-in for picking up data to be marked and the data labeling system use the same authentication system, share the same database, and share browser login information, so the user completes single sign-on by logging in through either the labeling system or the plug-in.
Step 120: Based on the labeling task, provide the user with an input format that fixes the form of the data.
Different tasks need different data formats, so the amount of data each item requires varies. For example, text classification requires one text, text generation requires two texts, and image classification requires one image. The input format therefore needs to be fixed according to the selected labeling task, so that the collected data can be used directly for data labeling.
In one embodiment, the user selects the task in which he or she participates from the task options. After the plug-in obtains the task type, it defines the data form of the collection page according to that type. A mapping from each task type to the number of texts and images in each data item of that type is established in advance, as follows:
{"Text classification": "text: 1, image: 0",
"Named entity recognition": "text: 1, image: 0",
"Text generation": "text: 2, image: 0",
"Image classification": "text: 0, image: 1",
"Cross-modal text generation": "text: 2, image: 1"}
When the user selects a labeling task, the module looks up the number of texts and images per data item for that task type in the mapping table and dynamically controls the number of text and image input boxes on the plug-in page, so that the data submitted through the current form meets the requirements of the task type.
In another embodiment, the user needs to create a new task. The user then enters the task name and the corresponding number of texts and/or images in the designated fields, and the system synchronously adds the new entry to the mapping table, so that the plug-in can adapt to data collection for different types of tasks.
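The mapping table and the dynamic control of input boxes described in Step 120 can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the plug-in's actual code: the names `InputFormat`, `TASK_FORMATS`, `register_task`, and `input_boxes` are hypothetical, and Python is used here purely for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputFormat:
    """Number of text and image input boxes that one data item requires."""
    texts: int
    images: int


# Built-in mapping from task type to input format, following the table in the text.
TASK_FORMATS: dict[str, InputFormat] = {
    "text classification": InputFormat(texts=1, images=0),
    "named entity recognition": InputFormat(texts=1, images=0),
    "text generation": InputFormat(texts=2, images=0),
    "image classification": InputFormat(texts=0, images=1),
    "cross-modal text generation": InputFormat(texts=2, images=1),
}


def register_task(name: str, texts: int, images: int) -> InputFormat:
    """Add a user-defined task; it needs at least one text or one image."""
    if texts < 0 or images < 0 or texts + images == 0:
        raise ValueError("a task needs at least one text or one image")
    fmt = InputFormat(texts, images)
    TASK_FORMATS[name.strip().lower()] = fmt
    return fmt


def input_boxes(task_type: str) -> list[str]:
    """Return the ordered input boxes the collection form would show."""
    fmt = TASK_FORMATS[task_type.strip().lower()]
    text_boxes = [f"text_{i + 1}" for i in range(fmt.texts)]
    image_boxes = [f"image_{i + 1}" for i in range(fmt.images)]
    return text_boxes + image_boxes
```

Keeping the format as a pair of counts means a newly created task only needs a new table entry; the form layout follows from the lookup.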
Step 130: Highlight the data already collected under the labeling task on the page the user is browsing, to obtain a data capture area.
After the user selects the labeling task, the module obtains from the labeling system all data already collected under the task and marks that data on the page the user is browsing, to remind the user not to collect it again.
Specifically, after obtaining all the collected data, the module traverses it item by item. If an item is text, the module extracts all elements with the tags <a>, <span>, and <p> from the HTML source code of the web page; when an element is a leaf node in the DOM tree and contains the text, the module modifies its text style to alert the user. If an item is an image, the module compares the md5 code of the image url with the md5 codes of the urls of all images in the web page; if the codes match, it covers the image element with a mask layer of the same size (the image is embedded in a div whose color and transparency are set) to indicate that the image has already been collected. Fig. 2 shows a schematic diagram of this marking and deduplication process.
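The two deduplication checks in Step 130 can be sketched as follows. This is a minimal sketch rather than the plug-in's implementation: it parses an HTML string with Python's standard `html.parser` instead of working on a live DOM, it assumes the "md5 code of the image url" means the MD5 digest of the URL string itself, and the names `LeafTextFinder`, `already_collected_leaves`, `url_md5`, and `already_collected_images` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

TEXT_TAGS = {"a", "span", "p"}
# Void elements never get an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LeafTextFinder(HTMLParser):
    """Collect the text of <a>, <span> and <p> elements without child elements."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: list[dict] = []  # currently open elements
        self.leaves: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if self._stack:
            self._stack[-1]["has_child"] = True
        if tag not in VOID_TAGS:
            self._stack.append({"tag": tag, "text": [], "has_child": False})

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if not any(node["tag"] == tag for node in self._stack):
            return  # stray end tag, ignore it
        # Pop to the matching open tag; unclosed elements in between are dropped.
        while self._stack:
            node = self._stack.pop()
            if node["tag"] == tag:
                if tag in TEXT_TAGS and not node["has_child"]:
                    self.leaves.append((tag, "".join(node["text"]).strip()))
                break


def already_collected_leaves(html: str, collected: list[str]) -> list[tuple[str, str]]:
    """Return (tag, text) for leaf elements containing any collected text."""
    finder = LeafTextFinder()
    finder.feed(html)
    finder.close()
    return [(tag, text) for tag, text in finder.leaves
            if any(item and item in text for item in collected)]


def url_md5(url: str) -> str:
    """MD5 hex digest of the URL string (assumed meaning of 'md5 code')."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def already_collected_images(page_urls: list[str], collected_md5: set[str]) -> list[str]:
    """Return the page image URLs whose MD5 matches a collected image."""
    return [url for url in page_urls if url_md5(url) in collected_md5]
```

In the plug-in the matches would drive a style change or a mask layer; here they are simply returned so that the matching rule itself is visible.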
Step 140: Determine, from user behavior, the web page area and content that the user captures in the data capture area.
After the user wakes up the plug-in, for example with a shortcut key, the plug-in determines the web page area and content to capture from the user's behavior.
Specifically, the plug-in calls the interface chrome.
In the single acquisition mode, the user selects a section of content with the mouse, and the module monitors whether a menu page function item is pressed. When it detects that the user has pressed the shortcut key or clicked a menu page option, the module automatically captures the selected content. When the content is image data (mediaType is image), the module obtains the url of the image and stores the base64 encoding of the image locally; the plug-in then dynamically loads the url as an image and displays it on the plug-in page. When the content contains selected text (selectionText), the module obtains the text selected by the user and stores it locally; the plug-in then dynamically loads the text and displays it on the plug-in page. The number of text and image input boxes on the plug-in page changes with the task type of the task the user selected. Until every text and image input box on the plug-in page has been filled, the user cannot perform the submit operation, which ensures that the collected data conforms to the required format. Fig. 3 shows a schematic diagram of the single data acquisition flow.
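The single acquisition flow amounts to filling a fixed number of slots and blocking submission until all of them are filled. The sketch below illustrates that rule only, under stated assumptions: the class `CaptureForm` and the parameters `selection_text`, `media_type`, and `src_url` are hypothetical names loosely mirroring the fields mentioned in the text, and the sketch keeps the image url instead of a base64 encoding.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CaptureForm:
    """Holds the items captured so far for one data record."""
    texts_needed: int
    images_needed: int
    texts: list[str] = field(default_factory=list)
    images: list[str] = field(default_factory=list)

    def capture(self, selection_text: Optional[str] = None,
                media_type: Optional[str] = None,
                src_url: Optional[str] = None) -> bool:
        """Store one captured item; return False if it could not be stored."""
        if media_type == "image" and src_url:
            if len(self.images) < self.images_needed:
                self.images.append(src_url)
                return True
            return False
        if selection_text and selection_text.strip():
            if len(self.texts) < self.texts_needed:
                self.texts.append(selection_text.strip())
                return True
        return False

    def can_submit(self) -> bool:
        """Submission is allowed only when every input box is filled."""
        return (len(self.texts) == self.texts_needed
                and len(self.images) == self.images_needed)
```

For a task that needs two texts and one image, `can_submit()` stays false until the third item arrives, and any further capture is rejected.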
In the batch acquisition mode, the module automatically obtains the HTML source code of the current page. The user wakes up the plug-in with a shortcut key and clicks content on the web page. The module then automatically resolves the xpath, within the web page DOM, of the element in the area where the cursor is located, and re-renders the page so that the content of element nodes with the same xpath is highlighted, which makes it easy to show the batch acquisition result to the user. Different fields need to be defined for the acquisition results of elements with different xpaths.
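The batch acquisition idea can be sketched by grouping elements that share a path. This is a sketch under two assumptions that the text does not confirm: "the same xpath" is taken to mean the same tag path without positional indices (so sibling list items match one another), and the page is a well-formed XML snippet that the standard `xml.etree.ElementTree` parser can read. The function names `group_by_path` and `batch_select` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_path(root: ET.Element) -> dict[str, list[ET.Element]]:
    """Group every element under its index-free tag path, e.g. /html/body/ul/li."""
    groups: dict[str, list[ET.Element]] = defaultdict(list)

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        groups[path].append(node)
        for child in node:
            walk(child, path)

    walk(root, "")
    return dict(groups)


def batch_select(root: ET.Element, clicked: ET.Element) -> tuple[str, list[str]]:
    """Return the clicked element's path and the text of all elements sharing it."""
    for path, nodes in group_by_path(root).items():
        if any(node is clicked for node in nodes):
            return path, ["".join(node.itertext()).strip() for node in nodes]
    raise ValueError("clicked element is not part of the document")
```

Clicking one item of a list then selects every sibling item with the same path, and each distinct path can be bound to its own field.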
Step 150: Based on the input format, transmit the web page area and content to the data labeling system.
The collected data is transmitted to the data storage module of the data labeling system in the fixed data format. Because the task selection module determines the input format of the data in advance according to the task type, the transmitted data meets the format requirement for a single data item of the task and can be used directly in the data labeling flow.
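The format check that precedes transmission can be sketched as below. The text does not specify a wire format or transport, so the JSON layout, the field names `task_id`, `texts`, and `images`, and the function `build_payload` are all assumptions made for illustration; no network call is shown.

```python
import json


def build_payload(task_id: str, texts_needed: int, images_needed: int,
                  texts: list[str], images: list[str]) -> str:
    """Serialize one data item after checking it against the fixed input format."""
    if len(texts) != texts_needed or len(images) != images_needed:
        raise ValueError(
            f"expected {texts_needed} text(s) and {images_needed} image(s), "
            f"got {len(texts)} and {len(images)}"
        )
    record = {"task_id": task_id, "texts": texts, "images": images}
    # ensure_ascii=False keeps non-ASCII text readable in the payload.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Rejecting a mismatched item before serialization is what lets the labeling system use the received data without a separate preprocessing step.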
In summary, the method and the device can collect the specified data from the Internet according to the user behaviors, and can accurately collect data existing in batches and a small amount of data of the structured web pages, thereby being fast and efficient; when a user browses a webpage, the collected content in the webpage can be marked, so that the repeated collection of the user is avoided, and the effective rate of data collection is improved; the system can be linked with the marking system, collected data is automatically formatted into a data format required by the marking system, a user can directly start marking in the system after pushing the data to the marking system, data preprocessing is not needed, and the time cost of the user is reduced; the method realizes the data acquisition modes of single text, single image and text-image, and can be used for text annotation, image annotation and image-text cross-mode annotation.
Fig. 4 is a block diagram of a pickup device for data to be marked according to an exemplary embodiment. The device includes a task selection module, a data deduplication module, a behavior monitoring module, and a data transmission module.
The task selection module is used for acquiring labeling tasks participated by the user; providing an input format for fixed data for the user based on the labeling task;
The data deduplication module is used for highlighting the acquired data under the labeling task on the page browsed by the user so as to obtain a data capturing area;
the behavior monitoring module is used for determining the webpage area and the content captured by the user in the data capturing area through the user behavior;
And the data transmission module is used for transmitting the webpage area and the content to a data labeling system based on the input format. The exemplary apparatus is an apparatus embodiment corresponding to the above exemplary method, and specific operations of the respective modules may be understood with reference to the description of the method embodiment, which is not repeated herein.
Fig. 5 is a block diagram of an electronic device, according to an example embodiment. The electronic device may be a computer device, a notebook computer, a server, or other type of electronic device.
An electronic device may include at least one processor and memory. The processor may execute instructions stored in the memory. The processor is communicatively coupled to the memory via a data bus. In addition to the memory, the processor may also be communicatively coupled to input devices, output devices, and communication devices via a data bus.
The processor may be any conventional processor. The processor may include, for example, a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a field programmable gate array (Field Programmable Gate Array, FPGA), a system on a chip (System on Chip, SoC), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or a combination thereof.
The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
In an embodiment of the present disclosure, the memory stores executable instructions that the processor may read from the memory and execute to implement all or part of the steps of the method for picking up data to be marked in the exemplary embodiment described above.
In addition to the methods and apparatus described above, exemplary embodiments of the present disclosure include a computer program product or a computer-readable storage medium storing the computer program product. The computer program product includes computer program instructions that are executable by a processor to implement all or part of the steps described in the above exemplary embodiments.
The computer program product may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages, as well as scripting languages (e.g., Python). The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
A computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the readable storage medium include: a Static Random Access Memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk, or any suitable combination of the foregoing having one or more electrical conductors.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. The specification and embodiments are to be regarded as exemplary only, and the disclosure is not limited to the exact construction illustrated and described above, and various modifications and changes may be made without departing from the scope thereof.

Claims (8)

1. A method of picking up data to be annotated, comprising:
Acquiring a labeling task participated by a user;
Providing an input format for fixed data for the user based on the labeling task;
Highlighting the acquired data under the labeling task on the page being browsed by the user to obtain a data capturing area; wherein,
And in the case that the collected data is text data, highlighting the collected data under the labeling task on the page being browsed by the user, wherein the method comprises the following steps:
extracting elements from the html source codes of the page being browsed; wherein the tag of the element comprises: <a>, <span> and <p>;
Under the condition that the element is a leaf node and the content of the element comprises at least one acquired data, performing a highlighting operation on a position corresponding to the element on the page being browsed;
and in the case that the acquired data is image data, highlighting the acquired data under the labeling task on the page being browsed by the user, including:
Obtaining md5 codes of all the image url in the page being browsed by the user;
when the md5 code of the acquired data is the same as the md5 code of at least one image url, performing a highlighting operation on the position corresponding to the image on the page being browsed;
Determining a webpage area and content captured by the user in the data capture area through user behaviors;
and transmitting the webpage area and the content to a data labeling system based on the input format.
2. The method for picking up data to be annotated according to claim 1, wherein before the annotation task participated by the user is obtained, further comprising:
The identity of the user is verified.
3. The method for picking up data to be annotated according to claim 1, wherein said providing an input format for fixed data to said user based on said annotation task comprises:
acquiring the task type of the labeling task; wherein the task type is one of: text classification, named entity recognition, text generation, image classification, or cross-modal text generation;
in the case that the task type is text classification, the input format is one text;
in the case that the task type is named entity recognition, the input format is one text;
in the case that the task type is text generation, the input format is two texts;
in the case that the task type is image classification, the input format is one image;
in the case that the task type is cross-modal text generation, the input format is two texts and one image;
or alternatively,
receiving an input format setting for the labeling task sent by the user;
generating an input format for fixed data based on the input format setting; wherein the input format includes: at least one text and/or at least one image.
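The mapping from task type to input format in claim 3 can be sketched as a lookup table with an optional user-supplied override. This is only an illustration: the `InputFormat` dataclass, the task-type keys, and `resolve_input_format` are hypothetical names, and the shape of the user's format setting (a dict with `texts` and `images` counts) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputFormat:
    texts: int = 0   # number of text fields
    images: int = 0  # number of image fields


# Default formats, following the five task types listed in the claim.
DEFAULT_FORMATS = {
    "text_classification": InputFormat(texts=1),
    "named_entity_recognition": InputFormat(texts=1),
    "text_generation": InputFormat(texts=2),
    "image_classification": InputFormat(images=1),
    "cross_modal_text_generation": InputFormat(texts=2, images=1),
}


def resolve_input_format(task_type, custom=None):
    """Return the user-defined format if one is given, else the task default."""
    if custom is not None:
        fmt = InputFormat(texts=custom.get("texts", 0),
                          images=custom.get("images", 0))
        # The claim requires at least one text and/or at least one image.
        if fmt.texts < 0 or fmt.images < 0 or fmt.texts + fmt.images == 0:
            raise ValueError("format needs at least one text or one image")
        return fmt
    try:
        return DEFAULT_FORMATS[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None


print(resolve_input_format("cross_modal_text_generation"))
print(resolve_input_format("custom_task", {"texts": 3}))
```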
4. The method for picking up data to be annotated according to claim 1, wherein said determining, by user behavior, web page areas and contents captured by the user in the data capturing area comprises:
after the user selects a webpage area or a piece of content by dragging with a mouse, monitoring whether the user performs a specific action; wherein the specific action includes: pressing a shortcut key or clicking a menu page;
in the case that the user performs the specific action, capturing the selected webpage area or content.
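The select-then-confirm behavior of claim 4 amounts to a small state machine: a selection is held as pending and is captured only if a specific action follows. The sketch below simulates this outside a browser; the class `SelectionCapture`, its method names, and the action labels are hypothetical and stand in for whatever event hooks a real implementation would use.

```python
# Actions that confirm a capture, mirroring the two listed in the claim.
CAPTURE_ACTIONS = {"shortcut_key", "menu_click"}


class SelectionCapture:
    def __init__(self):
        self.pending = None  # content selected but not yet confirmed
        self.captured = []   # content confirmed by a specific action

    def on_selection(self, content):
        """Called when the user finishes dragging over an area or text."""
        self.pending = content or None

    def on_action(self, action):
        """Called for each user action; returns True if something was captured."""
        if action in CAPTURE_ACTIONS and self.pending is not None:
            self.captured.append(self.pending)
            self.pending = None
            return True
        return False


monitor = SelectionCapture()
monitor.on_selection("A paragraph worth labeling")
monitor.on_action("scroll")        # not a capture action, selection stays pending
monitor.on_action("shortcut_key")  # confirms the capture
print(monitor.captured)
```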
5. The method for picking up data to be annotated according to claim 1, wherein said determining, by user behavior, web page areas and contents captured by the user in the data capturing area comprises:
acquiring the HTML source code of the page being browsed by the user;
based on the webpage content clicked by the user, resolving, by using the HTML source code, the XPath in the webpage DOM of the element in the area where the cursor is located;
and acquiring the positions having the same XPath in the page being browsed by the user, and highlighting the content of the corresponding element nodes, so as to show the captured webpage area and content to the user.
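The XPath step of claim 5 can be sketched as follows. This is a simplified illustration that assumes the page source is well-formed enough to parse with Python's `xml.etree.ElementTree` (real HTML often is not) and that the clicked element is already known as a tree node. The helper names `build_parent_map`, `xpath_of`, and `resolve` are hypothetical, and the claim's "positions with the same XPath" is interpreted here as resolving the computed path back to the matching node.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """ElementTree nodes have no parent pointer, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def xpath_of(element, root, parents):
    """Absolute, index-qualified XPath of `element`, e.g. /html/body[1]/div[2]."""
    steps = []
    node = element
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")  # 1-based index
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


def resolve(root, xpath):
    """Find the node that an XPath produced by xpath_of points to, or None."""
    prefix = "/" + root.tag
    if xpath == prefix:
        return root
    if not xpath.startswith(prefix + "/"):
        return None
    return root.find("." + xpath[len(prefix):])


doc = "<html><body><div><p>a</p></div><div><p>b</p><p>c</p></div></body></html>"
root = ET.fromstring(doc)
parents = build_parent_map(root)
clicked = root[0][1][1]  # stands in for the element under the cursor
path = xpath_of(clicked, root, parents)
print(path)                      # /html/body[1]/div[2]/p[2]
print(resolve(root, path).text)  # c
```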
6. A pickup device for data to be marked, comprising:
a task selection module, configured to acquire the labeling task in which the user participates, and to provide the user with an input format for fixed data based on the labeling task;
a data deduplication module, configured to highlight, on the page being browsed by the user, the collected data under the labeling task, so as to obtain a data capturing area; wherein,
in the case that the collected data is text data, the highlighting of the collected data under the labeling task on the page being browsed by the user comprises:
extracting elements from the HTML source code of the page being browsed; wherein the tags of the elements comprise: <a>, <span> and <p>;
in the case that an element is a leaf node and the content of the element comprises at least one item of collected data, performing a highlighting operation at the position corresponding to the element on the page being browsed;
and in the case that the collected data is image data, the highlighting of the collected data under the labeling task on the page being browsed by the user comprises:
obtaining the MD5 codes of the URLs of all images in the page being browsed by the user;
in the case that the MD5 code of the collected data is the same as the MD5 code of at least one image URL, performing a highlighting operation at the position corresponding to that image on the page being browsed;
a behavior monitoring module, configured to determine, through user behavior, the webpage area and content captured by the user in the data capturing area;
and a data transmission module, configured to transmit the webpage area and the content to a data labeling system based on the input format.
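The image branch of the deduplication module compares MD5 codes of image URLs against those of already collected data. A minimal sketch using Python's standard `hashlib` is given below; it assumes that each collected image is stored as the MD5 hex digest of its URL, which the claim implies but does not spell out, and the function names and example URLs are hypothetical.

```python
import hashlib


def url_md5(url):
    """MD5 hex digest of an image URL (UTF-8 encoded)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def images_to_highlight(page_image_urls, collected_md5s):
    """Return the page's image URLs whose MD5 matches already collected data."""
    collected = set(collected_md5s)
    return [url for url in page_image_urls if url_md5(url) in collected]


# Assumed store of previously collected images, keyed by URL digest.
already_collected = {url_md5("https://example.com/a.png")}
page_images = ["https://example.com/a.png", "https://example.com/b.png"]
print(images_to_highlight(page_images, already_collected))
```

Hashing the URL rather than the image bytes keeps the check cheap, at the cost of missing the same image served from a different URL.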
7. An electronic device, the electronic device comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements a method for picking up data to be marked according to any one of claims 1-5.
8. A computer-readable storage medium, on which computer program instructions are stored which, when executed by a processor, implement a method of picking up data to be marked according to any one of claims 1-5.
CN202211726541.3A 2022-12-29 2022-12-29 Method and device for picking up data to be marked, electronic equipment and storage medium Active CN116226557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211726541.3A CN116226557B (en) 2022-12-29 2022-12-29 Method and device for picking up data to be marked, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211726541.3A CN116226557B (en) 2022-12-29 2022-12-29 Method and device for picking up data to be marked, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116226557A (en) 2023-06-06
CN116226557B (en) 2024-04-19

Family

ID=86575997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211726541.3A Active CN116226557B (en) 2022-12-29 2022-12-29 Method and device for picking up data to be marked, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116226557B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2508500A1 (en) * 2004-06-24 2005-12-24 Avaya Technology Corp. An architecture for ink annotations on web documents
CN107729475A (en) * 2017-10-16 2018-02-23 深圳视界信息技术有限公司 Web page element acquisition method, device, terminal and computer-readable recording medium
CN110362822A (en) * 2019-06-18 2019-10-22 中国平安财产保险股份有限公司 Text marking method, apparatus, computer equipment and storage medium for model training
CN111639284A (en) * 2020-05-29 2020-09-08 深圳壹账通智能科技有限公司 Webpage labeling method and device, electronic equipment and medium
CN113641838A (en) * 2021-08-11 2021-11-12 上海明略人工智能(集团)有限公司 Device and method for data annotation, electronic equipment and readable storage medium
CN113961848A (en) * 2021-11-09 2022-01-21 北京锐安科技有限公司 Webpage element labeling processing method and device, electronic equipment and storage medium
CN114049631A (en) * 2021-11-06 2022-02-15 企查查科技有限公司 Data labeling method and device, computer equipment and storage medium
CN114461886A (en) * 2022-02-16 2022-05-10 北京百度网讯科技有限公司 Labeling method, labeling device, electronic equipment and storage medium
CN114780891A (en) * 2022-03-14 2022-07-22 中国科学院信息工程研究所 Website key resource analysis method and device based on page rendering contribution degree
WO2022220311A1 (en) * 2021-04-12 2022-10-20 카페24 주식회사 Automatic interworking method, device, and system between heterogeneous platforms

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003131988A (en) * 2001-10-26 2003-05-09 Matsushita Electric Ind Co Ltd Home page update device, home page update method, home page update program recording medium and home page update program
US7962547B2 (en) * 2009-01-08 2011-06-14 International Business Machines Corporation Method for server-side logging of client browser state through markup language
US20110258526A1 (en) * 2010-04-20 2011-10-20 International Business Machines Corporation Web content annotation management web browser plug-in
US20170147159A1 (en) * 2015-11-19 2017-05-25 International Business Machines Corporation Capturing and storing dynamic page state data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
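A microblog data crawling method based on dynamic web page parsing (一种基于动态网页解析的微博数据抓取方法); 钟明翔; 唐晋韬; 谢松县; 王挺; Ship Electronic Engineering (舰船电子工程); 20151020(10); full text *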

Also Published As

Publication number Publication date
CN116226557A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
WO2020134991A1 (en) Automatic input method for paper form, apparatus , and computer device and storage medium
WO2022041406A1 (en) Ocr and transfer learning-based app violation monitoring method
CN111125598A (en) Intelligent data query method, device, equipment and storage medium
CN102158365A (en) User clustering method and system in weblog mining
Patnaik et al. Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks
CN112232352B (en) Automatic pricing system and method for intelligent recognition of PCB drawing
US20200225927A1 (en) Methods and systems for automating computer application tasks using application guides, markups and computer vision
CN113760825A (en) Visual user operation backtracking method and device, computer equipment and storage medium
CN112927776A (en) Artificial intelligence automatic interpretation system for medical inspection report
CN112560411A (en) Intelligent personnel information input method and system
CN112328806A (en) Data processing method, system, computer equipment and storage medium
EP3564833B1 (en) Method and device for identifying main picture in web page
CN116226557B (en) Method and device for picking up data to be marked, electronic equipment and storage medium
CN115373649B (en) Dynamic internet content barrier-free transformation method and device and website content barrier-free transformation method
US20230049389A1 (en) Text-based machine learning extraction of table data from a read-only document
CN116453125A (en) Data input method, device, equipment and storage medium based on artificial intelligence
CN115186240A (en) Social network user alignment method, device and medium based on relevance information
CN111428724B (en) Examination paper handwriting statistics method, device and storage medium
CN115881257A (en) User privacy protection method and system applied to big data
CN113836092A (en) File comparison method, device, equipment and storage medium based on RPA and AI
CN114118072A (en) Document structuring method and device, electronic equipment and computer readable storage medium
TWI680666B (en) Method and system for identifying users on internet
Prasad et al. Face-Based Alumni Tracking on Social Media Using Deep Learning
CN104010111A (en) Image processing method and device
CN210804423U (en) Website information acquisition and release platform system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant