CN111310693A - Intelligent labeling method and device for text in image and storage medium - Google Patents

Intelligent labeling method and device for text in image and storage medium

Info

Publication number
CN111310693A
Authority
CN
China
Prior art keywords: text, image, target, page, screenshot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010118420.5A
Other languages
Chinese (zh)
Other versions
CN111310693B (en)
Inventor
黄杰
袁星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010118420.5A priority Critical patent/CN111310693B/en
Publication of CN111310693A publication Critical patent/CN111310693A/en
Application granted granted Critical
Publication of CN111310693B publication Critical patent/CN111310693B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/0483 Interaction with page-structured environments, e.g. book metaphor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04845 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an intelligent labeling method and device for text in images, an electronic device, and a storage medium. The method comprises the following steps: taking a screenshot of a page containing text content to obtain an image to be annotated that contains the text content, the text content matching a target language; extracting the text content from the page to obtain a target language text of the page; performing optical character recognition on the image to be annotated to obtain an optical character recognition text corresponding to the image to be annotated; acquiring, within the target language text, a target text corresponding to the optical character recognition text; and performing text annotation on the image to be annotated based on the target text to obtain an image annotation sample. With the method and device, images can be automatically annotated with text, sample annotation efficiency is improved, and a large number of annotated samples can be provided for model training in a short time.

Description

Intelligent labeling method and device for text in image and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent labeling method and device for text in images, an electronic device, and a storage medium.
Background
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning. Among these, Computer Vision (CV) is the science of studying how to make machines "see", and generally includes technologies such as image processing, image recognition, image semantic understanding, image retrieval, and Optical Character Recognition (OCR).
As artificial intelligence technology matures, image OCR is applied ever more widely in daily life. To obtain an image OCR recognition model with high recognition accuracy, the OCR model must be trained with a large number of labeled samples. In the related art, training samples are usually constructed through manual labeling, such as manual input, which not only consumes labor cost but also greatly reduces sample labeling efficiency and brings great difficulty to model training.
Disclosure of Invention
The embodiments of the invention provide an intelligent labeling method and device for text in images, an electronic device, and a storage medium, which can automatically annotate images with text, improve sample labeling efficiency, and provide a large number of labeled samples for model training in a short time.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an intelligent labeling method for texts in images, which comprises the following steps:
taking a screenshot of a page containing text content to obtain an image to be annotated that contains the text content, the text content matching a target language;
extracting the text content from the page to obtain a target language text of the page;
performing optical character recognition on the image to be annotated to obtain an optical character recognition text corresponding to the image to be annotated;
acquiring, within the target language text, a target text corresponding to the optical character recognition text;
and performing text annotation on the image to be annotated based on the target text to obtain an image annotation sample, the image annotation sample being used for training an optical character recognition model, so that the trained optical character recognition model can perform text recognition on an input image to be recognized that contains text in the target language and output a recognition text corresponding to the target language.
The embodiment of the present invention further provides an intelligent labeling device for text in images, comprising:
a screenshot module, configured to take a screenshot of a page containing text content to obtain an image to be annotated that contains the text content, the text content matching a target language;
a text extraction module, configured to extract the text content from the page to obtain a target language text of the page;
a recognition module, configured to perform optical character recognition on the image to be annotated to obtain an optical character recognition text corresponding to the image to be annotated;
an obtaining module, configured to acquire, within the target language text, a target text corresponding to the optical character recognition text;
and a labeling module, configured to perform text annotation on the image to be annotated based on the target text to obtain an image annotation sample, the image annotation sample being used for training an optical character recognition model, so that the trained optical character recognition model can perform text recognition on an input image to be recognized that contains text in the target language and output a recognition text corresponding to the target language.
In the above scheme, the screenshot module is further configured to simulate a browsing process of the page based on an automated testing tool;
and, during the simulated browsing process, taking a screenshot of the page to obtain the image to be annotated.
In the above scheme, the screenshot module is further configured to simulate a browser through the automated testing tool and to open a page corresponding to a target website based on the simulated browser;
adjusting the window size of the simulated browser to a target window size;
and cyclically scrolling the page corresponding to the target website in the browser window of the target window size to implement browsing of the page.
In the above scheme, the screenshot module is further configured to acquire a first screenshot time and a screenshot period corresponding to the last screenshot taken of the page;
when it is determined, based on the first screenshot time and the screenshot period, that a second screenshot time has been reached, acquiring a browsing state corresponding to the page;
and when the browsing state indicates that the bottom of the page has not yet been reached, taking a screenshot of the page to obtain the image to be annotated.
In the above scheme, the text extraction module is further configured to extract text content in the page to obtain original text information;
carrying out character coding on the original text information to obtain a corresponding coded text;
and carrying out text cleaning on the coded text to filter symbols of a target type to obtain the target language text.
In the above scheme, the obtaining module is further configured to perform text analysis on the optical character recognition text to obtain single-line texts included in the optical character recognition text;
respectively obtaining text identifications corresponding to the single-line texts, wherein the text identifications are used for identifying the corresponding single-line texts;
and acquiring a single-line target text corresponding to each single-line text in the target language text based on the text identification, and taking each acquired single-line target text as the target text.
In the above scheme, the obtaining module is further configured to extract the first word and the last word of each single-line text respectively, and use the extracted first word and last word as text identifiers of the corresponding single-line text;
correspondingly, the obtaining, based on the text identifier, a single line of target text corresponding to each single line of text in the target language text includes:
and respectively carrying out word matching on the head and tail words of each single-line text and the text in the target language text so as to obtain a single-line target text corresponding to each single-line text in the target language text based on a matching result.
In the above scheme, the labeling module is further configured to determine the target position, in the image to be annotated, of the optical character recognition text corresponding to the target text;
and binding the target text to the target position to achieve text annotation of the image to be annotated.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the intelligent labeling method of the text in the image provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention also provides a computer-readable storage medium, which stores executable instructions, and when the executable instructions are executed by a processor, the intelligent labeling method for the text in the image provided by the embodiment of the invention is realized.
The embodiment of the invention has the following beneficial effects:
An image to be annotated containing text is obtained by taking a screenshot of a page containing text in a target language, and the optical character recognition text of the image to be annotated is obtained using optical character recognition technology; text extraction is then performed on the captured page to obtain the target language text of the page, the target text corresponding to the optical character recognition text is searched for within the target language text, and text annotation is performed on the image to be annotated based on the target text to obtain an image annotation sample. In this way, throughout the text annotation of the image, the content to be annotated no longer needs to be manually identified and entered; text annotation of images is automated, sample annotation efficiency is improved, a large number of annotated samples can be provided for model training in a short time, and the difficulty of model training is eased.
Drawings
FIG. 1 is a schematic diagram of a method for labeling text in an image provided in the related art;
FIG. 2 is a schematic view of an implementation scenario of an intelligent method for labeling a text in an image according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an intelligent method for labeling text in an image according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an image to be annotated according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a target position of an optical character recognition text in an image to be annotated according to an embodiment of the present invention;
FIG. 7 is a data flow diagram of an intelligent method for labeling texts in images according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a method for intelligently labeling text in an image according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an intelligent labeling apparatus for text in an image according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular order; it should be understood that "first/second/third" may be interchanged in specific order or sequence where permitted, so that the embodiments of the invention described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments of the present invention are explained; the following interpretations apply to these terms and expressions.
1) "In response to": used to indicate the condition or state on which a performed operation depends; when the dependent condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay. Unless otherwise specified, there is no restriction on the order in which the operations are performed.
2) Optical Character Recognition (OCR): a computer input technology that converts the characters of bills, newspapers, books, documents, and other printed matter into image information through an optical input method such as scanning, and then converts that image information into usable text using character recognition technology.
To obtain an image OCR recognition model with high recognition accuracy, the OCR model must be trained with a large number of labeled samples. In the related art, training samples are usually constructed through manual labeling, such as manual input. Referring to fig. 1, fig. 1 is a schematic diagram of a method for labeling text in an image provided in the related art: a text input area is presented alongside the image to be annotated, and a worker manually types in the text content of the viewed image to label the image with its text. This way of labeling samples not only consumes labor cost but also greatly reduces sample labeling efficiency.
Nowadays, there is also a need for OCR recognition of text in less common languages such as Cantonese. Because of regional differences in language and culture, when manual annotation is used, language barriers can leave a worker unable to intuitively understand, or even to write out, the text content of the image to be annotated (see the text enclosed by the box in fig. 1), so the worker cannot provide accurate annotation content, which brings great difficulty to model training. In the related art, when a person cannot accurately read or spell text in a language such as Cantonese, the person usually resorts to an external device such as a handwriting tablet, or to a third-party translation engine. However, external devices are costly, imprecise, and easily damaged, while relying on a third-party translation engine adds operation steps to manual annotation, is seriously time-consuming, and further reduces manual annotation efficiency.
Based on this, embodiments of the present invention provide an intelligent labeling method and apparatus for text in an image, an electronic device, and a storage medium, so as to at least solve the above problems in the related art, which will be described below.
An implementation scenario of the intelligent method for labeling a text in an image according to the embodiment of the present invention is described below. Referring to fig. 2, fig. 2 is a schematic view of an implementation scenario of the method for intelligently labeling text in an image according to an embodiment of the present invention, in order to support an exemplary application, a terminal (including the terminal 200-1 and the terminal 200-2) is connected to the server 100 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two, and uses a wireless or wired link to implement data transmission.
The terminal (such as the terminal 200-1) is used for responding to a text annotation instruction triggered by a user and sending a text annotation request to the server; the server 100 is configured to respond to a text annotation request, and perform screenshot on a page including text content to obtain an image to be annotated including the text content; extracting text contents in the page to obtain a target language text of the page; carrying out optical character recognition on the image to be marked to obtain an optical character recognition text corresponding to the image to be marked; acquiring a target text corresponding to the optical character recognition text in the target language text; performing text annotation on an image to be annotated based on the target text to obtain an image annotation sample;
the terminal (such as the terminal 200-1) is also used for responding to the model training instruction aiming at the optical character recognition model and sending a model training request to the server;
the server 100 is also used for responding to the model training request and training an optical character recognition model based on the image labeling sample; thus, the trained optical character recognition model is obtained.
In some embodiments, a user may operate a terminal (e.g., terminal 200-2) to send a text recognition request for an image to be recognized to the server 100, where the image to be recognized contains text in a target language;
in response to the text recognition request, the server 100 performs text recognition on the image to be recognized by using the trained optical character recognition model, outputs a recognition text corresponding to the target language, and returns the recognition text to the terminal (e.g., the terminal 200-2), and the terminal presents the recognition text corresponding to the image to be recognized.
In practical applications, the server 100 may be a server configured independently to support various services, or may be a server cluster; the terminal (e.g., terminal 200-1) may be any type of user terminal such as a smartphone, tablet, laptop, etc., and may also be a wearable computing device, a Personal Digital Assistant (PDA), a desktop computer, a cellular phone, a media player, a navigation device, a game console, a television, or a combination of any two or more of these or other data processing devices.
The following describes in detail a hardware structure of an electronic device of the method for intelligently labeling texts in images according to an embodiment of the present invention, with reference to fig. 3, where fig. 3 is a schematic structural diagram of the electronic device according to the embodiment of the present invention, and an electronic device 300 shown in fig. 3 includes: at least one processor 310, memory 350, at least one network interface 320, and a user interface 330. The various components in electronic device 300 are coupled together by a bus system 340. It will be appreciated that the bus system 340 is used to enable communications among the components connected. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 340 in fig. 3.
The Processor 310 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 330 includes one or more output devices 331, including one or more speakers and/or one or more visual display screens, that enable presentation of media content. The user interface 330 also includes one or more input devices 332, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310.
The memory 350 may include either volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 350 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 350 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.
An operating system 351 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 352 for communicating with other computing devices via one or more (wired or wireless) network interfaces 320, exemplary network interfaces 320 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
a presentation module 353 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 331 (e.g., a display screen, speakers, etc.) associated with the user interface 330;
an input processing module 354 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the intelligent labeling apparatus for text in images provided by the embodiments of the present invention can be implemented in software; fig. 3 illustrates an intelligent labeling apparatus 355 for text in images stored in the memory 350, which can be software in the form of programs and plug-ins, and includes the following software modules: the screenshot module 3551, the text extraction module 3552, the recognition module 3553, the obtaining module 3554, and the labeling module 3555. These modules are logical, and can therefore be arbitrarily combined or further divided according to the functions implemented; their functions are described below.
In other embodiments, the intelligent labeling apparatus for text in image provided by the embodiments of the present invention may be implemented by a combination of hardware and software, and as an example, the intelligent labeling apparatus for text in image provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the intelligent labeling method for text in image provided by the embodiments of the present invention, for example, the processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The following describes in detail an intelligent method for labeling text in an image according to an embodiment of the present invention. Referring to fig. 4, fig. 4 is a schematic flowchart of an intelligent method for labeling a text in an image according to an embodiment of the present invention; in some embodiments, the intelligent annotation method for text in an image may be implemented by a server or a terminal alone, or implemented by a server and a terminal in a cooperative manner, taking the server as an example, the intelligent annotation method for text in an image provided by an embodiment of the present invention includes:
step 401: and the server captures the page containing the text content to obtain the image to be annotated containing the text content.
Here, the text content matches a target language. The target language may be a niche language such as Cantonese, that is, a language that cannot be entered with current input methods, or a language that annotators cannot read or spell.
In practical applications, a large number of relevant websites and web pages containing text in the target language need to be collected, and information such as their addresses and URLs retained. The image to be annotated containing text can then be obtained by taking screenshots of web pages carrying text in the target language.
Illustratively, referring to fig. 5, fig. 5 is a schematic diagram of an image to be annotated provided in an embodiment of the present invention; the image to be annotated is obtained by the server taking a screenshot of a web page containing Cantonese text.
In some embodiments, the server may obtain the image to be annotated containing the text content as follows: simulating the browsing process of the page based on an automated testing tool, and taking screenshots of the page during the simulated browsing to obtain the image to be annotated.
In practical applications, the browser can be driven by an automated testing tool that simulates browser operations such as opening websites and browsing web pages, thereby simulating the browsing process of the page; during the simulated browsing, screenshots of the page are taken to obtain images to be annotated containing the page text.
In some embodiments, the server may simulate the browsing process of the page as follows: simulating a browser through an automated testing tool, and opening the page corresponding to a target website based on the simulated browser; adjusting the window size of the simulated browser to a target window size; and cyclically scrolling the page corresponding to the target website in the browser window of the target window size to implement browsing of the page.
Based on a stored target website address, such as a web page URL, an automated testing tool such as Selenium is started to simulate the browser, and the simulated browser carries out operations such as opening and browsing the page corresponding to the target website.
In practical applications, the window size of the browser can be preset. To ensure that the layout and size of the text in the simulated page meet user viewing standards, so that images to be annotated of a suitable size are obtained, repeated testing showed that a browser window size of 1400px × 900px is most suitable; therefore, 1400px × 900px can be set as the target window size of the browser window. When the browser is simulated through the Selenium tool, its window size can be adjusted to the target window size, making it convenient to obtain images to be annotated of a suitable size.
After the browser window has been adjusted to the target window size, the page corresponding to the target website is cyclically scrolled in the browser window of that size, so as to simulate the page browsing operation.
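As an illustration, this simulated-browsing step might look like the following minimal sketch, assuming Python Selenium with a Chrome driver; the patent names the Selenium tool but fixes neither the language binding nor the browser, so both are assumptions here.

```python
from selenium import webdriver

def open_target_page(url: str) -> webdriver.Chrome:
    driver = webdriver.Chrome()        # simulated browser instance
    driver.set_window_size(1400, 900)  # target window size from the description
    driver.get(url)                    # open the page at the target website (URL)
    return driver
```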
In some embodiments, the server may take screenshots of the page as follows: acquiring the first screenshot time and the screenshot period corresponding to the last screenshot taken of the page; when it is determined, based on the first screenshot time and the screenshot period, that the second screenshot time has been reached, acquiring the browsing state corresponding to the page; and when the browsing state indicates that the bottom of the page has not yet been reached, taking a screenshot of the page to obtain the image to be annotated.
While the browsing process of the page is being simulated, the page is captured to obtain the image to be annotated.
In practical applications, a screenshot period can be set, and screenshots are then taken based on that period. Specifically, for each screenshot, the first screenshot time at which the previous screenshot of the page was taken is acquired; when adding the screenshot period to the first screenshot time shows that the second screenshot time has been reached, the previous screenshot has finished and the next screenshot is about to start. At this point, the browsing state of the current page is checked, that is, whether the simulated browsing has reached the bottom of the page. If the browsing state indicates that the bottom of the page has not yet been reached, the page is captured again. In this way, images to be annotated containing text are obtained.
In practical applications, instead of setting a screenshot period, the content of each screenshot can be determined by the amount of page that the display device's screen can present at one time; that is, each screenshot need only cover the page shown when the current browser window is displayed full screen. Specifically, the page can be scrolled cyclically by executing the JavaScript window.scrollTo(0, document.documentElement.scrollTop + 900), achieving the effect of scrolling exactly one screen-sized page of the display device on each scroll.
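The scroll-and-capture loop could be sketched as follows, reusing the driver from the previous sketch; the file naming, the one-second default period, and the bottom-of-page test (comparing scrollTop plus the viewport height against scrollHeight) are illustrative assumptions, not text fixed by the patent.

```python
import time

def scroll_and_capture(driver, out_prefix: str, period_s: float = 1.0) -> None:
    index = 0
    while True:
        # each capture yields one image to be annotated
        driver.save_screenshot(f"{out_prefix}_{index:04d}.png")
        index += 1
        # scroll exactly one 900px viewport, as in the JavaScript above
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollTop + 900)")
        time.sleep(period_s)  # screenshot period between captures
        at_bottom = driver.execute_script(
            "return document.documentElement.scrollTop"
            " + document.documentElement.clientHeight"
            " >= document.documentElement.scrollHeight")
        if at_bottom:  # browsing state: bottom of the page reached
            break
```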
Step 402: extracting the text content from the page to obtain the target language text of the page.
Here, the text content in the page is extracted, so as to obtain the target language text of the page.
In some embodiments, the server may obtain the target language text of the page by: extracting text contents in the page to obtain original text information; carrying out character coding on the original text information to obtain a corresponding coded text; and carrying out text cleaning on the coded text to filter the symbols of the target type to obtain the target language text.
In practical applications, the text content of the page is extracted to obtain the page's original text information, which is then character-encoded and stored as encoded text in a preset format. Specifically, text extraction from the page can be achieved through an HTML2TEXT library, yielding the page's original text information; the original text information is then character-encoded using UTF-8 to obtain and store a UTF-8 encoded text.
Because the text extracted from the page contains a large amount of useless symbol information, the encoded text must be cleaned to filter out the useless content. In practical applications, the stored encoded text can be imported into a pre-written text cleaning script, which cleans the text and filters out symbols of target types, such as special symbols and emoticons, to obtain the target language text of the corresponding page.
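A sketch of this extraction-and-cleaning step is shown below, assuming the Python html2text package stands in for the HTML2TEXT library named above; the filtering regular expression is an illustrative cleaning rule rather than the patent's actual script.

```python
import re
import html2text

def extract_target_language_text(page_html: str) -> str:
    raw_text = html2text.html2text(page_html)  # original text information
    encoded = raw_text.encode("utf-8", errors="ignore").decode("utf-8")  # UTF-8 text
    # filter target-type symbols (special symbols, emoticons, ...) while keeping
    # CJK characters, Latin letters, digits, basic punctuation and whitespace
    cleaned = re.sub(
        r"[^\u4e00-\u9fff\u3000-\u303f\uFF01-\uFF5EA-Za-z0-9,.!?;:'\"()\s]",
        "", encoded)
    return cleaned
```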
Step 403: performing optical character recognition on the image to be annotated to obtain the optical character recognition text corresponding to the image to be annotated.
OCR is performed on the captured image to be annotated using OCR recognition technology, so as to obtain the OCR text corresponding to the image to be annotated.
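The patent does not name a particular OCR engine; as a hedged illustration only, the recognition step could be driven with the open-source pytesseract wrapper, where the "chi_tra" (traditional Chinese) language pack is likewise an assumption for Cantonese-script pages.

```python
import pytesseract
from PIL import Image

def ocr_image(image_path: str) -> str:
    image = Image.open(image_path)  # captured image to be annotated
    # returns the OCR text corresponding to the image
    return pytesseract.image_to_string(image, lang="chi_tra")
```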
Step 404: acquiring, within the target language text, the target text corresponding to the optical character recognition text.
After the target language text of the page and the OCR text of the image to be annotated have been obtained, the target language text is searched for the text corresponding to the OCR text, which serves as the target text for annotation.
In some embodiments, the server may obtain the target text by: performing text analysis on the optical character recognition text to obtain single-line texts contained in the optical character recognition text; respectively acquiring text identifications corresponding to the single-line texts, wherein the text identifications are used for identifying the corresponding single-line texts; and acquiring a single-line target text corresponding to each single-line text in the target language text based on the text identification, and taking each single-line target text as the target text.
In practical application, firstly, performing text analysis on the OCR text to obtain each single-line text contained in the OCR text; then respectively acquiring text identifiers of the single-line texts to identify the corresponding single-line texts; and acquiring target single-line texts corresponding to the single-line texts in the target language texts through text identification, so that the acquired single-line target texts are used as target texts corresponding to the OCR texts.
In some embodiments, the server may obtain the text identifier corresponding to each single line of text by: and respectively extracting the head and the tail of each single line of text, and taking the extracted head and tail as the text identifications of the corresponding single line of text.
In practical application, when the text identifier corresponding to each single-line text is obtained, the head and the tail of each single-line text can be respectively extracted, and the extracted head and tail are used as the text identifier of each single-line text; in addition, the keywords of each single line of text can be extracted as the text identifiers of the corresponding single line of text.
Based on this, the server can obtain the single-line target text corresponding to each single-line text in the following way: and respectively carrying out word matching on the head and tail words of each single line of text and the text in the target language text so as to obtain a single line of target text corresponding to each single line of text in the target language text based on the matching result.
Word matching is performed between the head and tail words of each single-line text and the text in the target language text to obtain a matching result, and the single-line target text corresponding to each single-line text is then determined within the target language text based on that result. Specifically, using head-and-tail-word search, the target language text is searched, based on the head and tail words of each single-line text, for text matching those words, and the line of the target language text that matches the head and tail words of a single-line text is taken as the single-line target text corresponding to that single-line text.
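A minimal sketch of this head-and-tail-word search follows, assuming that single characters serve as the "head word" and "tail word" of each OCR line; this is one plausible reading of the matching rule, since the patent does not define the word granularity.

```python
def find_target_lines(ocr_text: str, target_language_text: str) -> list[str]:
    candidates = [l.strip() for l in target_language_text.splitlines() if l.strip()]
    matches = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        head, tail = line[0], line[-1]  # text identifier of the single-line text
        for candidate in candidates:
            # a target-language line matching both head and tail words is
            # taken as the single-line target text for this OCR line
            if candidate.startswith(head) and candidate.endswith(tail):
                matches.append(candidate)
                break
    return matches
```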
Step 405: performing text annotation on the image to be annotated based on the target text to obtain an image annotation sample.
The image annotation sample is used for training the optical character recognition model, so that the trained optical character recognition model can perform text recognition on the input image to be recognized, which contains the text of the target language, and output the recognition text corresponding to the target language.
In some embodiments, the server may perform text annotation on the image to be annotated by: determining a target position of an optical character recognition text corresponding to a target text in an image to be marked; and binding the target text with the target position to realize text annotation of the image to be annotated.
When text annotation is performed on the image to be annotated based on the target text, the target position in the image of the OCR text corresponding to the target text is determined first. Referring to fig. 6, fig. 6 is a schematic diagram of the target position of an optical character recognition text in an image to be annotated according to an embodiment of the present invention, where each single line of text is enclosed by a rectangular region box. When the target position of the OCR text corresponding to the target text is determined in the image to be annotated, the position coordinates of the region box enclosing each single line of text in the OCR text are determined; specific coordinate values may be set as required.
On this basis, when the target text is bound to the target position, each single-line target text within the target text can be bound to the position coordinates of the corresponding single-line text in the image to be annotated. Text annotation of the image to be annotated is thereby achieved, yielding an image annotation sample carrying text information. Because the target text corresponds to the target language, text annotation of images in a particular language (such as Cantonese) is automated and sample annotation efficiency is improved; the entire text annotation process no longer requires manual identification or entry of the annotation content, and when annotating images containing text in niche languages such as Cantonese, the process is no longer hindered by language differences. This breaks through the limits of manual capability: text annotation of images can be achieved without understanding the text.
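The bound record could take a shape like the following sketch; the (x, y, width, height) box layout is an assumption, since the patent only requires that each single-line target text be bound to its target position in the image.

```python
from dataclasses import dataclass

@dataclass
class LineAnnotation:
    text: str  # single-line target text in the target language
    box: tuple[int, int, int, int]  # (x, y, w, h) of the line's region box

def build_annotation_sample(image_path: str, lines, boxes) -> dict:
    # one image annotation sample: the screenshot plus its bound line annotations
    return {
        "image": image_path,
        "annotations": [LineAnnotation(t, b) for t, b in zip(lines, boxes)],
    }
```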
After the image annotation sample is obtained, an optical character recognition model can be trained on it, so that the model can perform text recognition on an input image to be recognized containing text in the target language and output the recognition text corresponding to the target language. Illustratively, for an image to be recognized containing Cantonese text, the image is input into the trained optical character recognition model, which performs text recognition on it and outputs the recognition text corresponding to Cantonese, thereby achieving image-to-text conversion through OCR.
By applying the embodiment of the invention, an image to be annotated containing text is obtained by taking a screenshot of a page containing text in the target language, and the optical character recognition text of the image is obtained using optical character recognition technology; text extraction is then performed on the captured page to obtain the target language text of the page, the target text corresponding to the optical character recognition text is searched for within the target language text, and text annotation is performed on the image to be annotated based on the target text to obtain an image annotation sample. Throughout the text annotation of the image, the content to be annotated no longer needs to be manually identified and entered; text annotation of images is automated, sample annotation efficiency is improved, a large number of annotated samples can be provided for model training in a short time, and the difficulty of model training is eased.
An exemplary application of the embodiments of the present invention in a practical application scenario will be described below. Taking the target language as the cantonese language as an example, the intelligent labeling method for the text in the image provided by the embodiment of the invention is continuously explained. Referring to fig. 7 and fig. 8, fig. 7 is a data flow diagram of an intelligent labeling method for a text in an image according to an embodiment of the present invention, and fig. 8 is a schematic flow chart of the intelligent labeling method for a text in an image according to an embodiment of the present invention, where the intelligent labeling method for a text in an image according to an embodiment of the present invention includes:
step 801: and the terminal responds to the text labeling instruction and sends a text labeling request to the server.
Step 802: and responding to the text labeling request by the server, simulating a browser through an automatic testing tool, and opening a page corresponding to the target website based on the browser obtained through simulation, wherein the page comprises the text of the target language.
Here, the target language may be a cantonese language.
In practical applications, because Cantonese text is scarce, a large number of relevant websites and web pages containing Cantonese text need to be collected, and information such as their addresses and URLs retained. The image to be annotated can then be obtained by taking screenshots of web pages containing Cantonese text. In some embodiments, the browser may be driven by an automated testing tool, which simulates the browser's operations of opening the target website, browsing the corresponding web page, and so on.
Here, referring to fig. 7 (step 701), based on the saved URL of the web page (i.e., the target website), the Selenium tool is started to simulate the browser, automating the simulation of operations such as opening and browsing the website or web page at the target address.
Step 803: adjusting the window size of the simulated browser to the target window size.
Here, in practical applications, the window size and magnification factor of the browser can be preset. To ensure that the layout and size of the text in the simulated page meet user viewing standards, so that images to be annotated of a suitable size are obtained, repeated testing showed that a browser window size of 1400px × 900px with a magnification factor of 1.3 is most suitable, so these values can be set as the target window settings for the browser window. When the browser is simulated by the Selenium tool, the window size of the browser may be adjusted to the target window size, see fig. 7 (step 702), so as to conveniently obtain images to be annotated of a suitable size.
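As a brief addition to the earlier browser sketch, the window settings of this step might be applied as follows; using a CSS zoom style for the 1.3 magnification factor is an assumption, since the patent does not specify the mechanism by which the factor is set.

```python
def apply_window_settings(driver) -> None:
    driver.set_window_size(1400, 900)  # target window size
    # magnification factor 1.3, applied here via a CSS zoom style (assumption)
    driver.execute_script("document.body.style.zoom = '1.3'")
```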
Step 804: cyclically scrolling the page corresponding to the target website in the browser window of the target window size to implement browsing of the page.
Here, the page of the target website is cyclically scrolled in the simulated browser window of the target window size to simulate the page browsing operation, and screenshots are taken continuously during the cyclic scrolling, making it convenient to capture all of the web page's text.
Step 805: acquiring the first screenshot time and the screenshot period corresponding to the last screenshot taken of the page.
Step 806: when it is determined, based on the first screenshot time and the screenshot period, that the second screenshot time has been reached, acquiring the browsing state corresponding to the page; and when the browsing state indicates that the bottom of the page has not yet been reached, taking a screenshot of the page to obtain the image to be annotated.
In steps 805 and 806, the page is captured while its browsing process is being simulated.
In some embodiments, a screenshot period may be set, and screenshots taken based on that period. Specifically, for each screenshot, the first screenshot time at which the previous screenshot of the page was taken is acquired; when adding the screenshot period to the first screenshot time shows that the second screenshot time has been reached, that is, the previous screenshot has finished and the next is about to start, the browsing state of the current page is checked, referring to fig. 7 (step 703), that is, whether the simulated browsing has reached the bottom of the page. If the browsing state indicates that the bottom of the page has not yet been reached, the page is captured again. In this way, images to be annotated containing Cantonese text are obtained.
Specifically, whether the simulated browsing has reached the bottom of the page may be determined by checking whether the sum of document.documentElement.scrollTop and the visible window height has reached document.documentElement.scrollHeight.
In practical applications, instead of setting a screenshot period, the content of each screenshot can be determined by the amount of page that the display device's screen can present at one time; that is, each screenshot need only cover the page presented when the current browser window is displayed full screen. Specifically, the page can be scrolled cyclically by executing the JavaScript window.scrollTo(0, document.documentElement.scrollTop + 900), achieving the effect of scrolling exactly one screen-sized page of the display device on each scroll, see fig. 7 (step 704 - step 705).
Step 807: extracting text contents in the page to obtain original text information, and performing character coding on the original text information to obtain a corresponding coded text.
After the screenshots of the page are completed, that is, when the simulated browsing is judged to have reached the bottom of the page (see step 703 and step 706 of fig. 7), text extraction is performed on the page to obtain its original text information; the original text is then character-encoded and stored as an encoded text file in a preset format.
In practical application, the extraction of TEXT content in a page can be realized through an HTML2TEXT library, and the original TEXT information of the page is obtained; the original text information is character-encoded by the UTF-8 encoding method to obtain and store the UTF-8 encoded text file, see fig. 7 (step 707).
Step 808: performing text cleaning on the encoded text to filter out symbols of target types, so as to obtain the target language text.
Here, with continued reference to fig. 7 (step 708), the target language text of the corresponding page may be obtained by importing the saved encoded text into a pre-written text cleaning script, which cleans the encoded text and filters out symbols of target types, such as special symbols and emoticons.
Step 809: performing optical character recognition on the image to be annotated to obtain the optical character recognition text corresponding to the image to be annotated.
Here, continuing to refer to fig. 7 (step 709), OCR is performed on the captured image to be annotated using OCR recognition technology, so as to obtain the OCR text corresponding to the image to be annotated.
Step 810: performing text analysis on the optical character recognition text to obtain the single-line texts contained in the optical character recognition text.
Here, with continued reference to fig. 7 (step 710), the OCR text is text parsed to obtain individual single-line texts included in the OCR text.
Step 811: extracting the head and tail words of each single-line text, and taking the extracted head and tail words as the text identifier of the corresponding single-line text.
Referring to fig. 7 (step 711), the beginning and ending words of each single line of text are extracted, and the extracted beginning and ending words are used as text identifiers of each single line of text. Here, the keywords of the single-line texts can be extracted as corresponding text identifications.
Step 812: performing word matching between the head and tail words of each single-line text and the text in the target language text, so as to obtain, based on the matching result, the single-line target text in the target language text corresponding to each single-line text.
Here, referring to fig. 7 (step 712), the target language text is searched for a text matching the beginning and ending words of the single line of text based on the beginning and ending words of the single line of text by means of beginning and ending word search, and the line of text matching the beginning and ending words of the single line of text in the target language text is taken as the single line of target text corresponding to the single line of text.
The single-line target texts corresponding to the single-line texts are together taken as the target text corresponding to the OCR text.
Step 813: determining the target position, in the image to be annotated, of the optical character recognition text corresponding to the target text.
When text annotation is performed on the image to be annotated based on the target text, firstly, a target position of the OCR text corresponding to the target text in the image to be annotated is determined. Specifically, referring to fig. 6, the position coordinates of the region box where each single line of text included in the OCR text is located are determined.
Step 814: binding the target text to the target position to achieve text annotation of the image to be annotated.
Continuing to refer to fig. 7 (step 713), performing text annotation on the image to be annotated through the obtained target text to obtain an image annotation sample. Specifically, based on step 813, when the target text is bound to the target position, each single line of the target text in the target text may be bound to the position coordinates of the corresponding single line of the target text in the image to be labeled. Therefore, the text annotation of the image to be annotated is realized, and the image annotation sample with text information is obtained.
Based on the above steps, text annotation of images is automated, sample annotation efficiency is improved, and a large number of annotated samples can be provided for model training in a short time. Meanwhile, because manual identification and entry of annotation content are no longer needed, annotating images containing text in niche languages such as Cantonese is no longer hindered by language differences; the limits of manual capability are broken through, text annotation of images can be achieved without understanding the text, and the difficulty of model training is eased.
In practical application, after the image annotation samples are obtained, the optical character recognition model can be trained on them, so that the trained model can perform text recognition on an input image to be recognized that contains text in the target language and output the corresponding recognized text.
Specifically, for an image to be recognized that contains Cantonese text, the image is input into the trained optical character recognition model, which performs text recognition and outputs the recognized Cantonese text, thereby converting the image into characters through OCR.
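Purely as an illustration of this inference stage (the model-loading and prediction calls below are hypothetical placeholders, not an API defined by this embodiment):

    # Hypothetical usage; load_model and predict are placeholder names.
    model = load_model("cantonese_ocr.ckpt")
    with open("image_to_recognize.jpg", "rb") as f:
        recognized_text = model.predict(f.read())
    print(recognized_text)  # recognized Cantonese text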
Continuing with the description of the intelligent annotation device 355 for text in images provided by the embodiment of the present invention: in some embodiments, the device can be implemented as software modules. Referring to fig. 9, fig. 9 is a schematic structural diagram of the intelligent annotation device 355 for text in images; the device includes:
a screenshot module 3551, configured to take a screenshot of a page containing text content to obtain an image to be annotated that contains the text content, the text content matching a target language;
a text extraction module 3552, configured to extract the text content in the page to obtain the target language text of the page;
a recognition module 3553, configured to perform optical character recognition on the image to be annotated to obtain the optical character recognition text corresponding to the image to be annotated;
an obtaining module 3554, configured to obtain the target text corresponding to the optical character recognition text in the target language text;
and a labeling module 3555, configured to perform text annotation on the image to be annotated based on the target text to obtain an image annotation sample, the image annotation sample being used for training an optical character recognition model, so that the trained optical character recognition model can perform text recognition on an input image to be recognized containing text in the target language and output the recognized text corresponding to the target language.
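The module layout of fig. 9 can be pictured as the following structural sketch; the method names and signatures are illustrative stand-ins for the modules, with bodies omitted:

    class TextAnnotationDevice:
        """Structural sketch of device 355; each stub stands in for one module."""

        def screenshot(self, url: str) -> bytes: ...           # screenshot module 3551
        def extract_text(self, url: str) -> list[str]: ...     # text extraction module 3552
        def recognize(self, image: bytes) -> list[str]: ...    # recognition module 3553
        def match(self, ocr_lines, target_lines) -> dict: ...  # obtaining module 3554
        def label(self, image, matches) -> dict: ...           # labeling module 3555

        def annotate_page(self, url: str) -> dict:
            """Run the modules in the order described above."""
            image = self.screenshot(url)
            target_lines = self.extract_text(url)
            ocr_lines = self.recognize(image)
            matches = self.match(ocr_lines, target_lines)
            return self.label(image, matches)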
In some embodiments, the screenshot module 3551 is further configured to simulate a browsing process of the page based on an automated testing tool, and to take screenshots of the page during the simulated browsing to obtain the image to be annotated.
In some embodiments, the screenshot module 3551 is further configured to simulate a browser through the automated testing tool and open the page corresponding to the target website in the simulated browser; adjust the window size of the simulated browser to the target window size; and cyclically scroll the page corresponding to the target website within the target-size browser window to browse the page.
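For example, if Selenium is taken as the automated testing tool (an assumption; the embodiment does not name a specific tool), the simulated browsing could be sketched as:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.set_window_size(1280, 1024)      # adjust the window to the target size
    driver.get("https://example.com/page")  # illustrative target website

    # Cyclically scroll until the bottom of the page is reached.
    at_bottom = False
    while not at_bottom:
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight;"
        )
    driver.quit()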
In some embodiments, the screenshot module 3551 is further configured to obtain the first screenshot time, corresponding to the last screenshot taken of the page, and the screenshot period; when it determines, based on the first screenshot time and the screenshot period, that the second screenshot time has been reached, obtain the browsing state of the page; and when the browsing state indicates that the bottom of the page has not yet been reached, take a screenshot of the page to obtain the image to be annotated.
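One way to realize this timing logic, sketched with a caller-supplied save callback and a configurable period (both assumptions):

    import time

    def capture_until_bottom(driver, period_s, save):
        """Screenshot the page once per period until the bottom is reached."""
        first_shot_time = time.monotonic()  # time of the last screenshot taken
        while True:
            time.sleep(0.1)
            if time.monotonic() - first_shot_time < period_s:
                continue  # the second screenshot time has not yet been reached
            at_bottom = driver.execute_script(
                "return window.innerHeight + window.pageYOffset"
                " >= document.body.scrollHeight;"
            )
            if at_bottom:
                break  # browsing state: bottom reached, stop capturing
            save(driver.get_screenshot_as_png())
            first_shot_time = time.monotonic()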
In some embodiments, the text extraction module 3552 is further configured to extract the text content in the page to obtain original text information; perform character encoding on the original text information to obtain the corresponding encoded text; and perform text cleaning on the encoded text to filter out symbols of a target type, obtaining the target language text.
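A sketch of this extraction pipeline, assuming UTF-8 as the character encoding and treating zero-width and decorative characters as the "target type" of symbols to filter (both choices are assumptions):

    import re

    def clean_target_text(raw_text: str) -> str:
        """Encode extracted page text and strip target-type symbols."""
        encoded = raw_text.encode("utf-8", errors="ignore").decode("utf-8")
        # Filter out non-breaking spaces, zero-width spaces and bullets.
        return re.sub(r"[\u00a0\u200b\u2022\u00b7]+", " ", encoded).strip()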
In some embodiments, the obtaining module 3554 is further configured to perform text parsing on the optical character recognition text to obtain the single-line texts it contains; obtain the text identifier of each single-line text, the text identifier identifying the corresponding single-line text; and obtain, based on the text identifiers, the single line of target text corresponding to each single-line text in the target language text, the obtained single-line target texts together forming the target text.
In some embodiments, the obtaining module 3554 is further configured to extract the beginning and ending words of each single line of text and use them as the text identifier of the corresponding line. Accordingly, obtaining the single line of target text corresponding to each single-line text in the target language text based on the text identifiers includes performing word matching between the beginning and ending words of each single-line text and the text in the target language text, so as to obtain the corresponding single line of target text based on the matching result.
In some embodiments, the labeling module 3555 is further configured to determine the target position, in the image to be annotated, of the optical character recognition text corresponding to the target text, and to bind the target text to the target position to complete the text annotation of the image to be annotated.
An embodiment of the present invention further provides an electronic device, the electronic device comprising:
a memory for storing executable instructions;
and a processor, configured to implement the intelligent labeling method for text in images provided by the embodiment of the present invention when executing the executable instructions stored in the memory.
An embodiment of the present invention further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the intelligent labeling method for text in images provided by the embodiment of the present invention.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM, or any device comprising one or any combination of the above memories. The computer may be any of a variety of computing devices, including intelligent terminals and servers.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. An intelligent labeling method for text in images, characterized by comprising:
taking a screenshot of a page containing text content to obtain an image to be annotated that contains the text content, the text content matching a target language;
extracting the text content in the page to obtain a target language text of the page;
performing optical character recognition on the image to be annotated to obtain an optical character recognition text corresponding to the image to be annotated;
obtaining a target text corresponding to the optical character recognition text in the target language text;
and performing text annotation on the image to be annotated based on the target text to obtain an image annotation sample, the image annotation sample being used for training an optical character recognition model, so that the trained optical character recognition model can perform text recognition on an input image to be recognized containing text in the target language and output a recognized text corresponding to the target language.
2. The method of claim 1, wherein taking a screenshot of the page containing the text content to obtain the image to be annotated containing the text content comprises:
simulating a browsing process of the page based on an automated testing tool;
and taking screenshots of the page during the simulated browsing to obtain the image to be annotated.
3. The method of claim 2, wherein simulating the browsing process of the page based on the automated testing tool comprises:
simulating a browser through the automated testing tool, and opening a page corresponding to a target website in the simulated browser;
adjusting the window size of the simulated browser to a target window size;
and cyclically scrolling the page corresponding to the target website within the browser window of the target size to browse the page.
4. The method of claim 2, wherein taking screenshots of the page during the simulated browsing to obtain the image to be annotated comprises:
obtaining a first screenshot time, corresponding to the last screenshot taken of the page, and a screenshot period;
when it is determined, based on the first screenshot time and the screenshot period, that a second screenshot time has been reached, obtaining a browsing state of the page;
and when the browsing state indicates that the bottom of the page has not been reached, taking a screenshot of the page to obtain the image to be annotated.
5. The method of claim 1, wherein extracting the text content in the page to obtain the target language text of the page comprises:
extracting the text content in the page to obtain original text information;
performing character encoding on the original text information to obtain a corresponding encoded text;
and performing text cleaning on the encoded text to filter out symbols of a target type, obtaining the target language text.
6. The method of claim 1, wherein obtaining the target text corresponding to the optical character recognition text in the target language text comprises:
performing text parsing on the optical character recognition text to obtain single-line texts contained in the optical character recognition text;
obtaining a text identifier of each single-line text, the text identifier identifying the corresponding single-line text;
and obtaining, based on the text identifiers, a single line of target text corresponding to each single-line text in the target language text, the obtained single-line target texts forming the target text.
7. The method of claim 6, wherein obtaining the text identifier of each single-line text comprises:
extracting the beginning and ending words of each single-line text, and using the extracted beginning and ending words as the text identifier of the corresponding single-line text;
and correspondingly, obtaining, based on the text identifiers, the single line of target text corresponding to each single-line text in the target language text comprises:
performing word matching between the beginning and ending words of each single-line text and the text in the target language text, so as to obtain, based on the matching result, the single line of target text corresponding to each single-line text in the target language text.
8. The method of claim 1, wherein performing text annotation on the image to be annotated based on the target text comprises:
determining a target position, in the image to be annotated, of the optical character recognition text corresponding to the target text;
and binding the target text to the target position to complete the text annotation of the image to be annotated.
9. An intelligent labeling apparatus for text in images, characterized in that the apparatus comprises:
a screenshot module, configured to take a screenshot of a page containing text content to obtain an image to be annotated that contains the text content, the text content matching a target language;
a text extraction module, configured to extract the text content in the page to obtain a target language text of the page;
a recognition module, configured to perform optical character recognition on the image to be annotated to obtain an optical character recognition text corresponding to the image to be annotated;
an obtaining module, configured to obtain a target text corresponding to the optical character recognition text in the target language text;
and a labeling module, configured to perform text annotation on the image to be annotated based on the target text to obtain an image annotation sample, the image annotation sample being used for training an optical character recognition model, so that the trained optical character recognition model can perform text recognition on an input image to be recognized containing text in the target language and output a recognized text corresponding to the target language.
10. A computer-readable storage medium storing executable instructions which, when executed, implement the intelligent labeling method for text in images according to any one of claims 1 to 8.
CN202010118420.5A 2020-02-26 2020-02-26 Intelligent labeling method, device and storage medium for text in image Active CN111310693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118420.5A CN111310693B (en) 2020-02-26 2020-02-26 Intelligent labeling method, device and storage medium for text in image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118420.5A CN111310693B (en) 2020-02-26 2020-02-26 Intelligent labeling method, device and storage medium for text in image

Publications (2)

Publication Number Publication Date
CN111310693A 2020-06-19
CN111310693B CN111310693B (en) 2023-08-29

Family

ID=71160262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118420.5A Active CN111310693B (en) 2020-02-26 2020-02-26 Intelligent labeling method, device and storage medium for text in image

Country Status (1)

Country Link
CN (1) CN111310693B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9292739B1 (en) * 2013-12-12 2016-03-22 A9.Com, Inc. Automated recognition of text utilizing multiple images
CN103632388A (en) * 2013-12-19 2014-03-12 百度在线网络技术(北京)有限公司 Semantic annotation method, device and client for image
CN104933138A (en) * 2015-06-16 2015-09-23 携程计算机技术(上海)有限公司 Webpage crawler system and webpage crawling method
WO2020000879A1 (en) * 2018-06-27 2020-01-02 北京字节跳动网络技术有限公司 Image recognition method and apparatus
CN109325414A (en) * 2018-08-20 2019-02-12 阿里巴巴集团控股有限公司 Extracting method, the extracting method of device and text information of certificate information
CN109492143A (en) * 2018-09-21 2019-03-19 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN109919014A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 OCR recognition methods and its electronic equipment
CN109948549A (en) * 2019-03-20 2019-06-28 深圳市华付信息技术有限公司 OCR data creation method, device, computer equipment and storage medium
CN110175657A (en) * 2019-06-05 2019-08-27 广东工业大学 A kind of image multi-tag labeling method, device, equipment and readable storage medium storing program for executing

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914822A (en) * 2020-07-23 2020-11-10 腾讯科技(深圳)有限公司 Text image labeling method and device, computer readable storage medium and equipment
CN111914822B (en) * 2020-07-23 2023-11-17 腾讯科技(深圳)有限公司 Text image labeling method, device, computer readable storage medium and equipment
CN111950397A (en) * 2020-07-27 2020-11-17 腾讯科技(深圳)有限公司 Text labeling method, device and equipment for image and storage medium
CN112685584A (en) * 2021-03-22 2021-04-20 北京世纪好未来教育科技有限公司 Image content labeling method and device
CN113486636A (en) * 2021-07-07 2021-10-08 建信金融科技有限责任公司 Text labeling method and device
CN113657361A (en) * 2021-07-23 2021-11-16 阿里巴巴(中国)有限公司 Page abnormity detection method and device and electronic equipment
CN113836090A (en) * 2021-09-01 2021-12-24 北京来也网络科技有限公司 File labeling method, device, equipment and medium based on AI and RPA
CN117332761A (en) * 2023-11-30 2024-01-02 北京一标数字科技有限公司 PDF document intelligent identification marking system
CN117332761B (en) * 2023-11-30 2024-02-09 北京一标数字科技有限公司 PDF document intelligent identification marking system

Also Published As

Publication number Publication date
CN111310693B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
CN111310693B (en) Intelligent labeling method, device and storage medium for text in image
CN108595583B (en) Dynamic graph page data crawling method, device, terminal and storage medium
CN109474847B (en) Search method, device and equipment based on video barrage content and storage medium
CN103399885B (en) Mining method and device of POI (point of interest) representing images and server
US20130031456A1 (en) Generating a structured document guiding view
CN109828906B (en) UI (user interface) automatic testing method and device, electronic equipment and storage medium
CN103166981A (en) Wireless webpage transcoding method and device
CN104881428B (en) A kind of hum pattern extraction, search method and the device of hum pattern webpage
CN112230838A (en) Article processing method, article processing device, article processing equipment and computer readable storage medium
CN110990010A (en) Software interface code generation method and device
CN111984589A (en) Document processing method, document processing device and electronic equipment
CN106294480A (en) A kind of file layout change-over method, device and examination question import system
CN103942211A (en) Text page recognition method and device
CN112418875A (en) Cross-platform tax intelligent customer service corpus migration method and device
CN115186240A (en) Social network user alignment method, device and medium based on relevance information
CN114064906A (en) Emotion classification network training method and emotion classification method
CN114443022A (en) Method for generating page building block and electronic equipment
CN113569741A (en) Answer generation method and device for image test questions, electronic equipment and readable medium
US20150095314A1 (en) Document search apparatus and method
CN111913563A (en) Man-machine interaction method and device based on semi-supervised learning
CN112016278A (en) Learning text labeling system based on artificial intelligence
US20240126978A1 (en) Determining attributes for elements of displayable content and adding them to an accessibility tree
CN113176878B (en) Automatic query method, device and equipment
CN110515618B (en) Page information input optimization method, equipment, storage medium and device
CN117971078A (en) Content display method for extended learning and related products

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40023593

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant