CN112882678B - Image-text processing method, image-text processing display method, image-text processing device, image-text processing equipment and storage medium - Google Patents


Info

Publication number
CN112882678B
CN112882678B CN202110276188.2A
Authority
CN
China
Prior art keywords
text
recognition result
picture
target picture
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110276188.2A
Other languages
Chinese (zh)
Other versions
CN112882678A (en)
Inventor
龙云翔
姚刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110276188.2A priority Critical patent/CN112882678B/en
Publication of CN112882678A publication Critical patent/CN112882678A/en
Application granted granted Critical
Publication of CN112882678B publication Critical patent/CN112882678B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/14Digital output to display device ; Cooperation and interconnection of the display device with other functional units
    • G06F3/147Digital output to display device ; Cooperation and interconnection of the display device with other functional units using display panels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04845Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/0485Scrolling or panning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words

Abstract

The application discloses an image-text processing method, an image-text display method, and corresponding apparatuses, devices, and storage media, relating to the field of image processing technology and in particular to artificial intelligence and computer vision. The specific implementation scheme is as follows: a target picture with a mixed text-and-image layout is cut according to the display size of the screen to form at least two sub-pictures; text recognition is performed on the text in each sub-picture to obtain a text recognition result; a positional correspondence is established between the text recognition result and the position of the text in the target picture; the text recognition result is then displayed according to this correspondence while the target picture is scrolled in the screen. With this scheme, pictures with mixed text-and-image layout can be processed effectively, and a matching display of the text recognition result is provided.

Description

Image-text processing method, image-text processing display method, image-text processing device, image-text processing equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technology, and in particular, to artificial intelligence techniques and computer vision techniques.
Background
Intelligent terminal devices are widely used in people's daily life and work for browsing information in various forms, such as text, audio, video, and pictures. Different applications may also provide information to the user in different formats as needed.
Picture-type information must be adapted to the screen size of the terminal device. For example, when a picture is displayed on a small-screen terminal, it is typically compressed or otherwise adjusted to fit the small screen.
However, once the picture is compressed, the user may no longer be able to see its content clearly, resulting in a poor browsing experience. Re-typesetting the picture content instead would add extra processing load and is impractical when a large volume of pictures must be displayed.
Disclosure of Invention
The present disclosure provides an image-text processing method, an image-text display method, corresponding apparatuses, electronic equipment, and a storage medium.
According to an aspect of the present disclosure, there is provided an image-text processing method, including:
cutting a target picture with mixed text-and-image layout according to the display size of a screen to form at least two sub-pictures;
performing text recognition on the text in the sub-pictures to obtain a text recognition result;
establishing a positional correspondence between the text recognition result and the position of the text in the target picture, the text recognition result being displayed according to this correspondence while the target picture is scrolled in the screen.
According to another aspect of the present disclosure, there is provided an image-text display method applied to a client, the method including:
loading a target picture with mixed text-and-image layout, the text recognition result of the target picture, and the positional correspondence between the text recognition result and the target picture;
scrolling the target picture in the screen of the terminal where the client is located, and displaying the text recognition result according to the positional correspondence while the target picture is scrolled.
According to another aspect of the present disclosure, there is provided an image-text processing apparatus, including:
a picture cutting module configured to cut a target picture with mixed text-and-image layout according to the display size of a screen to form at least two sub-pictures;
a text recognition module configured to perform text recognition on the text in the sub-pictures to obtain a text recognition result;
a positional-relationship establishing module configured to establish a positional correspondence between the text recognition result and the position of the text in the target picture, the text recognition result being displayed according to this correspondence while the target picture is scrolled in the screen.
According to another aspect of the present disclosure, there is provided an image-text display device configured at a client, the device including:
a data loading module configured to load a target picture with mixed text-and-image layout, the text recognition result of the target picture, and the positional correspondence between the text recognition result and the target picture;
a data display module configured to scroll the target picture in the screen of the terminal where the client is located, and to display the text recognition result according to the positional correspondence while the target picture is scrolled.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image-text processing method or the image-text display method provided by any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the image-text processing method or the image-text display method provided by any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the image-text processing method or the image-text display method provided by any embodiment of the present disclosure.
With this technical scheme, pictures with mixed text-and-image layout can be processed effectively, and a matching display of the text recognition result is provided.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of an image-text processing method provided in an embodiment of the present application;
Fig. 2A is a flowchart of another image-text processing method according to an embodiment of the present application;
fig. 2B is a schematic diagram of a target picture applicable to an embodiment of the present application;
FIG. 3 is a flowchart of another image-text processing method according to an embodiment of the present disclosure;
fig. 4 is a flowchart of another image-text processing method according to an embodiment of the present application;
fig. 5 is a flowchart of an image-text display method provided in an embodiment of the present application;
fig. 6 is a block diagram of an image-text processing device according to an embodiment of the present application;
fig. 7 is a block diagram of an image-text display device according to an embodiment of the present application;
fig. 8 is a block diagram of an electronic device used to implement an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of an image-text processing method provided in an embodiment of the present application. The embodiment is suitable for processing pictures with mixed text-and-image layout so that they can be adapted for display on the screen of a terminal device, and is particularly suitable for displaying long pictures on a small-size screen. The embodiment may be implemented by an image-text processing apparatus realized in hardware and/or software, which may be configured on an electronic device with data processing capability. The electronic device may be a server, in which case the server pre-processes the picture before the terminal loads and displays it; or it may be a terminal, which processes a locally loaded long picture before displaying it.
As shown in fig. 1, the method includes:
s110, cutting the target pictures of the graphic and text mixed arrangement according to the display size of the screen to form at least two sub-pictures;
the screen display size refers to the display window size of the terminal screen for displaying the picture content, is generally rectangular, and can be represented by a length-width size. According to the habit of viewing the content by a user, the user always scrolls up and down in a vertical screen state to view the content in the display window, and the picture is also in a long picture mode, namely the picture length is far longer than the screen height, and the user browses the picture by scrolling up and down. Therefore, the screen presentation size mainly considers the height size of the screen. Of course, those skilled in the art will appreciate that other dimensions of screen presentation dimensions, such as width, may be considered if scrolling in the lateral direction.
The embodiments of the present application are mainly aimed at target pictures combining text and images, such as advertisement pictures and comic pictures.
Before text recognition can be performed on the target picture, the recognition area must be limited, so a cutting process is performed first and the target picture is cut into a plurality of sub-pictures. Specifically, the target picture may be segmented along its height according to the height of the screen display size; the heights of the sub-pictures are correlated with the screen height and may be slightly larger than, equal to, or slightly smaller than it.
Optionally, the cutting operation specifically includes: during the loading of the target picture, cutting the already loaded portion of the target picture according to the screen display size to form at least two sub-pictures.
That is, in this embodiment the cutting can be performed while the target picture is still loading, rather than only after loading completes. The cutting rule may be based on the size of the target picture itself, for example dividing its total length equally; in this embodiment the cutting rule is preferably determined by the screen display size, so that cutting can start as soon as the loaded portion satisfies the cutting condition, without waiting for the whole picture to load. This is especially suitable when a client loads pictures from a server for display: the client may display the picture while it is still loading, and if the user loses interest halfway and exits, loading can simply stop. Processing during loading lets pictures be provided to the client for display more promptly, without waiting for loading to finish.
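The progressive-loading behaviour described above can be sketched as pure slicing logic; the function name and parameters below are illustrative assumptions, not part of the patent:

```python
def ready_slices(loaded_height, slice_height, fully_loaded=False):
    """Return the (top, bottom) pixel ranges of sub-pictures that can
    already be cut from the portion of the picture loaded so far.

    Only complete slices are emitted while loading is in progress; the
    trailing remainder is cut once the whole picture has arrived.
    """
    ranges = []
    top = 0
    while top + slice_height <= loaded_height:
        ranges.append((top, top + slice_height))
        top += slice_height
    if fully_loaded and top < loaded_height:
        ranges.append((top, loaded_height))  # final, shorter slice
    return ranges
```

Such a helper would be called each time more picture data arrives, so new sub-pictures can be cut without waiting for the full download.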
There are various concrete cutting rules. Optionally, cutting the target picture with mixed text-and-image layout according to the display size of the screen to form at least two sub-pictures includes:
cutting the target picture according to a slice interception height determined from the screen display size, to form at least two sub-pictures;
where the slice interception height is a set value that may be larger than, equal to, or smaller than the screen display size.
The operation above is picture cropping: after the original target picture is obtained, especially a long picture, it is cropped and compressed to improve the loading speed on the terminal and to match a small screen. A comic strip, for instance, is formed by splicing N sub-pictures in sequence. The slice interception heights used within a single cut may be the same or different.
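As a minimal sketch (an assumed helper, not from the patent), the fixed-slice-height cut can be expressed as pure range arithmetic, where `slice_height` stands in for the slice interception height derived from the screen display size:

```python
def compute_slice_ranges(total_height, slice_height):
    """Split a tall target picture into vertical (top, bottom) pixel
    ranges of at most slice_height each; the last slice may be shorter
    when total_height is not a multiple of slice_height."""
    ranges = []
    for top in range(0, total_height, slice_height):
        ranges.append((top, min(top + slice_height, total_height)))
    return ranges
```

Each range would then be used to crop one sub-picture from the target picture, e.g. with an image library's crop call.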
S120, performing text recognition on the text in the sub-pictures to obtain a text recognition result;
Text recognition is performed on each sub-picture to obtain its text. The recognized text may be used directly as the text recognition result, or processed further as needed. For example, the text may be converted into audio, and the audio used as the text recognition result. There are two ways to perform text-to-speech (TTS): (1) converting offline and storing the audio in advance; (2) synthesizing and playing the speech in real time according to the user's browsing position in the picture.
Specifically, performing text recognition on the text in the sub-pictures to obtain a text recognition result includes:
performing text recognition on the text in the sub-pictures to obtain the recognized text;
performing speech conversion on the recognized text to obtain the text recognition result in speech form.
There are various means of text recognition, for example optical character recognition (OCR). A text recognition interface provided by another program may be invoked to perform the recognition. If OCR software is invoked, its interface returns, for each line, the line coordinates (i.e. the upper-left corner of the rectangle containing the recognized line) together with the text the line contains. OCR may be run on sub-pictures of the original image, and the pictures may then be compressed to a suitable resolution for small-screen devices to save data traffic.
For text recognition, the text boxes in a sub-picture are typically determined first, such as the line box containing the text, the minimal rectangle surrounding it, or another bounding outline. Text that is far apart is divided into different text boxes, and the text within each box is then recognized.
S130, establishing a positional correspondence between the text recognition result and the position of the text in the target picture;
the text recognition result being displayed according to this correspondence while the target picture is scrolled in the screen.
The position of the text corresponding to a text recognition result may be an absolute coordinate or a relative position within the target picture; the embodiment does not limit this, so long as the relationship between the original text and the target picture is indicated. For example, if the text recognition result corresponds to a text box, the correspondence may be the position of that text box in the target picture.
The text recognition result is displayed in coordination with the target picture: when the picture is scrolled to the corresponding position, the recognition result is shown, so that the user obtains it synchronously while browsing the target picture, effectively supplementing or enhancing the browsing experience.
The technical scheme of this embodiment is particularly suitable for terminal devices with small screens (for example, below 3 inches), such as smart watches and children's smart toys — wearable small-screen devices whose displays are typically smaller than a phone screen. If a long picture is displayed on a small screen with its width fitted to the screen width, it is often shrunk so much that the user cannot see it clearly, and a very long picture is inconvenient to zoom for reading text. Re-typesetting every long picture to fit the screen would require an enormous amount of computation and would not meet the need to display pictures efficiently and quickly. In this scheme, the target picture is fitted to the screen width and scrolled, and the text recognition result is displayed synchronously during scrolling — for example as additional subtitles, or by playing speech generated from the text — thereby helping the user understand the textual content of the picture. Because a uniform pipeline of cutting, text recognition, and position mapping is applied to all long pictures, the scheme is generally applicable to long pictures of any content, has a low processing cost, and can quickly prepare long pictures for users.
When a server provides loading and display services for the target picture, the server may execute this embodiment in advance, preprocessing the target picture to produce the text recognition result; or the server may perform text recognition in real time during loading. Alternatively, the scheme may be carried out by an image-text processing plug-in configured in the client: when the client loads the target picture locally, the plug-in is invoked to perform cutting, text recognition, and position matching, so that the text recognition result is displayed while the picture is browsed.
Fig. 2A is a flowchart of another image-text processing method according to an embodiment of the present application. Based on the technical solution of the previous embodiment, it further describes an optional implementation of the cutting operation.
In this embodiment, adjacent sub-pictures share an overlapping redundant area located at the upper edge and/or lower edge of a sub-picture. After cutting, some text may span two sub-pictures: as shown in Fig. 2B, a long picture is cut into 4 sub-pictures, and the text "watch cartoon" spans the 1st and 2nd sub-pictures. With such a plain cut, it is difficult to correctly recognize "watch cartoon" in either the 1st or the 2nd sub-picture. This embodiment therefore adds a redundant area at the upper and/or lower edge of each sub-picture that overlaps the adjacent sub-picture. As shown in Fig. 2B, a redundant area (20 pixels) at the lower edge belongs to the 1st sub-picture, so the 1st sub-picture contains the whole text "watch cartoon" and the text can be recognized accurately.
As shown in fig. 2A, the present embodiment includes:
s210, determining the size of the redundant area according to the content type and/or the text font size of the target picture;
the redundant area is set to avoid that the text spans two sub-pictures, so the redundant area can be set according to the size of the text font. Since the text in a picture may have various sizes, it is possible to comprehensively consider, for example, setting the size of the redundant area according to the maximum text font. The content type of the target picture can also indirectly reflect the text font size. For example, the text font size in a caricature is generally distinguishable from the text font size in a child training material. Therefore, the size of the redundant area may also be set according to the content type of the target picture. The content type and/or text font size may be dynamically identified after the picture is obtained, but preferably the target picture will typically have a preset content tag, from which the size of the redundant area may be determined directly.
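Combining the cutting rule with the redundant area, the overlapping slice ranges might be computed as below; the 20-pixel redundancy mirrors the Fig. 2B example, and all names are illustrative assumptions:

```python
def compute_overlapping_ranges(total_height, slice_height, redundancy):
    """Cut a tall picture so each sub-picture keeps `redundancy` extra
    pixels at its lower edge, overlapping the next sub-picture, so that
    text straddling a cut appears whole in at least one sub-picture."""
    ranges = []
    for top in range(0, total_height, slice_height):
        bottom = min(top + slice_height + redundancy, total_height)
        ranges.append((top, bottom))
        if bottom == total_height:  # remainder absorbed into last slice
            break
    return ranges
```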
S220, cutting the target picture with mixed text-and-image layout according to the display size of the screen to form at least two sub-pictures;
S230, performing text recognition on the text in the sub-pictures to obtain a text recognition result;
S240, establishing a positional correspondence between the text recognition result and the position of the text in the target picture, the text recognition result being displayed according to this correspondence while the target picture is scrolled in the screen.
This technical scheme fully accounts for the accuracy of text recognition by reserving the redundant area.
Fig. 3 is a flowchart of another image-text processing method according to an embodiment of the present application. Based on the foregoing embodiments, it further describes how the positional correspondence between the text recognition result and the target picture is determined. The embodiment includes the following steps:
S310, cutting the target picture with mixed text-and-image layout according to the display size of the screen to form at least two sub-pictures;
S320, performing text recognition on the text in the sub-pictures to obtain a text recognition result, the text recognition result being displayed according to the positional correspondence while the target picture is scrolled in the screen;
S330, adjusting the absolute coordinate positions of the text of the text recognition results within their sub-pictures to absolute coordinate positions in the target picture, according to the positions of the sub-pictures in the target picture;
For a target picture, coordinate values can identify the location of content in the picture. A coordinate-system origin is generally set, for example the upper-left corner of the target picture as the (0, 0) point, with the pixel as the coordinate unit. Referring to Fig. 2B, the x-axis extends rightward from the origin and the y-axis extends downward, so the absolute coordinate position of any point in the picture can be identified by its coordinate values. The position of each sub-picture in the target picture can be expressed by the absolute coordinate of its upper-left corner together with its height: for example, as shown in Fig. 2B, the 2nd sub-picture has its upper-left corner at (0, 60) and a height of 60 pixels.
A text recognition result recognized from a sub-picture initially has a position relative to that sub-picture. For example, as shown in Fig. 2B, a text recognition result whose text box has its upper-left corner at (10, 10) in the 3rd sub-picture has coordinates determined with the upper-left corner of the 3rd sub-picture as the (0, 0) origin. From this relative position (10, 10) and the position of the 3rd sub-picture in the target picture, (0, 120), the absolute coordinate position of the text recognition result in the target picture is adjusted to (10, 130). In this way, every text recognition result is expressed in the uniform coordinate system of the target picture.
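The coordinate adjustment in this example can be written directly; the tuple representation is an assumption for illustration:

```python
def to_absolute(box_origin, sub_origin):
    """Convert a text box's upper-left corner from sub-picture-relative
    coordinates to absolute coordinates in the target picture by adding
    the sub-picture's own origin within the target picture."""
    bx, by = box_origin
    ox, oy = sub_origin
    return (ox + bx, oy + by)
```

This reproduces the Fig. 2B example: a box at (10, 10) in the 3rd sub-picture, whose origin is (0, 120), maps to (10, 130).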
S340, clustering the text recognition results according to their absolute coordinate positions in the target picture;
Before the positional correspondence is determined, the text must also be clustered semantically. OCR, for example, recognizes text line by line, with no notion of semantics, so the line-level text must be aggregated into semantically related sentences; a sentence may occupy several lines. In pictures with mixed text-and-image layout the text boxes are placed arbitrarily — two text boxes side by side may belong to different sentences — so clustering is needed. The dialogue bubbles in a comic, for instance, contain text with different semantics.
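A crude sketch of the line-to-sentence aggregation, merging consecutive lines only when the vertical gap between them is small; a real implementation would also consider horizontal overlap (e.g. side-by-side dialogue bubbles) and semantics. The tuple layout and threshold are assumptions:

```python
def cluster_lines(line_boxes, max_gap):
    """Aggregate OCR line results into sentence clusters.

    line_boxes: iterable of (x, y, width, height, text) in absolute
    target-picture coordinates. A line whose vertical gap to the previous
    cluster is at most max_gap pixels is merged into that cluster.
    """
    clusters = []
    last_bottom = None
    for x, y, w, h, text in sorted(line_boxes, key=lambda b: b[1]):
        if clusters and y - last_bottom <= max_gap:
            clusters[-1].append(text)
        else:
            clusters.append([text])
        last_bottom = y + h
    return [" ".join(c) for c in clusters]
```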
S350, determining the attribution relation between the clustered text recognition result and the sub-picture according to the absolute coordinate position of the text of the clustered text recognition result in the target picture, and taking the attribution relation as the corresponding position relation.
The clustered text boxes may be rectangular or the like, and the absolute coordinate position of a clustered text box in the target picture can be expressed through size information such as the upper left corner, height and width of the rectangle. According to this absolute coordinate position, the sub-picture to which the text recognition result belongs can then be determined. Optionally, the text of the clustered text recognition result is determined to belong to the sub-picture in which it occupies the largest area, according to the absolute coordinate position of that text in the target picture. Alternatively, a text box may be determined to belong to a sub-picture when more than one half of its total area falls within that sub-picture.
After the sub-picture to which a text recognition result belongs is determined, the text recognition result can be used as follows: while the target picture is scrolled through the screen, the text recognition result belonging to the currently displayed sub-picture is displayed. As the user scrolls through the target picture, if it is detected that a certain sub-picture has entered the screen range, the text recognition result corresponding to that sub-picture is displayed. A sub-picture may be regarded as having entered the screen range when the area of the sub-picture inside the screen reaches a set proportion, for example one half.
Text located within the redundant area may be included in both of two adjacent sub-pictures. For example, the "watch cartoon" text in FIG. 2B belongs to both the 1st sub-picture and the 2nd sub-picture, so the text "watch cartoon" is repeated in the text recognition results of the two sub-pictures. In this case, two text recognition results both recognizing "watch cartoon" are generated; after the absolute coordinate positions of the text boxes where the two results are located are respectively determined, a deduplication process may be performed, based on the absolute coordinate positions, on the text either before or after clustering. When the outlines and positions of two text boxes are highly similar, the text boxes are considered to be the same, and the repeated text recognition result of one of them can be removed.
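The deduplication based on highly similar outlines and positions can be sketched as follows. `dedupe_boxes`, the box tuple layout, and the pixel tolerance `tol` are assumptions for illustration; the patent only states that highly similar boxes are merged:

```python
def dedupe_boxes(boxes, tol=3):
    """Remove duplicate recognition results produced by the overlapping
    redundant area: two boxes whose absolute positions and sizes agree
    within `tol` pixels and whose text matches are treated as the same."""
    kept = []
    for box in boxes:  # box = (x, y, w, h, text) in target-picture coords
        duplicate = any(
            box[4] == k[4]
            and all(abs(a - b) <= tol for a, b in zip(box[:4], k[:4]))
            for k in kept
        )
        if not duplicate:
            kept.append(box)
    return kept

boxes = [(0, 50, 120, 20, "watch cartoon"),   # from the 1st sub-picture
         (1, 51, 119, 20, "watch cartoon"),   # same text, from the 2nd
         (10, 130, 80, 20, "other line")]
print(len(dedupe_boxes(boxes)))  # 2
```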
According to the technical scheme, the positions of the character recognition results can be effectively expressed and adjusted to perform semantic-based clustering and sub-picture attribution, so that the character recognition results are synchronously displayed when the sub-pictures are displayed in a rolling mode.
Fig. 4 is a flowchart of another image-text processing method according to an embodiment of the present application, where the embodiment is based on the foregoing embodiment, and further describes an implementation scheme for semantic clustering of text. The method comprises the following steps:
S410, cutting the target pictures of the graphic mixed arrangement according to the display size of the screen to form at least two sub-pictures;
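The cutting step can be sketched as computing crop intervals of roughly one screen height, with an overlapping redundant area between adjacent slices. `slice_intervals` is a hypothetical helper name; the concrete slice heights and redundancy are parameters, not values fixed by the patent:

```python
def slice_intervals(total_height, screen_height, redundancy):
    """Compute (top, bottom) crop intervals that cut a long picture into
    sub-pictures of roughly one screen height, where adjacent slices
    share an overlapping redundant area of `redundancy` pixels."""
    step = screen_height - redundancy
    tops = range(0, max(total_height - redundancy, 1), step)
    return [(t, min(t + screen_height, total_height)) for t in tops]

# A 300 px tall picture, 120 px screen height, 20 px redundancy:
print(slice_intervals(300, 120, 20))  # [(0, 120), (100, 220), (200, 300)]
```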
S420, performing character recognition on the text in the sub-picture by taking the set unit area as an object to obtain recognition characters in the set unit area;
wherein the set unit area is a row, a column or an area with a set shape and size. Optionally, determining a text box in the sub-picture by a row unit, and performing character recognition.
S430, adjusting the absolute coordinate position of the text recognition result in the sub-picture to be the absolute coordinate position in the target picture according to the position of each sub-picture in the target picture;
S440, clustering at least two set unit areas according to the absolute coordinate position of the text recognition result in the target picture;
Taking a line as the set unit area for explanation: the text recognition result at this point is line-by-line text, and lines whose distance meets the set requirement can be clustered based on the positions of the line texts in the target picture. In general, text at closer distances can be understood as a concentrated expression of one meaning.
Optionally, clustering at least two set unit areas according to the absolute coordinate position of the text recognition result in the target picture includes:
traversing the text recognition results of each set unit area according to the absolute coordinate position of the text recognition result in the target picture from the set starting point position in the target picture;
clustering the set unit areas with the interval distance of the set direction smaller than the distance threshold value; wherein the set direction includes at least one of a lateral direction, a vertical direction, and an oblique direction.
For example, in the above scheme, line texts with smaller interval distance may be clustered. The distance direction may be at least one of a lateral direction, a vertical direction, and an oblique direction. That is, one sentence can be considered to be made up of several lines of the identified text that are closest in line distance.
S450, merging the identification characters in each set unit area of the clusters, and obtaining a character identification result according to the merged identification characters;
For example, the text lines of all OCR results are organized into a KD-tree (K-dimensional tree) according to the restored absolute coordinate positions. The KD-tree is a high-dimensionally indexed tree data structure used for nearest-neighbor and approximate-nearest-neighbor lookups in large-scale high-dimensional data spaces. The set starting point position may be the coordinate origin (0, 0). The point nearest to (0, 0) (for example, the upper-left-corner position of the text box where a text recognition result is located) is found as the first line a of the first sentence S1 (at this moment a is also the last line of S1). Taking a as the current line, the nearest line b is found; if the distance between line a and line b does not exceed the distance threshold t (for example, t = 15 px (pixels), adjustable according to the cartoon font size), line b is also considered to belong to sentence S1, and b becomes the last line of S1. This continues until the line nearest to the last line of sentence S1 is farther away than t, or no lines remain, at which point recognition of sentence S1 is considered finished. The process is repeated until all text line nodes on the KD-tree have been visited, yielding all sentences. All sentences are then ordered (the coordinates of a sentence's first line serve as the coordinates of the sentence), typically from top to bottom and left to right.
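The sentence-grouping procedure above can be sketched as follows. This is a simplified sketch: the patent uses a KD-tree to accelerate the nearest-neighbour queries, whereas this version uses a linear scan for clarity; `cluster_lines` and the line representation are assumptions:

```python
import math

def cluster_lines(lines, t=15):
    """Group OCR line boxes into sentences: start from the line nearest
    the origin, then repeatedly attach the nearest remaining line while
    it lies within the distance threshold t. Each line is ((x, y), text),
    with (x, y) the absolute top-left corner of its text box."""
    remaining = list(lines)
    sentences = []
    while remaining:
        # first line of a new sentence: nearest remaining line to (0, 0)
        cur = min(remaining, key=lambda l: math.hypot(*l[0]))
        remaining.remove(cur)
        sentence = [cur]
        while remaining:
            nxt = min(remaining,
                      key=lambda l: math.dist(l[0], sentence[-1][0]))
            if math.dist(nxt[0], sentence[-1][0]) > t:
                break  # nearest line is too far: the sentence ends here
            remaining.remove(nxt)
            sentence.append(nxt)
        sentences.append(" ".join(text for _, text in sentence))
    return sentences

lines = [((10, 130), "first line"), ((10, 142), "second line"),
         ((10, 300), "far away")]
print(cluster_lines(lines))  # ['first line second line', 'far away']
```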
S460, determining the attribution relation between the clustered text recognition result and the sub-picture according to the absolute coordinate position of the text of the clustered text recognition result in the target picture, and taking the attribution relation as the corresponding position relation.
And the character recognition result is used for displaying the character recognition result according to the corresponding position relation in the process of rolling and displaying the target picture in the screen.
According to the technical scheme, the scattered texts can be clustered, and the display process of the sub-pictures is conveniently matched for centralized display.
Based on the text recognition results and corresponding position relationships determined for the target picture, the results may be summarized and converted into structured data, which may include: the length and width of each sub-picture slice, the text recognition result (caption text or voice audio) and the sub-picture to which it belongs, and the positioning coordinates of the text recognition result in the target picture. The structured data can be stored together with the target picture, or separately, for loading and display by a client.
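One possible shape for such a structured-data record, sketched in Python. All field names here are hypothetical: the patent lists only the kinds of information stored (slice size, owning sub-picture, positioning coordinates, optional audio), not a concrete schema:

```python
import json

# Hypothetical record for one clustered text recognition result.
record = {
    "slice": {"index": 3, "width": 1080, "height": 1920},  # sub-picture size
    "sentence": "watch cartoon",                           # recognized text
    "position": {"x": 10, "y": 130, "w": 120, "h": 20},    # absolute coords
    "audio": "s1.mp3",            # optional speech-form recognition result
}

# Serialized for storage alongside or apart from the target picture:
print(json.dumps(record))
```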
Fig. 5 is a flowchart of an image-text display method according to an embodiment of the present application. The image-text display method provided by the embodiment of the application is applied to the client, and is suitable for the situation that the client displays the processed image-text mixed-arrangement pictures together with the text recognition result. The embodiment can be implemented by a graphic display device, which can be implemented by hardware and/or software, and can be configured in a terminal device as a client or a client plug-in. As shown in fig. 5, the method includes:
S510, loading a target picture of image-text mixed arrangement, a text recognition result of the target picture and a corresponding position relation between the text recognition result and the target picture;
In this embodiment, the text recognition result and the corresponding position relationship may be obtained by the image-text processing method provided in the embodiments of the present application. The text recognition result and the corresponding position relationship may be generated at the server side, or may be generated at the client side.
When the client side needs to display the target picture under the control of the user, the data of the target picture is loaded, and at the moment, the text recognition result and the corresponding position relation can be loaded simultaneously or asynchronously.
And when loading the target pictures of the graphic mixed arrangement through the third party application, invoking a graphic processing plug-in unit to process the target pictures so as to generate a character recognition result of the target pictures and a corresponding position relation between the character recognition result and the target pictures. The client is, for example, a third party application program, and the third party application program may be configured with a plug-in unit capable of performing graphics processing, or the third party application program may call the plug-in unit with graphics processing installed in the terminal to generate a text recognition result and a corresponding position relationship, and display the text recognition result and the corresponding position relationship.
Preferably, during loading of the target picture, the client can scroll and display the already-loaded part of the target picture in the screen of the terminal where the client is located. That is, even before the client has finished loading the target picture, it can already display the loaded part.
S520, displaying the target picture in a rolling mode in a screen of a terminal where the client is located, and displaying the text recognition result according to the corresponding position relation in the process of displaying the target picture in a rolling mode.
The above operations are optionally:
in the process of rolling and displaying the target picture, determining a corresponding character recognition result according to the corresponding position relation;
and displaying the text recognition result in a set subtitle area in the screen and/or playing the text recognition result in a voice form.
In a specific operation process, while the target picture is being scrolled, the positional relationship of each sub-picture, or of an absolute coordinate position in the target picture, relative to the screen can be determined. When a sub-picture or absolute coordinate position meets the set display position condition, the corresponding text recognition result is determined according to the corresponding position relationship and displayed. For example, a sub-picture may be considered to have entered the screen when the area of the sub-picture inside the screen is greater than one half of the sub-picture's area.
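The one-half visibility threshold above can be sketched as follows. `entered_screen` is a hypothetical helper name; positions are vertical pixel coordinates in the target picture, with the viewport given by the current scroll offset:

```python
def entered_screen(sub_top, sub_bottom, view_top, view_bottom, ratio=0.5):
    """Decide whether a sub-picture counts as having entered the screen:
    true when at least `ratio` of its height lies inside the viewport."""
    visible = max(0, min(sub_bottom, view_bottom) - max(sub_top, view_top))
    return visible >= ratio * (sub_bottom - sub_top)

# Sub-picture spans 100-220 px; the viewport currently shows 160-280 px,
# so 60 of its 120 px are visible, exactly meeting the one-half threshold:
print(entered_screen(100, 220, 160, 280))  # True
```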
The text recognition result may be displayed in a set subtitle region, which may or may not overlap with the target picture. For example, for the overlapping case, the text recognition result may be displayed as a bullet screen superimposed on the target picture. At this time, the font size of the characters in the character recognition result can be controlled to be in accordance with the browsing habit of the user and be larger than the set font size threshold.
The text recognition result in the form of voice can also be played, and caption display and voice playback can occur simultaneously. For display purposes, and to create pauses between sentences during speech synthesis, all sentences of each sub-picture are separated by a line feed.
According to the technical scheme, the pictures mixed with the pictures and texts can be automatically cut into long pictures, the texts are automatically identified and encoded into real-time captions, and the positions of the texts are positioned to realize the scheme of automatically displaying the captions and the text contents of the current picture region along with the browsing of the pictures. The technical scheme of the embodiment of the application enables the existing picture-text mixed long pictures, such as cartoon picture resources, to be directly accessed into a client for browsing without manual secondary editing.
Fig. 6 is a block diagram of an image-text processing device according to an embodiment of the present application, where the device may be adapted to the image-text processing method according to the embodiment of the present application, and has corresponding functions and beneficial effects. The device comprises: a graph cutting module 610, a word recognition module 620 and a positional relationship establishing module 630.
The image cutting module 610 is configured to cut the target image of the mixed image and text according to the display size of the screen, so as to form at least two sub-images;
the text recognition module 620 is configured to perform text recognition on the text in the sub-picture to obtain a text recognition result;
a position relationship establishing module 630, configured to establish a corresponding position relationship between a text recognition result and a text position of the text in the target picture; and the character recognition result is used for displaying the character recognition result according to the corresponding position relation in the process of rolling and displaying the target picture in the screen.
Optionally, the graph cutting module is specifically configured to:
and in the loading process of the target picture, cutting the loaded part of the target picture according to the screen display size to form at least two sub-pictures.
Optionally, the graph cutting module is specifically configured to:
cutting the target pictures of the image-text mixed arrangement according to the cut height of the slice determined based on the screen display size to form at least two sub-pictures;
and the slice interception height is a set value which is larger than, equal to or smaller than the screen display size.
Optionally, overlapping redundant areas exist between two adjacent sub-pictures; the redundant area is located at the upper edge and/or the lower edge of the sub-picture.
Optionally, the apparatus further comprises:
and the redundant area determining module is used for determining the size of the redundant area according to the content type and/or the text font size of the target picture before carrying out picture cutting processing on the target picture of the mixed image and text arrangement according to the display size of the screen so as to form at least two sub-pictures.
Optionally, the positional relationship establishing module includes:
the position adjusting unit is used for adjusting the absolute coordinate position of the text recognition result in the sub-picture to be the absolute coordinate position in the target picture according to the position of each sub-picture in the target picture;
the clustering unit is used for clustering the text recognition result according to the absolute coordinate position of the text recognition result in the target picture;
and the attribution determining unit is used for determining attribution relation between the clustered text recognition result and the sub-picture as the corresponding position relation according to the absolute coordinate position of the text of the clustered text recognition result in the target picture.
Optionally, the text recognition module is specifically configured to: performing character recognition on the text in the sub-picture by taking the set unit area as an object to obtain recognition characters in the set unit area; wherein the set unit area is a row, a column or an area with a set shape and size;
correspondingly, the clustering unit comprises:
the region clustering subunit is used for clustering at least two set unit regions according to the absolute coordinate position of the text recognition result in the target picture;
and the character merging subunit is used for merging the identification characters in each set unit area of the clusters and obtaining a character identification result according to the merged identification characters.
Optionally, the area clustering subunit is specifically configured to:
traversing the text recognition results of each set unit area according to the absolute coordinate position of the text recognition result in the target picture from the set starting point position in the target picture;
clustering the set unit areas with the interval distance of the set direction smaller than the distance threshold value; wherein the set direction includes at least one of a lateral direction, a vertical direction, and an oblique direction.
Optionally, the attribution determining unit is specifically configured to:
and determining that the text of the clustered text recognition result belongs to the sub-picture with the largest occupied area according to the absolute coordinate position of the text of the clustered text recognition result in the target picture.
Optionally, the text recognition result is used for displaying the text recognition result belonging to the currently displayed sub-picture when the sub-picture displayed in the target picture is scrolled in the screen.
Optionally, the text recognition module is specifically configured to:
performing character recognition on the text in the sub-picture to obtain recognized characters;
and performing voice conversion on the recognition text to obtain the text recognition result in a voice form.
Optionally, the device is configured at a server or is a graphics processing plug-in configured in a client.
According to the technical scheme, the normalized scheme of cutting pictures, identifying characters and corresponding positions is adopted for various long pictures, the universal applicability is achieved for the long pictures of various contents, the processing amount is low, and the long pictures can be rapidly processed to be provided for users.
Fig. 7 is a block diagram of an image-text display device according to an embodiment of the present application, where the device may be configured at a client, and may implement the image-text display method according to the embodiment of the present application, and has corresponding functions and beneficial effects. The device comprises: the data loading module 710 and the data presentation module 720.
The data loading module 710 is configured to load a target picture of mixed graphics and text, a text recognition result of the target picture, and a corresponding positional relationship between the text recognition result and the target picture;
the data display module 720 is configured to scroll display the target picture on a screen of the terminal where the client is located, and display the text recognition result according to the corresponding position relationship in the process of scroll displaying the target picture.
Optionally, the data loading module is specifically configured to:
and in the process of loading the target picture, rolling and displaying the loaded part of the target picture in a screen of the terminal where the client is located.
Optionally, the data display module is specifically configured to:
in the process of rolling and displaying the target picture, determining a corresponding character recognition result according to the corresponding position relation;
and displaying the text recognition result in a set subtitle area in the screen and/or playing the text recognition result in a voice form.
Optionally, the data loading module is specifically configured to:
and when loading the target pictures of the graphic mixed arrangement through the third party application, invoking a graphic processing plug-in unit to process the target pictures so as to generate a character recognition result of the target pictures and a corresponding position relation between the character recognition result and the target pictures.
According to the technical scheme, pictures of mixed arrangement of pictures and texts can be browsed in a matched mode according to the text recognition result, and browsing requirements of users are met.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a graphic processing method or a graphic presentation method. For example, in some embodiments, the teletext processing method or the teletext presentation method can be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the teletext processing method or the teletext presentation method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a teletext processing method or a teletext presentation method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. An image-text processing method, comprising:
slicing a target picture of mixed image-text content according to a screen display size to form at least two sub-pictures;
performing text recognition on the text in each sub-picture to obtain a text recognition result;
establishing a corresponding positional relationship between the text recognition result and the position of the text in the target picture, wherein the text recognition result is used for being displayed according to the corresponding positional relationship while the target picture is scrolled in the screen;
wherein establishing the corresponding positional relationship between the text recognition result and the position of the text in the target picture comprises: adjusting, according to the position of each sub-picture in the target picture, the absolute coordinate position of the text of the text recognition result in the sub-picture to an absolute coordinate position in the target picture; clustering the text recognition results according to the absolute coordinate positions of the text in the target picture; and determining, according to the absolute coordinate positions of the text of the clustered text recognition results in the target picture, an attribution relationship between the clustered text recognition results and the sub-pictures as the corresponding positional relationship;
wherein performing text recognition on the text in the sub-picture to obtain a text recognition result comprises: performing text recognition on the text in the sub-picture with a set unit area as the object, to obtain recognized text within the set unit area, the set unit area being a row, a column, or an area of a set shape and size;
and wherein clustering the text recognition results according to the absolute coordinate positions of the text in the target picture comprises: clustering at least two set unit areas according to the absolute coordinate positions of the text of the text recognition results in the target picture; and merging the recognized text in the clustered set unit areas to obtain a text recognition result from the merged recognized text.
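As a non-authoritative illustration of the pipeline recited in claim 1 (this sketch is not part of the patent; all function and variable names, thresholds, and the line-level unit area are hypothetical), the following Python code adjusts per-slice recognition coordinates to whole-picture coordinates, clusters unit areas by vertical proximity, and merges the recognized text of each cluster:

```python
# Hypothetical sketch of the claim-1 pipeline: coordinate adjustment,
# clustering of "set unit areas" (here: text lines), and text merging.

def to_absolute(results, slice_offset_y):
    """Shift each line's y-coordinate from sub-picture space to target-picture space."""
    return [{"text": r["text"], "x": r["x"], "y": r["y"] + slice_offset_y}
            for r in results]

def cluster_lines(lines, y_gap=20):
    """Group lines whose vertical separation is below a threshold (one set direction)."""
    lines = sorted(lines, key=lambda r: r["y"])
    clusters = []
    for line in lines:
        if clusters and line["y"] - clusters[-1][-1]["y"] < y_gap:
            clusters[-1].append(line)
        else:
            clusters.append([line])
    return clusters

def merge_cluster(cluster):
    """Combine the recognized text of one cluster into a single recognition result."""
    return " ".join(r["text"] for r in cluster)

# Two sub-pictures, each with line-level OCR output in its own coordinates;
# the second sub-picture starts at y=100 of the target picture.
slice0 = [{"text": "Hello", "x": 0, "y": 10}, {"text": "world", "x": 0, "y": 25}]
slice1 = [{"text": "Next paragraph", "x": 0, "y": 5}]
all_lines = to_absolute(slice0, 0) + to_absolute(slice1, 100)
merged = [merge_cluster(c) for c in cluster_lines(all_lines)]
print(merged)  # → ['Hello world', 'Next paragraph']
```

Lines at y=10 and y=25 fall within the 20-pixel gap of each other and merge into one result, while the line at absolute y=105 forms its own cluster.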
2. The method of claim 1, wherein slicing the target picture of mixed image-text content according to the screen display size to form at least two sub-pictures comprises:
slicing, during loading of the target picture, the already-loaded portion of the target picture according to the screen display size to form at least two sub-pictures.
3. The method of claim 1 or 2, wherein slicing the target picture of mixed image-text content according to the screen display size to form at least two sub-pictures comprises:
slicing the target picture of mixed image-text content according to a slice interception height determined based on the screen display size to form at least two sub-pictures;
wherein the slice interception height is a set value that is greater than, equal to, or smaller than the screen display size.
4. The method of claim 1 or 2, wherein an overlapping redundant area exists between two adjacent sub-pictures, the redundant area being located at the upper edge and/or the lower edge of a sub-picture.
5. The method of claim 4, further comprising, before slicing the target picture of mixed image-text content according to the screen display size to form at least two sub-pictures:
determining the size of the redundant area according to the content type and/or the text font size of the target picture.
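Claims 3–5 describe slicing at a chosen interception height with a redundant overlap between adjacent slices. A minimal sketch of how such boundaries might be computed (hypothetical, not the patented implementation; the 50-pixel overlap is an illustrative value of the kind claim 5 would derive from content type or font size):

```python
def slice_bounds(picture_height, slice_height, overlap):
    """Yield (top, bottom) pixel bounds for each sub-picture.

    Adjacent slices share an `overlap`-pixel redundant area at the upper
    edge of each later slice, so a text line cut by a slice boundary is
    fully contained in at least one slice.
    """
    top = 0
    bounds = []
    while top < picture_height:
        bottom = min(top + slice_height, picture_height)
        bounds.append((max(0, top - overlap), bottom))
        top = bottom
    return bounds

# A 2500px-tall picture, 1000px slice interception height, 50px redundancy:
print(slice_bounds(2500, 1000, 50))  # → [(0, 1000), (950, 2000), (1950, 2500)]
```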
6. The method of claim 1, wherein clustering at least two set unit areas according to the absolute coordinate positions of the text of the text recognition results in the target picture comprises:
traversing the text recognition result of each set unit area according to its absolute coordinate position in the target picture, starting from a set starting point in the target picture;
clustering set unit areas whose separation distance in a set direction is smaller than a distance threshold, wherein the set direction includes at least one of a lateral direction, a vertical direction, and an oblique direction.
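The distance-threshold clustering of claim 6 can be sketched with a union-find pass over the unit areas (a hypothetical simplification, not the patented method: here two areas join one cluster when both their lateral and vertical separations fall under illustrative thresholds):

```python
def cluster_areas(areas, dx=30, dy=15):
    """Union-find clustering of set unit areas given as (x, y) positions.

    Two areas are merged into one cluster when their lateral separation is
    under `dx` and their vertical separation is under `dy` (a simplification
    covering the lateral and vertical set directions of claim 6).
    """
    parent = list(range(len(areas)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(areas)):
        for j in range(i + 1, len(areas)):
            if (abs(areas[i][0] - areas[j][0]) < dx and
                    abs(areas[i][1] - areas[j][1]) < dy):
                union(i, j)

    groups = {}
    for i in range(len(areas)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Areas 0 and 1 are close; area 2 is far away laterally:
print(cluster_areas([(0, 0), (10, 5), (300, 0)]))  # → [[0, 1], [2]]
```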
7. The method of claim 1, wherein determining, according to the absolute coordinate positions of the text of the clustered text recognition results in the target picture, an attribution relationship between the clustered text recognition results and the sub-pictures as the corresponding positional relationship comprises:
determining, according to the absolute coordinate position of the text of the clustered text recognition result in the target picture, that the text of the clustered text recognition result belongs to the sub-picture in which it occupies the largest area.
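The largest-area attribution rule of claim 7 can be illustrated as follows (hypothetical sketch, not the patented implementation; for simplicity only the vertical extent of a text block is considered):

```python
def attribute_to_slice(text_top, text_bottom, slice_bounds):
    """Attribute a text block (absolute vertical extent in the target picture)
    to the index of the sub-picture containing the largest share of it."""
    def overlap(bounds):
        top, bottom = bounds
        return max(0, min(text_bottom, bottom) - max(text_top, top))
    return max(range(len(slice_bounds)), key=lambda i: overlap(slice_bounds[i]))

# A block spanning y=950..1100 over slices (0,1000) and (1000,2000):
# slice 0 holds 50px of it, slice 1 holds 100px, so it belongs to slice 1.
print(attribute_to_slice(950, 1100, [(0, 1000), (1000, 2000)]))  # → 1
```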
8. The method of claim 1, wherein the text recognition result is used for displaying, while a sub-picture of the target picture is scrolled in the screen, the text recognition result attributed to the currently displayed sub-picture.
9. The method of claim 1, wherein performing text recognition on the text in the sub-picture to obtain a text recognition result comprises:
performing text recognition on the text in the sub-picture to obtain recognized text;
and performing speech conversion on the recognized text to obtain the text recognition result in speech form.
10. The method of claim 1, wherein the method is executed by an image-text processing plug-in configured in a server or a client.
11. An image-text presentation method applied to a client, the method comprising:
loading a target picture of mixed image-text content, a text recognition result of the target picture, and a corresponding positional relationship between the text recognition result and the target picture;
scrolling the target picture in a screen of the terminal where the client is located, and displaying, during the scrolling of the target picture, the text recognition result according to the corresponding positional relationship;
wherein loading the target picture of mixed image-text content, the text recognition result of the target picture, and the corresponding positional relationship between the text recognition result and the target picture comprises: when the target picture of mixed image-text content is loaded through a third-party application, invoking an image-text processing plug-in to process the target picture so as to generate the text recognition result of the target picture and the corresponding positional relationship between the text recognition result and the target picture;
wherein invoking the image-text processing plug-in to process the target picture so as to generate the text recognition result of the target picture and the corresponding positional relationship between the text recognition result and the target picture comprises: adjusting, according to the position of each sub-picture in the target picture, the absolute coordinate position of the text of the text recognition result in the sub-picture to an absolute coordinate position in the target picture; clustering the text recognition results according to the absolute coordinate positions of the text in the target picture; and determining, according to the absolute coordinate positions of the text of the clustered text recognition results in the target picture, an attribution relationship between the clustered text recognition results and the sub-pictures as the corresponding positional relationship.
12. The method of claim 11, wherein scrolling the target picture in the screen of the terminal where the client is located comprises:
scrolling, during loading of the target picture, the already-loaded portion of the target picture in the screen of the terminal where the client is located.
13. The method of claim 11, wherein displaying, during the scrolling of the target picture, the text recognition result according to the corresponding positional relationship comprises:
determining, during the scrolling of the target picture, the corresponding text recognition result according to the corresponding positional relationship;
and displaying the text recognition result in a set subtitle area of the screen and/or playing the text recognition result in speech form.
14. An image-text processing apparatus, comprising:
a picture slicing module configured to slice a target picture of mixed image-text content according to a screen display size to form at least two sub-pictures;
a text recognition module configured to perform text recognition on the text in the sub-pictures to obtain a text recognition result;
a positional relationship establishing module configured to establish a corresponding positional relationship between the text recognition result and the position of the text in the target picture, wherein the text recognition result is used for being displayed according to the corresponding positional relationship while the target picture is scrolled in the screen;
wherein the positional relationship establishing module comprises: a position adjusting unit configured to adjust, according to the position of each sub-picture in the target picture, the absolute coordinate position of the text of the text recognition result in the sub-picture to an absolute coordinate position in the target picture; a clustering unit configured to cluster the text recognition results according to the absolute coordinate positions of the text in the target picture; and an attribution determining unit configured to determine, according to the absolute coordinate positions of the text of the clustered text recognition results in the target picture, an attribution relationship between the clustered text recognition results and the sub-pictures as the corresponding positional relationship;
wherein the text recognition module is further configured to perform text recognition on the text in the sub-picture with a set unit area as the object to obtain recognized text within the set unit area, the set unit area being a row, a column, or an area of a set shape and size;
and wherein the clustering unit comprises: an area clustering subunit configured to cluster at least two set unit areas according to the absolute coordinate positions of the text of the text recognition results in the target picture; and a text merging subunit configured to merge the recognized text in the clustered set unit areas and obtain a text recognition result from the merged recognized text.
15. An image-text presentation apparatus configured at a client, the apparatus comprising:
a data loading module configured to load a target picture of mixed image-text content, a text recognition result of the target picture, and a corresponding positional relationship between the text recognition result and the target picture;
a data presentation module configured to scroll the target picture in a screen of the terminal where the client is located and to display, during the scrolling of the target picture, the text recognition result according to the corresponding positional relationship;
wherein the data loading module is configured to invoke, when the target picture of mixed image-text content is loaded through a third-party application, an image-text processing plug-in to process the target picture so as to generate the text recognition result of the target picture and the corresponding positional relationship between the text recognition result and the target picture;
and wherein the data loading module comprises: a position adjusting unit configured to adjust, according to the position of each sub-picture in the target picture, the absolute coordinate position of the text of the text recognition result in the sub-picture to an absolute coordinate position in the target picture; a clustering unit configured to cluster the text recognition results according to the absolute coordinate positions of the text in the target picture; and an attribution determining unit configured to determine, according to the absolute coordinate positions of the text of the clustered text recognition results in the target picture, an attribution relationship between the clustered text recognition results and the sub-pictures as the corresponding positional relationship.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image-text processing method of any one of claims 1-10 or the image-text presentation method of any one of claims 11-13.
17. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image-text processing method of any one of claims 1-10 or the image-text presentation method of any one of claims 11-13.
CN202110276188.2A 2021-03-15 2021-03-15 Image-text processing method, image-text processing display method, image-text processing device, image-text processing equipment and storage medium Active CN112882678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110276188.2A CN112882678B (en) 2021-03-15 2021-03-15 Image-text processing method, image-text processing display method, image-text processing device, image-text processing equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112882678A (en) 2021-06-01
CN112882678B (en) 2024-04-09

Family

ID=76042643


Country Status (1)

Country Link
CN (1) CN112882678B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436604B (en) * 2021-06-22 2022-11-29 北京百度网讯科技有限公司 Method and device for broadcasting content, electronic equipment and storage medium
CN115035360B (en) * 2021-11-22 2023-04-07 荣耀终端有限公司 Character recognition method for image, electronic device and storage medium
TWI799275B (en) * 2022-05-25 2023-04-11 瑞昱半導體股份有限公司 System on chip and display system
CN114998885A (en) * 2022-06-23 2022-09-02 小米汽车科技有限公司 Page data processing method and device, vehicle and storage medium
CN115171110B (en) * 2022-06-30 2023-08-22 北京百度网讯科技有限公司 Text recognition method and device, equipment, medium and product

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268185A (en) * 2013-04-26 2013-08-28 珠海全志科技股份有限公司 Text display method and text display device for e-book reader
CN105404686A (en) * 2015-12-10 2016-03-16 湖南科技大学 Method for matching place name and address in news event based on geographical feature hierarchical segmented words
CN106484266A (en) * 2016-10-18 2017-03-08 北京锤子数码科技有限公司 A kind of text handling method and device
CN108156209A (en) * 2016-12-06 2018-06-12 腾讯科技(北京)有限公司 A kind of media push method and system
CN108921855A (en) * 2018-05-31 2018-11-30 上海爱优威软件开发有限公司 Image processing method and system based on information
CN109658427A (en) * 2017-10-11 2019-04-19 中兴通讯股份有限公司 Image processing method and device
CN111586237A (en) * 2020-04-30 2020-08-25 维沃移动通信有限公司 Image display method and electronic equipment
CN111783645A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Character recognition method and device, electronic equipment and computer readable storage medium
CN111832551A (en) * 2020-07-15 2020-10-27 网易有道信息技术(北京)有限公司 Text image processing method and device, electronic scanning equipment and storage medium
CN112199545A (en) * 2020-11-23 2021-01-08 湖南蚁坊软件股份有限公司 Keyword display method and device based on picture character positioning and storage medium
WO2021042505A1 (en) * 2019-09-03 2021-03-11 平安科技(深圳)有限公司 Note generation method and apparatus based on character recognition technology, and computer device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2728326A1 (en) * 2008-06-30 2010-01-07 Thomson Licensing Method and apparatus for dynamic displays for digital cinema
US8400453B2 (en) * 2011-06-30 2013-03-19 Google Inc. Rendering a text image following a line



Similar Documents

Publication Publication Date Title
CN112882678B (en) Image-text processing method, image-text processing display method, image-text processing device, image-text processing equipment and storage medium
JP2019164801A (en) Managing real-time handwriting recognition
US9946690B2 (en) Paragraph alignment detection and region-based section reconstruction
CN111586237B (en) Image display method and electronic equipment
EP3901786A1 (en) Meme generation method and apparatus, and device and medium
US20210350541A1 (en) Portrait extracting method and apparatus, and storage medium
US20230020022A1 (en) Method of recognizing text, device, storage medium and smart dictionary pen
CN112532785B (en) Image display method, image display device, electronic apparatus, and storage medium
EP3961433A2 (en) Data annotation method and apparatus, electronic device and storage medium
CN112989112B (en) Online classroom content acquisition method and device
CN113392660A (en) Page translation method and device, electronic equipment and storage medium
CN109710783B (en) Picture loading method and device, storage medium and server
CN116259064A (en) Table structure identification method, training method and training device for table structure identification model
CN113873323B (en) Video playing method, device, electronic equipment and medium
CN113362426B (en) Image editing method and image editing device
CN115719356A (en) Image processing method, apparatus, device and medium
CN114979051A (en) Message processing method and device, electronic equipment and storage medium
KR101935926B1 (en) Server and method for webtoon editing
JP2003131782A (en) Information processing device
CN113360636B (en) Content display method, device, equipment and storage medium
CN116610244A (en) Thumbnail display control method, device, equipment and storage medium
CN112732156A (en) Information display method and device and electronic equipment
CN113642469A (en) Lip motion detection method, device, equipment and storage medium
CN117173287A (en) Advertisement recommendation method and device, electronic equipment and storage medium
CN113778306A (en) Table data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant