CN115171110B - Text recognition method and device, equipment, medium and product - Google Patents


Info

Publication number
CN115171110B
Authority
CN
China
Prior art keywords
text
image
patch
effective
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210776958.4A
Other languages
Chinese (zh)
Other versions
CN115171110A (en)
Inventor
章成全
乔美娜
吕鹏原
刘珊珊
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210776958.4A
Publication of CN115171110A
Application granted
Publication of CN115171110B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/1444: Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G06V30/18: Extraction of features or characteristics of the image

Abstract

The disclosure provides a text recognition method, apparatus, device, medium, and product, relating to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and applicable to scenes such as OCR. The implementation includes the following steps: determining a text line to be recognized in a first text image of an image sequence; segmenting the text line to obtain candidate text segments; determining second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence; and recognizing the second valid text segments to obtain a text recognition result for the first text image.

Description

Text recognition method and device, equipment, medium and product
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of deep learning, image processing, and computer vision, and is applicable to scenes such as OCR (Optical Character Recognition).
Background
Text recognition is widely used in computer vision, image processing, digital media technology, intelligent translation, autonomous driving, and other scenarios. However, in some scenarios the recognition quality is poor and the recognition latency is high.
Disclosure of Invention
The present disclosure provides a text recognition method and apparatus, device, medium, and product.
According to an aspect of the present disclosure, there is provided a text recognition method including: determining a text line to be recognized in a first text image of an image sequence; segmenting the text line to be recognized to obtain candidate text segments; determining second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence; and recognizing the second valid text segments to obtain a text recognition result of the first text image.
According to another aspect of the present disclosure, there is provided a text recognition apparatus including: a text-line determining module, configured to determine a text line to be recognized in a first text image of an image sequence; a candidate-text-segment determining module, configured to segment the text line to be recognized to obtain candidate text segments; a second-valid-text-segment determining module, configured to determine second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence; and a text recognition module, configured to recognize the second valid text segments to obtain a text recognition result of the first text image.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text recognition method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described text recognition method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program stored on a readable storage medium and/or in an electronic device, where the computer program, when executed by a processor, implements the text recognition method described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture of a text recognition method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a text recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a text recognition method according to yet another embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of a text recognition process according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of a text recognition device according to an embodiment of the present disclosure;
fig. 6 schematically illustrates a block diagram of an electronic device for text recognition according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, or A, B, and C together).
Embodiments of the present disclosure provide a text recognition method. The method includes the following steps: determining a text line to be recognized in a first text image of an image sequence, segmenting the text line to obtain candidate text segments, determining second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence, and recognizing the second valid text segments to obtain a text recognition result of the first text image.
Fig. 1 schematically illustrates a system architecture of a text recognition method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
The system architecture 100 according to this embodiment may include a requesting terminal 101, a network 102, and a server 103. The network 102 is used as a medium for providing a communication link between the requesting terminal 101 and the server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others. The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, network service, and middleware service.
The requesting terminal 101 interacts with the server 103 through the network 102 to receive or send data. The requesting terminal 101 is used, for example, to initiate a text recognition request to the server 103 and to transmit the image sequence to be recognized, where the image sequence comprises multiple frames of text images with a temporal relationship.
The server 103 may be a server providing various services, and may be, for example, a background processing server (merely an example) that performs text recognition processing in accordance with a text recognition request transmitted by the requesting terminal 101.
For example, in response to a text recognition request obtained from the requesting terminal 101, the server 103 determines a text line to be recognized in a first text image of the image sequence, segments the text line to obtain candidate text segments, determines second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence, and recognizes the second valid text segments to obtain a text recognition result of the first text image.
It should be noted that the text recognition method provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the text recognition apparatus provided by the embodiments of the present disclosure may be provided in the server 103. The text recognition method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103. Accordingly, the text recognition device provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103.
It should be understood that the number of requesting terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of requesting terminals, networks, and servers, as desired for implementation.
The embodiment of the present disclosure provides a text recognition method, and a text recognition method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 4 in conjunction with the system architecture of fig. 1. The text recognition method of the embodiments of the present disclosure may be performed by the server 103 shown in fig. 1, for example.
Fig. 2 schematically illustrates a flow chart of a text recognition method according to an embodiment of the present disclosure.
As shown in fig. 2, the text recognition method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S240.
In operation S210, a text line to be recognized is determined for a first text image in an image sequence.
In operation S220, the text line to be recognized is segmented to obtain candidate text segments.
In operation S230, second valid text segments among the candidate text segments are determined according to first valid text segments in a second text image adjacent to the first text image in the image sequence.
In operation S240, the second valid text segments are recognized to obtain a text recognition result of the first text image.
A text line to be recognized is determined for a first text image in the image sequence, and the text line is segmented at a preset pixel scale to obtain candidate text segments, where each candidate text segment corresponds to a text image area containing at least part of a character.
Second valid text segments among the candidate text segments are then determined according to the first valid text segments in a second text image adjacent to the first text image in the image sequence. The second text image may be, for example, the text image of the frame preceding the first text image. For example, first valid text segments that reappear among the candidate text segments may be identified and deleted from the candidate text segments to obtain the second valid text segments. The second valid text segments are then recognized to obtain the text recognition result of the first text image.
By screening the second valid text segments in the first text image and recognizing only those segments, the method can effectively improve the efficiency and precision of text image recognition and improve its real-time performance.
The following illustrates respective operation example flows of the text recognition method of the present embodiment.
Operation S210 shown in fig. 2 further includes: performing text detection on the first text image to obtain a text detection result, where the result includes bounding box coordinate information that delimits a text image area in the first text image, and determining the text line to be recognized according to the coordinate information. The text image area may be, for example, the image area in which characters are located.
Character types may include, for example, standard character types and custom character types; standard character types may include simplified Chinese characters, traditional Chinese characters, English words, numbers, and the like. Accordingly, text detection may cover Chinese single-character detection, Chinese phrase detection, English letter detection, English word detection, and so on. Text detection may be implemented, for example, with the EAST (Efficient and Accurate Scene Text) detection algorithm or with Faster R-CNN (Faster Region-based Convolutional Neural Network), which is not limited in this embodiment.
In one example, before text detection, the first text image may be preprocessed by graying, binarization, noise reduction, tilt correction, text segmentation, and the like. The first text image may also undergo display enhancement, which may include, for example, contrast enhancement and sharpening. For example, display enhancement may be achieved by adjusting the contrast, brightness, and sharpening parameters of the first text image.
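The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the mean-based threshold, and the contrast formula are all assumptions standing in for whatever graying, enhancement, and binarization methods the embodiment actually uses.

```python
import numpy as np

def preprocess_text_image(rgb: np.ndarray, contrast: float = 1.2,
                          brightness: float = 0.0) -> np.ndarray:
    """Grayscale, contrast/brightness enhancement, and binarization (sketch)."""
    # Grayscale via the standard luminance weights.
    gray = rgb @ np.array([0.299, 0.587, 0.114])
    # Simple display enhancement: scale contrast around the mean, add brightness.
    mean = gray.mean()
    enhanced = np.clip((gray - mean) * contrast + mean + brightness, 0, 255)
    # Global mean threshold as a stand-in for a real binarization method.
    return (enhanced > enhanced.mean()).astype(np.uint8) * 255

img = np.random.randint(0, 256, (32, 128, 3)).astype(np.float64)
binary = preprocess_text_image(img)  # binary image, same height/width as input
```

In practice one would use a library routine (e.g. an Otsu threshold) rather than the global mean, but the pipeline order (gray, enhance, binarize) is the same.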
The text line to be recognized may be determined from the bounding box coordinate information in the text detection result. A bounding box may be, for example, a rectangular box enclosing a text image area, and its coordinate information may include the coordinates of the rectangle's vertices. For example, the text line to be recognized may be determined from bounding boxes meeting a preset location condition. Text lines to be recognized may be arranged horizontally, vertically, or in any direction, which is not limited in this embodiment.
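The "preset location condition" is not specified, but one plausible instance is grouping boxes whose vertical centers lie close together into a single horizontal line. The sketch below assumes that condition; the function name and tolerance are hypothetical.

```python
def boxes_to_text_line(boxes, y_center, tol=10):
    """Select bounding boxes whose vertical center lies within `tol` of
    `y_center` and order them left-to-right to form one text line.

    Boxes are (x1, y1, x2, y2) tuples; this is one hypothetical location
    condition, not the one defined in the embodiment.
    """
    line = [b for b in boxes if abs((b[1] + b[3]) / 2 - y_center) <= tol]
    return sorted(line, key=lambda b: b[0])  # reading order along x

boxes = [(50, 8, 70, 28), (5, 10, 25, 30), (30, 100, 50, 120)]
line = boxes_to_text_line(boxes, y_center=20)  # third box belongs to another line
```

A vertical text line would use the symmetric condition on horizontal centers and sort along y instead.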
Operation S220 shown in fig. 2 further includes: segmenting the text line to be recognized at a preset pixel scale to obtain candidate text segments, where each candidate text segment corresponds to a text image area containing at least part of a character. A candidate text segment may correspond to the text image area of a single character or of part of a character.
Obtaining candidate text segments by cutting the text line to be recognized makes it possible to accurately distinguish repeated text areas within the line, which improves both the accuracy and the computational efficiency of image sequence recognition.
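Segmenting at a "preset pixel scale" can be read as slicing the text-line crop into fixed-width windows. The sketch below makes that reading concrete; the segment width of 16 pixels is an arbitrary illustrative value, not one given in the patent.

```python
import numpy as np

def split_text_line(line_img: np.ndarray, seg_width: int = 16) -> list:
    """Split a text-line crop into fixed-width candidate segments.

    The last segment may be narrower when the line width is not a
    multiple of `seg_width`.
    """
    h, w = line_img.shape[:2]
    return [line_img[:, x:x + seg_width] for x in range(0, w, seg_width)]

line = np.zeros((32, 100), dtype=np.uint8)     # a 32x100 text-line crop
segments = split_text_line(line, seg_width=16)  # 6 full segments + 4px remainder
```

Each slice keeps its column offset implicitly (index times `seg_width`), which is what the later coordinate-ordered matching relies on.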
Second valid text segments among the candidate text segments are determined according to the first valid text segments in the second text image. For example, first valid text segments that reappear among the candidate text segments may be identified from the first valid text segments in the second text image and deleted from the candidate text segments to obtain the second valid text segments.
The second valid text segments are recognized to obtain the text recognition result of the first text image. As an example, when there is more than one second valid text segment, the recognition results associated with each second valid text segment may be combined according to the positional relationship between the segments to obtain the text recognition result of the first text image.
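Combining per-segment results "according to the positional relationship" amounts to concatenating the recognized strings in reading order. A minimal sketch, assuming a horizontal line and using each segment's left x-coordinate as the ordering key:

```python
def merge_segment_results(results):
    """Combine per-segment recognition strings in reading order.

    `results` is a list of (x_left, text) pairs; sorting by the left
    x-coordinate realizes the positional-relationship merge described above.
    """
    return "".join(text for x, text in sorted(results, key=lambda r: r[0]))

# Segments may arrive out of order; the merge restores reading order.
merged = merge_segment_results([(40, "world"), (0, "hello "), (20, "there ")])
```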
According to this embodiment of the disclosure, a text line to be recognized is determined for a first text image in an image sequence; the text line is segmented to obtain candidate text segments; second valid text segments among the candidate text segments are determined according to the first valid text segments in a second text image adjacent to the first text image in the image sequence; and the second valid text segments are recognized to obtain the text recognition result of the first text image. This can effectively improve the efficiency and precision of text image recognition, helps improve its real-time performance, and reduces the computing-power requirements on the recognition hardware.
Fig. 3 schematically illustrates a flow chart of a text recognition method according to another embodiment of the present disclosure.
As shown in fig. 3, the text recognition method 300 of the embodiment of the present disclosure may include, for example, operations S310 to S340.
In operation S310, for a first text image in an image sequence, a text line to be recognized in the first text image is determined.
In operation S320, the text line to be recognized is segmented to obtain candidate text segments.
In operation S330, first valid text segments that appear repeatedly among the candidate text segments are determined as third valid text segments, and the third valid text segments are deleted from the candidate text segments to obtain second valid text segments.
In operation S340, the second valid text segments are recognized to obtain a text recognition result for the first text image.
It is understood that operation S330 is a further extension to operation S230. Operations S310, S320 and S340 are similar to operations S210, S220 and S240, respectively, and are not repeated for brevity.
The following illustrates respective operation example flows of the text recognition method of the present embodiment.
Operation S330 shown in fig. 3 further includes: determining first image features of the first valid text segments and second image features of the candidate text segments; obtaining matching calculation results between the first and second image features; and determining the third valid text segments according to the matching calculation results, where the feature matching degree indicated by the matching result for a third valid text segment is higher than a preset threshold.
Determining and deleting the repeated first valid text segments improves the accuracy and computational efficiency of text recognition and helps improve its real-time performance.
As an example, feature extraction may be performed on the text line to be recognized in the first text image to obtain its initial image features. The attention area of the text line may be located according to the text detection result for the first text image, yielding attention image features that contain the probability of a character appearing at each position. The initial image features and the attention image features are fused to obtain fused image features of the text line to be recognized.
When there is more than one candidate text segment, the fused image feature associated with each candidate text segment is extracted, based on the fused image features of the text line and the position of each candidate segment, as that segment's second image feature.
When performing the feature matching calculation on a first image feature and a second image feature, a fully connected layer of the trained text recognition model may, for example, output a similarity score between the two features as the matching result. For instance, the two features may be subtracted element-wise at the same positions to produce an intermediate feature vector; the absolute value of the intermediate vector is taken, and the fully connected layer outputs the similarity score from the absolute-valued vector.
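The subtract, absolute-value, fully-connected pipeline described above can be sketched in a few lines. This is an illustrative stand-in, not the trained model: the single linear layer with random weights and the sigmoid output are assumptions replacing the model's learned fully connected layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(f1: np.ndarray, f2: np.ndarray, w: np.ndarray, b: float) -> float:
    """Element-wise difference -> absolute value -> linear layer -> score."""
    diff = np.abs(f1 - f2)   # subtraction of same-position elements, then |.|
    logit = float(diff @ w + b)  # stand-in for the model's fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid squashes to a (0, 1) score

d = 64
w, b = rng.normal(size=d), 0.0
f = rng.normal(size=d)
same = similarity(f, f, w, b)        # identical features: |diff| = 0, score = sigmoid(b)
close = similarity(f, f + 0.01, w, b)
```

With identical inputs the difference vector is all zeros, so the score depends only on the bias; a trained layer would push scores for matching pairs toward 1 and non-matching pairs toward 0.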
Operation S330 shown in fig. 3 further includes: when the second text image contains M first valid text segments ordered by segment coordinates and the first text image contains N candidate text segments ordered by segment coordinates, obtaining matching calculation results between the first image feature of the M-th first valid text segment and the second image features of the first N candidate text segments.
If a matching result indicates that the feature matching degree between the M-th first valid text segment and the n-th candidate text segment is higher than a preset threshold, it is determined whether the feature matching degrees between the first n-1 candidate text segments and the corresponding m-th first valid text segments are also higher than the preset threshold; if so, the first n candidate text segments are taken as third valid text segments. Here M and N are integers greater than 1, n is an integer with 1 ≤ n ≤ N, and m is a positive integer with m ∈ {M-(n-1), …, M-1}.
This helps quickly screen out the repeated first valid text segments among the candidates, can effectively improve the accuracy and efficiency of text recognition, helps reduce the computing-power requirements on recognition hardware, supports a low-power, real-time text recognition scheme, and can provide trusted data support for application scenarios such as computer vision, image processing, digital media technology, intelligent translation, and autonomous driving.
Segment coordinates may be, for example, the coordinates of the bounding box surrounding the text image area of the segment; the bounding box coordinate information may include the coordinates of the box's vertices. For example, a feature matching calculation may be performed on the first image feature of the M-th first valid text segment and the second image feature of the 1st candidate text segment to obtain a matching result. If the result indicates that the feature matching degree between the M-th first valid text segment and the 1st candidate text segment is higher than the preset threshold, the 1st candidate text segment is determined to be a repeated first valid text segment and is deleted from the text line to be recognized.
The image features may include, for example, color features, texture features, gray features, edge features, etc., which are not limited in this embodiment.
If the feature matching degree between the M-th first valid text segment and the 1st candidate text segment is lower than or equal to the preset threshold, the feature matching degrees between the M-th first valid text segment and the subsequent candidate text segments are determined in turn until the matching degree with the n-th candidate text segment exceeds the threshold, where n is an integer greater than 1 and at most N.
In response to the feature matching degree between the M-th first valid text segment and the n-th candidate text segment being higher than the preset threshold, it is determined whether the feature matching degrees between the first n-1 candidate text segments in the first text image and the n-1 first valid text segments preceding the M-th first valid text segment are also higher than the preset threshold. If they are, the first n candidate text segments in the first text image are taken as the repeatedly appearing first valid text segments, i.e., as the third valid text segments.
The third valid text segments are deleted from the candidate text segments to obtain the second valid text segments, and the second valid text segments are recognized to obtain the text recognition result for the first text image.
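The screening procedure above can be sketched as a short loop. This is a pure-Python illustration under simplifying assumptions: `match` is a hypothetical stand-in for the learned similarity head, the features are toy scalars, and the threshold value is arbitrary.

```python
def drop_repeated_prefix(valid_feats, cand_feats, match, threshold=0.8):
    """Drop leading candidates that repeat the tail of the previous frame's
    valid segments, returning the surviving (second valid) candidates.

    `match(a, b)` returns a feature-matching score; here it stands in for
    the model's similarity computation.
    """
    M = len(valid_feats)
    for n in range(1, len(cand_feats) + 1):
        # Does the M-th (last) valid segment match the n-th candidate?
        if match(valid_feats[M - 1], cand_feats[n - 1]) > threshold:
            # Verify candidates 1..n-1 match valid segments M-(n-1)..M-1.
            prior = valid_feats[M - n:M - 1]
            if all(match(v, c) > threshold
                   for v, c in zip(prior, cand_feats[:n - 1])):
                return cand_feats[n:]  # first n candidates are repeats
    return cand_feats  # no overlap found; keep every candidate

# Toy scalar "features" with exact equality as the matching score.
score = lambda a, b: 1.0 if a == b else 0.0
valid = [10, 20, 30]      # previous frame: M = 3 first valid segments
cands = [20, 30, 99, 77]  # the first two repeat the tail of `valid`
kept = drop_repeated_prefix(valid, cands, score)
```

The key idea is that consecutive frames overlap at their boundary, so the repeated prefix of the new frame's candidates is matched against the suffix of the previous frame's valid segments and removed before recognition.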
Operation S340 shown in fig. 3 further includes: performing serialization coding on the third image feature of the second effective text patch to obtain a basic coding sequence; adding first direction information to the basic coding sequence to obtain a first coding sequence; adding second direction information to the basic coding sequence to obtain a second coding sequence; and performing text recognition based on the first coding sequence and the second coding sequence to obtain a text recognition result. The first direction information indicates the same direction as the distribution direction of the second effective text patch, and the second direction information indicates the direction opposite to the distribution direction.
By adding the first direction information and the second direction information to the basic coding sequence, the recognition accuracy of text image recognition can be effectively improved, and trusted data support can be provided for applications such as text recognition in autonomous driving scenes, photo translation, smart-retail commodity inspection, smart translation pens, and education tablets.
For example, the encoder of the text recognition model may be used to perform serialization coding on the third image feature of the second effective text patch to obtain the basic coding sequence. The encoder may be implemented, for example, by a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU) network, which is not limited in this embodiment.
The first direction information may be added to the basic coding sequence to obtain the first coding sequence, and the second direction information may be added to the basic coding sequence to obtain the second coding sequence. The first direction information indicates the same direction as the distribution direction of the second effective text patch, and the second direction information indicates the direction opposite to the first direction information. For example, where the distribution direction of the second effective text patch is from left to right, the first direction information indicates the left-to-right direction and the second direction information indicates the right-to-left direction.
The decoder of the text recognition model may be used to decode the first coding sequence to obtain a first text recognition result based on the first coding sequence, and to decode the second coding sequence to obtain a second text recognition result based on the second coding sequence. The decoder may be implemented, for example, by a Transformer decoder, an attention mechanism, or the like, which is not limited in this embodiment.
The text recognition result for the first text image may be obtained based on the first text recognition result and the second text recognition result. The first text recognition result and the second text recognition result may each indicate, for example, a text probability corresponding to the second effective text patch, and the text recognition result corresponding to the larger text probability may be taken as the text recognition result for the first text image.
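The bidirectional encode-and-decode step can be illustrated with a small sketch. Appending the direction information as an extra feature column, and the `decode` callable standing in for the model's decoder, are both assumptions made for illustration; as described above, the actual encoder could be an LSTM/GRU and the decoder a Transformer or attention mechanism.

```python
import numpy as np

def add_direction(base_seq, direction):
    # Append direction information to each time step of the basic coding
    # sequence: +1.0 for the distribution direction, -1.0 for the reverse
    # (this flag encoding is an assumption for illustration).
    flag = np.full((base_seq.shape[0], 1), direction)
    return np.concatenate([base_seq, flag], axis=1)

def recognize_bidirectional(base_seq, decode):
    # decode(seq) -> (text, probability); stands in for the model's decoder.
    fwd_text, fwd_p = decode(add_direction(base_seq, +1.0))   # first coding sequence
    bwd_text, bwd_p = decode(add_direction(base_seq, -1.0))   # second coding sequence
    # Keep the recognition result with the larger text probability.
    return fwd_text if fwd_p >= bwd_p else bwd_text
```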
In the case that the image sequence comprises at least two text images, the text recognition results associated with the at least two text images may be combined according to a temporal relationship between the at least two text images in the image sequence, resulting in a text recognition result for the image sequence.
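The per-frame combination can be sketched as follows; representing each frame's result as a `(timestamp, text)` pair and joining by simple concatenation are assumptions for illustration, since the source does not fix the combination rule beyond the temporal ordering.

```python
def combine_results(frame_results):
    """Combine text recognition results of the text images in an image
    sequence according to their temporal relationship.

    frame_results: list of (timestamp, text) pairs, one per text image.
    """
    ordered = sorted(frame_results, key=lambda tp: tp[0])  # temporal order
    return "".join(text for _, text in ordered)
```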
By deleting the repeated first effective text patches from the candidate text patches to obtain the second effective text patches, and identifying the second effective text patches to obtain the text recognition result for the first text image, the recognition accuracy of text image recognition can be effectively improved, the interference of repeated text patches with the text recognition result can be effectively reduced, and the computing resources consumed by repeated text patches can be effectively saved. The method can also reduce the field-of-view requirement on the image capture device and the computing-capability requirement on the recognition hardware, which is conducive to a lightweight, low-power, real-time text recognition scheme and to providing diversified product forms for scenarios such as mobile terminals and intelligent hardware.
Fig. 4 schematically illustrates a schematic diagram of a text recognition process according to an embodiment of the present disclosure.
As shown in fig. 4, a first text image 401 in the image sequence is detected to obtain a text line to be identified in the first text image 401 (the text line to be identified may be, for example, "companion security controllable" in the image 401). The text line to be identified is segmented to obtain candidate text patches 402, which may be represented, for example, by the dashed rectangular boxes in the image 402.
The second text image 403 adjacent to the first text image 401 may be, for example, the text image of the frame preceding the first text image 401. Based on the first effective text patches 404 in the second text image 403, the repeated first effective text patches are determined among the candidate text patches 402 to obtain third effective text patches (the third effective text patches may include, for example, the first 3 candidate text patches among the candidate text patches 402). The third effective text patches are deleted from the candidate text patches 402 to obtain second effective text patches 405.
In one example, a first image feature of the first effective text patch and a second image feature of the candidate text patch may be determined, and a feature matching calculation may be performed based on the first image feature and the second image feature to obtain a matching calculation result. A candidate text patch whose feature matching degree, as indicated by the matching calculation result, is higher than the preset threshold is taken as a third effective text patch. Each text patch may correspond, for example, to the text image area in which a single character or part of a character is located.
The second effective text patches 405 are identified to obtain a text recognition result 406 for the first text image 401 (the text recognition result may be, for example, "fully controllable").
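The flow of fig. 4, end to end, can be summarized in one small sketch in which each stage is an injected callable standing in for the corresponding model component (detection, segmentation, de-duplication, recognition); the function names and signatures are hypothetical.

```python
def recognize_frame(frame_img, prev_effective, detect, split, dedup, recognize):
    """One-frame pipeline mirroring fig. 4.

    detect(frame_img)            -> text line to be identified (image 401)
    split(line)                  -> candidate text patches (402)
    dedup(prev_effective, cands) -> second effective text patches (405)
    recognize(patches)           -> text recognition result (406)
    """
    line = detect(frame_img)
    candidates = split(line)
    effective = dedup(prev_effective, candidates)
    # Return the effective patches too, so the next frame can de-duplicate
    # against them the way image 401 is checked against image 403 above.
    return recognize(effective), effective
```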
By screening the second effective text patches in the first text image and identifying only those patches, the recognition accuracy and efficiency of text image recognition can be effectively improved, the text image recognition performance can be effectively improved, and the diversified product-form requirements of scenarios such as mobile terminals and intelligent hardware can be better met.
Fig. 5 schematically illustrates a block diagram of a text recognition device according to an embodiment of the present disclosure.
As shown in fig. 5, the text recognition apparatus 500 of the embodiment of the present disclosure includes, for example, a text line to be recognized determination module 510, a candidate text patch determination module 520, a second effective text patch determination module 530, and a text recognition module 540.
A text line to be identified determining module 510, configured to determine a text line to be identified of a first text image in the image sequence; the candidate text segment determining module 520 is configured to segment a text line to be identified to obtain a candidate text segment; a second effective text patch determination module 530, configured to determine a second effective text patch in the candidate text patches according to a first effective text patch in a second text image adjacent to the first text image in the image sequence; and a text recognition module 540, configured to recognize the second effective text segment, and obtain a text recognition result of the first text image.
A text recognition result for the first text image is obtained by screening the second effective text patches in the first text image and identifying them. This can effectively improve the recognition efficiency and accuracy of text image recognition, improve its real-time performance, and reduce the computing-capability requirement on the recognition hardware.
According to an embodiment of the present disclosure, the second active text segment determination module includes: the third effective text patch determining submodule is used for determining the repeated first effective text patches in the candidate text patches to obtain third effective text patches; and a third effective text segment deleting sub-module, configured to delete the third effective text segment from the candidate text segments, to obtain a second effective text segment.
According to an embodiment of the present disclosure, the third active text segment determination submodule includes: an image feature determining unit for determining a first image feature of the first effective text segment and a second image feature of the candidate text segment; the feature matching degree calculation unit is used for obtaining a matching calculation result between the first image feature and the second image feature; the third effective text segment determining unit is used for determining a third effective text segment according to the matching calculation result, and the feature matching degree indicated by the matching calculation result corresponding to the third effective text segment is higher than a preset threshold.
According to an embodiment of the present disclosure, a feature matching degree calculation unit is configured to: under the condition that the second text image comprises M first effective text patches ordered based on the patch coordinates and the first text image comprises N candidate text patches ordered based on the patch coordinates, obtaining a matching calculation result between the first image feature of the Mth first effective text patch and the second image feature of the first N candidate text patches;
the third effective text patch determination unit includes: a feature matching degree calculation subunit, configured to determine, where the matching calculation result indicates that the feature matching degree between the Mth first effective text patch and the nth candidate text patch is higher than the preset threshold, whether the corresponding feature matching degrees between the first n-1 candidate text patches and the m-th first effective text patches are higher than the preset threshold; and a third effective text patch determination subunit, configured to take the first n candidate text patches as the third effective text patches where the corresponding feature matching degrees are higher than the preset threshold, where M and N are each integers greater than 1, n is an integer and n ∈ [1, N], and m is a positive integer with m ∈ {M-(n-1), …, M-1}.
According to an embodiment of the present disclosure, the text recognition module includes: a serialization coding sub-module, configured to perform serialization coding on the third image feature of the second effective text patch to obtain a basic coding sequence; a first coding sequence determination sub-module, configured to add first direction information to the basic coding sequence to obtain a first coding sequence; a second coding sequence determination sub-module, configured to add second direction information to the basic coding sequence to obtain a second coding sequence; and a text recognition sub-module, configured to perform text recognition based on the first coding sequence and the second coding sequence to obtain a text recognition result, wherein the first direction information indicates the same direction as the distribution direction of the second effective text patch, and the second direction information indicates the direction opposite to the distribution direction.
According to an embodiment of the present disclosure, a text line determination module to be recognized includes: the text detection sub-module is used for carrying out text detection on the first text image to obtain a text detection result, and the text detection result comprises boundary box coordinate information for selecting a text image area in the first text image; and the text line to be identified determining submodule is used for determining the text line to be identified according to the coordinate information.
According to an embodiment of the present disclosure, the candidate text patch determination module is configured to: perform segmentation processing on the text line to be identified based on a preset pixel scale to obtain the candidate text patches, wherein each candidate text patch corresponds to a text image area in which at least part of a character is located.
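Segmentation at a preset pixel scale can be sketched as fixed-width slicing of the detected text line image; the default `patch_width` of 32 pixels is an assumed value for the preset pixel scale, not one taken from the source.

```python
import numpy as np

def split_text_line(line_img, patch_width=32):
    # line_img: H x W (x C) array of the detected text line.
    # Each resulting patch corresponds to the text image area in which at
    # least part of a character is located; the final patch may be narrower.
    w = line_img.shape[1]
    return [line_img[:, x: x + patch_width] for x in range(0, w, patch_width)]
```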
According to an embodiment of the disclosure, the apparatus further includes a text recognition result combining module configured to: and combining text recognition results associated with at least two text images according to the time sequence relation between the at least two text images in the image sequence to obtain the text recognition results aiming at the image sequence.
It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the information involved all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 schematically illustrates a block diagram of an electronic device for text recognition according to an embodiment of the disclosure.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. The electronic device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running deep learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a text recognition method. For example, in some embodiments, the text recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the text recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the text recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A text recognition method, comprising:
determining a text line to be identified of a first text image in an image sequence;
dividing the text line to be identified to obtain candidate text patches;
determining a second effective text patch in the candidate text patches according to a first effective text patch in a second text image adjacent to the first text image in the image sequence; and
identifying the second effective text patch to obtain a text recognition result of the first text image,
wherein the determining, according to the first valid text patch in the second text image adjacent to the first text image in the image sequence, the second valid text patch in the candidate text patch includes:
determining a first effective text patch repeatedly appearing in the candidate text patches to obtain a third effective text patch, comprising: where the second text image comprises M first effective text patches ordered based on patch coordinates, the first text image comprises N candidate text patches ordered based on patch coordinates, and the feature matching degree between the Mth first effective text patch and the nth candidate text patch is higher than a preset threshold, determining whether the corresponding feature matching degrees between the first n-1 candidate text patches and the m-th first effective text patches are higher than the preset threshold; and where the corresponding feature matching degrees are higher than the preset threshold, taking the first n candidate text patches as the third effective text patches, where M and N are each integers greater than 1, n is an integer and n ∈ [1, N], and m is a positive integer with m ∈ {M-(n-1), …, M-1}; and
And deleting the third effective text patch from the candidate text patches to obtain the second effective text patch.
2. The method of claim 1, wherein the determining the repeated occurrence of the first valid text patch in the candidate text patch results in a third valid text patch, further comprising:
a first image feature of the first active text segment and a second image feature of the candidate text segment are determined.
3. The method of claim 2, wherein the identifying the second valid text segment to obtain the text identification result of the first text image includes:
performing serialization coding on the third image feature of the second effective text patch to obtain a basic coding sequence;
adding first direction information into the basic coding sequence to obtain a first coding sequence;
adding second direction information into the basic coding sequence to obtain a second coding sequence; and
performing text recognition based on the first code sequence and the second code sequence to obtain the text recognition result,
wherein the first direction information indicates the same direction as the distribution direction of the second effective text segment, and the second direction information indicates the opposite direction to the distribution direction.
4. The method of claim 1, wherein determining the text line to be identified in the first text image in the sequence of images comprises:
performing text detection on the first text image to obtain a text detection result, wherein the text detection result comprises boundary box coordinate information for selecting a text image area in the first text image; and
and determining the text line to be identified according to the coordinate information.
5. The method according to claim 1, wherein the segmenting the text line to be identified to obtain candidate text segments includes:
and carrying out segmentation processing on the text line to be identified based on a preset pixel scale to obtain the candidate text segment, wherein the candidate text segment corresponds to a text image area where at least part of characters are located.
6. The method of any one of claims 1 to 5, further comprising:
and combining text recognition results associated with at least two text images in the image sequence according to the time sequence relation between the at least two text images to obtain the text recognition results aiming at the image sequence.
7. A text recognition device, comprising:
The text line to be identified determining module is used for determining text lines to be identified of a first text image in the image sequence;
the candidate text segment determining module is used for dividing the text line to be identified to obtain candidate text segments;
a second effective text segment determining module, configured to determine a second effective text segment in the candidate text segment according to a first effective text segment in a second text image adjacent to the first text image in the image sequence; and
a text recognition module for recognizing the second effective text segment to obtain a text recognition result of the first text image,
wherein the second active text segment determination module includes:
the third valid text patch determining sub-module is configured to determine a first valid text patch that repeatedly appears in the candidate text patches, to obtain a third valid text patch, and is further configured to: the method comprises the steps that under the condition that M first effective text patches ordered based on patch coordinates are included in the second text image, N candidate text patches ordered based on patch coordinates are included in the first text image, and the feature matching degree of the Mth first effective text patch and the nth candidate text patch is higher than a preset threshold, whether the corresponding feature matching degree of the first N-1 candidate text patches and the mth first effective text patch is higher than the preset threshold is determined; and under the condition that the corresponding feature matching degree is higher than a preset threshold value, taking the first n candidate text patches as the third effective text patches, wherein M, N is an integer greater than 1, n is an integer and n epsilon [1, N ], M is a positive integer and m= { M- (n-1),. The number is M-1}; and
And a third effective text segment deleting sub-module, configured to delete the third effective text segment from the candidate text segment, to obtain the second effective text segment.
8. The apparatus of claim 7, wherein the third active text segment determination submodule comprises:
an image feature determining unit configured to determine a first image feature of the first effective text segment and a second image feature of the candidate text segment;
a feature matching degree calculating unit, configured to obtain a matching calculation result between the first image feature and the second image feature;
a third valid text patch determination unit configured to determine the third valid text patch according to the matching calculation result,
and the feature matching degree indicated by the matching calculation result corresponding to the third effective text segment is higher than a preset threshold.
9. The apparatus of claim 8, wherein the text recognition module comprises:
the serialization coding submodule is used for carrying out serialization coding on the third image characteristic of the second effective text fragment to obtain a basic coding sequence;
the first coding sequence determining submodule is used for adding first direction information into the basic coding sequence to obtain a first coding sequence;
A second coding sequence determining submodule, configured to add second direction information into the base coding sequence to obtain a second coding sequence; and
a text recognition sub-module for performing text recognition based on the first code sequence and the second code sequence to obtain the text recognition result,
wherein the first direction information indicates the same direction as the distribution direction of the second effective text segment, and the second direction information indicates the opposite direction to the distribution direction.
10. The apparatus of claim 7, wherein the text line to be identified determination module comprises:
the text detection sub-module is configured to perform text detection on the first text image to obtain a text detection result, wherein the text detection result comprises bounding box coordinate information for box-selecting a text image area in the first text image; and
and the text line to be identified determining submodule is used for determining the text line to be identified according to the coordinate information.
11. The apparatus of claim 7, wherein the candidate text segment determination module is to:
and carrying out segmentation processing on the text line to be identified based on a preset pixel scale to obtain the candidate text segment, wherein the candidate text segment corresponds to a text image area where at least part of characters are located.
12. The apparatus according to any one of claims 7 to 11, further comprising a text recognition result combining module configured to:
and combining text recognition results associated with at least two text images in the image sequence according to the time sequence relation between the at least two text images to obtain the text recognition results aiming at the image sequence.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text recognition method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the text recognition method of any one of claims 1-6.
CN202210776958.4A 2022-06-30 2022-06-30 Text recognition method and device, equipment, medium and product Active CN115171110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210776958.4A CN115171110B (en) 2022-06-30 2022-06-30 Text recognition method and device, equipment, medium and product

Publications (2)

Publication Number Publication Date
CN115171110A CN115171110A (en) 2022-10-11
CN115171110B true CN115171110B (en) 2023-08-22

Family

ID=83491467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210776958.4A Active CN115171110B (en) 2022-06-30 2022-06-30 Text recognition method and device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN115171110B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345886A (en) * 2017-01-23 2018-07-31 北京搜狗科技发展有限公司 A kind of video flowing text recognition method and device
CN108596168A (en) * 2018-04-20 2018-09-28 北京京东金融科技控股有限公司 For identification in image character method, apparatus and medium
CN109685050A (en) * 2018-11-12 2019-04-26 平安科技(深圳)有限公司 Character recognition method, device, equipment and storage medium
CN111325245A (en) * 2020-02-05 2020-06-23 腾讯科技(深圳)有限公司 Duplicate image recognition method and device, electronic equipment and computer-readable storage medium
CN111738263A (en) * 2020-08-24 2020-10-02 北京易真学思教育科技有限公司 Target detection method and device, electronic equipment and storage medium
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN111914825A (en) * 2020-08-03 2020-11-10 腾讯科技(深圳)有限公司 Character recognition method and device and electronic equipment
CN112882678A (en) * 2021-03-15 2021-06-01 百度在线网络技术(北京)有限公司 Image-text processing method, display method, device, equipment and storage medium
CN112929695A (en) * 2021-01-25 2021-06-08 北京百度网讯科技有限公司 Video duplicate removal method and device, electronic equipment and storage medium
CN113139093A (en) * 2021-05-06 2021-07-20 北京百度网讯科技有限公司 Video search method and apparatus, computer device, and medium
CN113642584A (en) * 2021-08-13 2021-11-12 北京百度网讯科技有限公司 Character recognition method, device, equipment, storage medium and intelligent dictionary pen
CN113822264A (en) * 2021-06-25 2021-12-21 腾讯科技(深圳)有限公司 Text recognition method and device, computer equipment and storage medium
CN113903036A (en) * 2021-11-10 2022-01-07 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment, medium and product
CN114283411A (en) * 2021-12-20 2022-04-05 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition model
CN114333838A (en) * 2022-01-06 2022-04-12 上海幻电信息科技有限公司 Method and system for correcting voice recognition text
CN114359932A (en) * 2022-01-11 2022-04-15 北京百度网讯科技有限公司 Text detection method, text recognition method and text recognition device
CN114495102A (en) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition network
CN114519858A (en) * 2022-02-16 2022-05-20 北京百度网讯科技有限公司 Document image recognition method and device, storage medium and electronic equipment
CN114529927A (en) * 2022-01-27 2022-05-24 北京鼎事兴教育咨询有限公司 Character recognition method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ramya R S, et al.; "Feature Extraction and Duplicate Detection for Text Mining: A Survey"; Global Journal of Computer Science and Technology; 2016-12-31; entire document *

Also Published As

Publication number Publication date
CN115171110A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
CN114549840B (en) Training method of semantic segmentation model and semantic segmentation method and device
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN115578735B (en) Text detection method and training method and device of text detection model
CN114519858B (en) Document image recognition method and device, storage medium and electronic equipment
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN115063875A (en) Model training method, image processing method, device and electronic equipment
CN113436100A (en) Method, apparatus, device, medium and product for repairing video
CN114495102A (en) Text recognition method, and training method and device of text recognition network
CN114511743B (en) Detection model training, target detection method, device, equipment, medium and product
CN115311469A (en) Image labeling method, training method, image processing method and electronic equipment
CN114724133A (en) Character detection and model training method, device, equipment and storage medium
CN114022865A (en) Image processing method, apparatus, device and medium based on lane line recognition model
CN113326766A (en) Training method and device of text detection model and text detection method and device
CN115171110B (en) Text recognition method and device, equipment, medium and product
CN115565177B (en) Character recognition model training, character recognition method, device, equipment and medium
CN114677566B (en) Training method of deep learning model, object recognition method and device
CN114549904B (en) Visual processing and model training method, device, storage medium and program product
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
JP2024507308A (en) Image sample generation method, text recognition method, device, electronic device, storage medium and computer program
CN114612651A (en) ROI detection model training method, detection method, device, equipment and medium
CN114463734A (en) Character recognition method and device, electronic equipment and storage medium
CN114612971A (en) Face detection method, model training method, electronic device, and program product
CN113379592A (en) Method and device for processing sensitive area in picture and electronic equipment
CN114494686A (en) Text image correction method, text image correction device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant