CN115171110A

CN115171110A - Text recognition method, apparatus, device, medium, and product

Info

Publication number: CN115171110A
Application number: CN202210776958.4A
Authority: CN
Inventors: 章成全; 乔美娜; 吕鹏原; 刘珊珊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-10-11
Anticipated expiration: 2042-06-30
Also published as: CN115171110B

Abstract

The disclosure provides a text recognition method, apparatus, device, medium, and product. It relates to the technical field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and can be applied to scenarios such as optical character recognition (OCR). The specific implementation comprises: determining a text line to be recognized of a first text image in an image sequence; segmenting the text line to be recognized to obtain candidate text fragments; determining a second valid text fragment among the candidate text fragments according to a first valid text fragment in a second text image adjacent to the first text image in the image sequence; and recognizing the second valid text fragment to obtain a text recognition result of the first text image.

Description

Text recognition method, apparatus, device, medium, and product

Technical Field

The present disclosure relates to the technical field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and can be applied to scenarios such as Optical Character Recognition (OCR).

Background

Text recognition is widely used in computer vision, image processing, digital media technology, intelligent translation, autonomous driving, and other scenarios. However, in some scenarios, text recognition suffers from poor recognition results and low timeliness.

Disclosure of Invention

The present disclosure provides a text recognition method and apparatus, device, medium and product.

According to an aspect of the present disclosure, there is provided a text recognition method including: determining a text line to be recognized of a first text image in an image sequence; segmenting the text line to be recognized to obtain candidate text fragments; determining a second valid text fragment among the candidate text fragments according to a first valid text fragment in a second text image adjacent to the first text image in the image sequence; and recognizing the second valid text fragment to obtain a text recognition result of the first text image.

According to another aspect of the present disclosure, there is provided a text recognition apparatus including: a to-be-recognized text line determining module configured to determine a text line to be recognized of a first text image in an image sequence; a candidate text fragment determining module configured to segment the text line to be recognized to obtain candidate text fragments; a second valid text fragment determining module configured to determine a second valid text fragment among the candidate text fragments according to a first valid text fragment in a second text image adjacent to the first text image in the image sequence; and a text recognition module configured to recognize the second valid text fragment to obtain a text recognition result of the first text image.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the text recognition method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text recognition method described above.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program, when executed by a processor, implementing the text recognition method described above.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 schematically illustrates a system architecture of a text recognition method and apparatus according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow diagram of a text recognition method according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a flow diagram of a text recognition method according to yet another embodiment of the present disclosure;

FIG. 4 schematically shows a schematic diagram of a text recognition process according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of a text recognition apparatus according to an embodiment of the present disclosure;

FIG. 6 schematically shows a block diagram of an electronic device for text recognition according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).

The embodiment of the disclosure provides a text recognition method. The method of the embodiment comprises the following steps: determining a text line to be recognized of a first text image in an image sequence, segmenting the text line to be recognized to obtain a candidate text fragment, determining a second effective text fragment in the candidate text fragment according to a first effective text fragment in a second text image adjacent to the first text image in the image sequence, and recognizing the second effective text fragment to obtain a text recognition result of the first text image.

Fig. 1 schematically shows a system architecture of a text recognition method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.

The system architecture 100 according to this embodiment may include a requesting terminal 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between requesting terminals 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The server 103 may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, and may be a cloud server providing basic cloud computing services such as cloud services, cloud computing, network services, and middleware services.

The requesting terminal 101 interacts with the server 103 through the network 102 to receive or transmit data and the like. The requesting terminal 101 is used, for example, to initiate a text recognition request to the server 103 and to send an image sequence to be recognized to the server 103, where the image sequence comprises multiple frames of text images having a temporal order.

The server 103 may be a server that provides various services, and may be, for example, a background processing server (only an example) that performs text recognition processing according to a text recognition request transmitted by the requesting terminal 101.

For example, in response to a text recognition request acquired from the requesting terminal 101, the server 103 determines a text line to be recognized of a first text image in the image sequence, segments the text line to be recognized to obtain candidate text fragments, determines a second valid text fragment among the candidate text fragments according to a first valid text fragment in a second text image adjacent to the first text image in the image sequence, and recognizes the second valid text fragment to obtain a text recognition result of the first text image.

It should be noted that the text recognition method provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the text recognition apparatus provided by the embodiment of the present disclosure may be disposed in the server 103. The text recognition method provided by the embodiments of the present disclosure may also be performed by a server or a cluster of servers that is different from the server 103 and is capable of communicating with the requesting terminal 101 and/or the server 103. Accordingly, the text recognition apparatus provided in the embodiments of the present disclosure may also be disposed in a server or a server cluster different from the server 103 and capable of communicating with the requesting terminal 101 and/or the server 103.

It should be understood that the number of requesting terminals, networks, and servers in fig. 1 is merely illustrative. There may be any number of requesting terminals, networks, and servers, as desired for an implementation.

The embodiment of the present disclosure provides a text recognition method, and the text recognition method according to the exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 4 in conjunction with the system architecture of fig. 1. The text recognition method of the embodiment of the present disclosure may be performed by the server 103 shown in fig. 1, for example.

Fig. 2 schematically shows a flow chart of a text recognition method according to an embodiment of the present disclosure.

As shown in fig. 2, the text recognition method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S240.

In operation S210, a text line to be recognized of a first text image in an image sequence is determined.

In operation S220, the text line to be recognized is segmented to obtain candidate text fragments.

In operation S230, a second valid text fragment among the candidate text fragments is determined according to a first valid text fragment in a second text image adjacent to the first text image in the image sequence.

In operation S240, the second valid text fragment is recognized to obtain a text recognition result of the first text image.

A text line to be recognized of the first text image in the image sequence is determined, and the text line to be recognized is segmented based on a preset pixel scale to obtain candidate text fragments, where each candidate text fragment corresponds to a text image region in which at least part of a character is located.

A second valid text fragment among the candidate text fragments is determined according to the first valid text fragment in a second text image adjacent to the first text image in the image sequence. The second text image may be, for example, the text image of the frame preceding the first text image. For example, a first valid text fragment that repeatedly appears among the candidate text fragments may be determined, and the repeated first valid text fragment may be deleted from the candidate text fragments to obtain the second valid text fragment. The second valid text fragment is then recognized to obtain a text recognition result of the first text image.

By screening out the second valid text fragment in the first text image and recognizing it, the text recognition result of the first text image is obtained. This can effectively improve the efficiency and accuracy of text image recognition, and can improve its real-time performance.

The following describes example flows of the individual operations of the text recognition method of this embodiment.

Operation S210 shown in FIG. 2 further includes: performing text detection on the first text image to obtain a text detection result, where the text detection result includes coordinate information of a bounding box that frames a text image region in the first text image; and determining the text line to be recognized according to the coordinate information. The text image region may be, for example, the image region where a character is located.

The types of characters may include standard character types and custom character types. For example, the standard character types may include simplified Chinese characters, traditional Chinese characters, English words, digits, and the like. Accordingly, the text detection method may include, for example, methods for detecting single Chinese characters, Chinese phrases, English letters, English words, and the like. Text detection may be implemented by, for example, the EAST (Efficient and Accurate Scene Text) detection algorithm or Fast R-CNN (Fast Region-based Convolutional Neural Network), which is not limited in this embodiment.

By way of example, before text detection the first text image may be preprocessed by grayscale conversion, binarization, noise reduction, tilt correction, word segmentation, and the like. The first text image may also be subjected to display enhancement processing, which may include, for example, contrast enhancement and sharpening. For example, display enhancement of the first text image may be implemented by adjusting a contrast parameter, a brightness parameter, and a sharpening parameter of the first text image.
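The preprocessing steps named above can be illustrated with a minimal sketch. It is not taken from the disclosure: the function names, the BT.601 luma weights, the linear contrast formula, and the fixed threshold of 128 are all assumptions chosen for illustration, and the image is represented as nested Python lists rather than a real image type.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def to_gray(image: List[List[Pixel]]) -> List[List[int]]:
    """Grayscale conversion using the ITU-R BT.601 luma weights (an assumed choice)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def adjust_contrast(gray: List[List[int]], alpha: float = 1.5,
                    beta: float = 0.0) -> List[List[int]]:
    """Linear enhancement around mid-gray: alpha is a hypothetical contrast
    parameter, beta a hypothetical brightness parameter. Output is clipped."""
    return [[min(255, max(0, round(alpha * (p - 128) + 128 + beta))) for p in row]
            for row in gray]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Fixed-threshold binarization (the threshold value is an assumption)."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray]
```

A real system would more likely rely on an image library and an adaptive threshold; the sketch only shows where the contrast and brightness parameters enter.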

The text line to be recognized may be determined based on the bounding box coordinate information indicated by the text detection result. The bounding box may be, for example, a rectangular box for enclosing the text image region, and the bounding box coordinate information may include, for example, the abscissa and ordinate parameters of the vertices of the rectangular box. For example, the text line to be recognized may be determined based on a bounding box meeting a preset position condition. The text lines to be recognized may be text lines arranged horizontally, text lines arranged longitudinally, or text lines arranged in any direction, which is not limited in this embodiment.
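As a rough illustration of deriving a text line from detected bounding boxes, the sketch below assumes a hypothetical position condition: a box is kept when its vertical center falls inside a given band, and the text line is the rectangle enclosing all kept boxes. The disclosure does not specify the preset position condition, so both the condition and the names used here are assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def text_line_from_boxes(boxes: List[Box], band: Tuple[int, int]) -> Optional[Box]:
    """Return the rectangle enclosing every box whose vertical center lies in `band`.

    `band` is a hypothetical preset position condition given as (y_low, y_high).
    Returns None when no box satisfies the condition.
    """
    y_low, y_high = band
    selected = [b for b in boxes if y_low <= (b[1] + b[3]) / 2 <= y_high]
    if not selected:
        return None
    return (
        min(b[0] for b in selected),
        min(b[1] for b in selected),
        max(b[2] for b in selected),
        max(b[3] for b in selected),
    )
```

This handles only horizontally arranged lines with axis-aligned boxes; lines in other directions would need a different condition.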

Operation S220 shown in FIG. 2 further includes: segmenting the text line to be recognized based on a preset pixel scale to obtain candidate text fragments, where each candidate text fragment corresponds to a text image region in which at least part of a character is located. A candidate text fragment may correspond to a text image region where a single character is located, or to a text image region where part of a character is located.
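One simple way to picture segmentation by a preset pixel scale is fixed-width slicing of a horizontal text line, sketched below. The assumption that the pixel scale is a slice width along the x axis, and the function name `split_text_line`, are illustrative and not stated in the disclosure.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def split_text_line(line_box: Box, pixel_scale: int) -> List[Box]:
    """Slice a horizontal text line into candidate fragments of width `pixel_scale`.

    The last fragment may be narrower when the line width is not a multiple
    of the scale.
    """
    if pixel_scale <= 0:
        raise ValueError("pixel_scale must be positive")
    x_min, y_min, x_max, y_max = line_box
    fragments: List[Box] = []
    x = x_min
    while x < x_max:
        fragments.append((x, y_min, min(x + pixel_scale, x_max), y_max))
        x += pixel_scale
    return fragments
```

For a line box `(0, 0, 100, 32)` and a scale of 32 pixels, this yields three full-width fragments and one 4-pixel remainder, so a fragment may cover a single character or only part of one.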

Segmenting the text line to be recognized into candidate text fragments makes it possible to accurately distinguish text regions that repeatedly appear in the text line to be recognized, which helps improve the accuracy and computational efficiency of image sequence recognition.

A second valid text fragment among the candidate text fragments is determined according to the first valid text fragment in the second text image. For example, a first valid text fragment that repeatedly appears among the candidate text fragments may be determined according to the first valid text fragment in the second text image, and the repeated first valid text fragment may be deleted from the candidate text fragments to obtain the second valid text fragment.

The second valid text fragment is recognized to obtain a text recognition result of the first text image. As an example, when there is more than one second valid text fragment, the text recognition results associated with the second valid text fragments may be combined according to the positional relationship between them to obtain the text recognition result of the first text image.
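Combining per-fragment results by position can be sketched as follows, assuming a left-to-right horizontal line so that the positional relationship reduces to sorting by the left edge of each fragment. The function name and the pairing of a box with its recognized text are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def merge_fragment_results(results: List[Tuple[Box, str]]) -> str:
    """Concatenate per-fragment texts in left-to-right order of their boxes."""
    ordered = sorted(results, key=lambda item: item[0][0])
    return "".join(text for _, text in ordered)
```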

According to the embodiment of the present disclosure, a text line to be recognized of a first text image in an image sequence is determined; the text line to be recognized is segmented to obtain candidate text fragments; a second valid text fragment among the candidate text fragments is determined according to a first valid text fragment in a second text image adjacent to the first text image in the image sequence; and the second valid text fragment is recognized to obtain a text recognition result of the first text image. This can effectively improve the efficiency and accuracy of text image recognition, helps improve its real-time performance, and helps reduce the computing capability required of the recognition hardware.

Fig. 3 schematically shows a flow chart of a text recognition method according to another embodiment of the present disclosure.

As shown in fig. 3, the text recognition method 300 of the embodiment of the present disclosure may include, for example, operations S310 to S340.

In operation S310, for a first text image in an image sequence, a text line to be recognized in the first text image is determined.

In operation S320, the text line to be recognized is segmented to obtain candidate text fragments.

In operation S330, a first valid text fragment that repeatedly appears among the candidate text fragments is determined to obtain a third valid text fragment, and the third valid text fragment is deleted from the candidate text fragments to obtain a second valid text fragment.

In operation S340, the second valid text fragment is recognized to obtain a text recognition result for the first text image.

It is to be appreciated that operation S330 is a further extension to operation S230. Operations S310, S320, and S340 are similar to operations S210, S220, and S240, respectively, and are not described again for brevity.

The following describes example flows of the individual operations of the text recognition method of this embodiment.

Operation S330 shown in FIG. 3 further includes: determining a first image feature of the first valid text fragment and a second image feature of the candidate text fragment; obtaining a matching calculation result between the first image feature and the second image feature; and determining the third valid text fragment according to the matching calculation result, where the feature matching degree indicated by the matching calculation result corresponding to the third valid text fragment is higher than a preset threshold.

Determining the first valid text fragment that repeatedly appears among the candidate text fragments and deleting it helps improve the recognition accuracy and computational efficiency of text image recognition, and helps improve its real-time performance.

By way of example, feature extraction may be performed on the text line to be recognized in the first text image to obtain initial image features of the text line to be recognized. According to the text detection result for the first text image, attention region localization may be performed on the text line to be recognized to obtain attention image features that contain the probability of character occurrence positions. The initial image features and the attention image features may then be fused to obtain fused image features of the text line to be recognized.

When there is more than one candidate text fragment, the fused image feature associated with each candidate text fragment is obtained as the second image feature, based on the fused image features of the text line to be recognized and the position information of each candidate text fragment.

When performing the feature matching calculation on the first image feature and the second image feature, a fully connected layer of the trained text recognition model may be used, for example, to output a similarity evaluation value between the first image feature and the second image feature as the matching calculation result. For example, the first image feature and the second image feature may be subtracted element by element at corresponding positions to obtain an intermediate feature vector. The absolute value of the intermediate feature vector may then be taken, and the fully connected layer may output the similarity evaluation value based on the resulting vector.
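The element-wise subtraction, absolute value, and fully connected layer described above can be sketched as below. The single-output layer, the sigmoid that maps its output to a value between 0 and 1, and the parameter names `weights` and `bias` are assumptions; in practice the parameters would come from the trained text recognition model rather than being passed in by hand.

```python
import math
from typing import Sequence


def similarity(first_feature: Sequence[float], second_feature: Sequence[float],
               weights: Sequence[float], bias: float) -> float:
    """Similarity evaluation value from |f1 - f2| passed through one linear unit.

    `weights` and `bias` stand in for trained fully connected layer parameters
    (hypothetical here); the sigmoid is an assumed output activation.
    """
    if not (len(first_feature) == len(second_feature) == len(weights)):
        raise ValueError("feature and weight vectors must have equal length")
    # Intermediate feature vector: element-wise difference, then absolute value.
    abs_diff = [abs(a - b) for a, b in zip(first_feature, second_feature)]
    logit = sum(w * d for w, d in zip(weights, abs_diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

With negative weights, a larger absolute difference lowers the score, so identical features score highest; comparing the returned value against the preset threshold then gives the match decision.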

Operation S330 shown in FIG. 3 further includes: when the second text image includes M first valid text fragments sorted by region coordinates and the first text image includes N candidate text fragments sorted by region coordinates, obtaining a matching calculation result between the first image feature of the Mth first valid text fragment and the second image features of the first n candidate text fragments.

When the matching calculation result indicates that the feature matching degree between the Mth first valid text fragment and the nth candidate text fragment is higher than a preset threshold, it is determined whether the corresponding feature matching degrees between the first n-1 candidate text fragments and the mth first valid text fragments are higher than the preset threshold; when the corresponding feature matching degrees are all higher than the preset threshold, the first n candidate text fragments are taken as the third valid text fragment. Here M and N are integers greater than 1, n is an integer with n ∈ [1, N], and m is a positive integer with m ∈ {M-(n-1), ..., M-1}.

This approach helps to quickly screen out the first valid text fragments that repeatedly appear among the candidate text fragments. It can effectively improve the accuracy and efficiency of text image recognition, helps reduce the computing capability required of the recognition hardware, helps realize a low-power, real-time text recognition scheme, and helps provide reliable data support for application scenarios such as computer vision, image processing, digital media technology, intelligent translation, and autonomous driving.

The region coordinates may be, for example, the coordinates of a bounding box enclosing the text image region corresponding to a text fragment, and the bounding box coordinate information may include, for example, the horizontal and vertical coordinates of the vertices of the bounding box. For example, a feature matching calculation may be performed based on the first image feature of the Mth first valid text fragment and the second image feature of the 1st candidate text fragment to obtain a matching calculation result. When the matching calculation result indicates that the feature matching degree between the Mth first valid text fragment and the 1st candidate text fragment is higher than the preset threshold, the 1st candidate text fragment is determined to be a repeated first valid text fragment and is deleted from the text line to be recognized.

The image features may include, for example, color features, texture features, grayscale features, edge features, and the like, which is not limited in this embodiment.

When the feature matching degree between the Mth first valid text fragment and the 1st candidate text fragment is lower than or equal to the preset threshold, the feature matching degrees between the Mth first valid text fragment and the subsequent candidate text fragments are determined in sequence until the feature matching degree between the Mth first valid text fragment and the nth candidate text fragment is higher than the preset threshold, where n is an integer greater than 1 and less than or equal to N.

In response to the feature matching degree between the Mth first valid text fragment and the nth candidate text fragment being higher than the preset threshold, it is determined whether the corresponding feature matching degrees between the first n-1 candidate text fragments in the first text image and the n-1 first valid text fragments preceding the Mth first valid text fragment are higher than the preset threshold. When these corresponding feature matching degrees are all higher than the preset threshold, the first n candidate text fragments in the first text image are taken as the repeated first valid text fragments, that is, as the third valid text fragment.

The third valid text fragment is deleted from the candidate text fragments to obtain the second valid text fragment. The second valid text fragment is recognized to obtain a text recognition result for the first text image.
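The screening procedure described above can be sketched as a search for the point where the tail of the previous frame's valid fragments overlaps the head of the current frame's candidates. The sketch is illustrative only: `match` stands for any feature matching function returning a matching degree, the function names are hypothetical, and the behavior when the backward check fails (the scan simply continues with the next candidate) is an assumption, since the text does not describe that case.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")  # a fragment's image feature, in whatever form `match` accepts


def count_repeated_prefix(valid_prev: Sequence[F], candidates: Sequence[F],
                          match: Callable[[F, F], float], threshold: float) -> int:
    """Return n if the first n candidates repeat the last n valid fragments, else 0."""
    m_total = len(valid_prev)
    if m_total == 0:
        return 0
    last = valid_prev[-1]  # the Mth first valid text fragment
    for n in range(1, len(candidates) + 1):
        if match(last, candidates[n - 1]) <= threshold:
            continue  # keep scanning for a candidate matching the Mth fragment
        if n > m_total:
            break  # fewer than n previous fragments, so the check cannot be made
        # Candidates 1..n-1 must match valid fragments M-(n-1)..M-1 pairwise.
        if all(match(valid_prev[m_total - n + i], candidates[i]) > threshold
               for i in range(n - 1)):
            return n
    return 0


def second_valid_fragments(valid_prev: Sequence[F], candidates: Sequence[F],
                           match: Callable[[F, F], float],
                           threshold: float) -> List[F]:
    """Delete the repeated leading candidates and keep the rest."""
    n = count_repeated_prefix(valid_prev, candidates, match, threshold)
    return list(candidates[n:])
```

As a toy usage, letting single characters stand in for features and using exact equality as the match, previous fragments `a b c d e` and candidates `c d e f g` give n = 3, leaving `f g` to be recognized.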

Operation S340 shown in FIG. 3 further includes: performing serialization encoding on a third image feature of the second valid text fragment to obtain a base encoded sequence; adding first direction information to the base encoded sequence to obtain a first encoded sequence; adding second direction information to the base encoded sequence to obtain a second encoded sequence; and performing text recognition based on the first encoded sequence and the second encoded sequence to obtain the text recognition result. The first direction information indicates the same direction as the distribution direction of the second valid text fragment, and the second direction information indicates the direction opposite to the distribution direction.

Adding the first direction information and the second direction information to the base encoded sequence can effectively improve the accuracy of text image recognition and helps provide reliable data support for applications such as scene text recognition for autonomous driving, photo translation, intelligent retail product inspection, intelligent translation pens, and education tablets.

For example, an encoder of the text recognition model may be used to perform serialization encoding on the third image feature of the second valid text fragment to obtain the base encoded sequence. The encoder may be implemented by, for example, a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU), which is not limited in this embodiment.

First direction information may be added to the base encoded sequence to obtain a first encoded sequence, and second direction information may be added to the base encoded sequence to obtain a second encoded sequence. The first direction information indicates the same direction as the distribution direction of the second valid text fragment, and the second direction information indicates the direction opposite to that of the first direction information. For example, when the second valid text fragments are distributed from left to right, the first direction information indicates the left-to-right direction and the second direction information indicates the right-to-left direction.

A decoder of the text recognition model may be used to decode the first encoded sequence to obtain a first text recognition result, and to decode the second encoded sequence to obtain a second text recognition result. The decoder may be implemented by, for example, a Transformer or an attention mechanism (Attention), which is not limited in this embodiment.

A text recognition result for the first text image may be derived based on the first text recognition result and the second text recognition result. The first text recognition result and the second text recognition result may indicate, for example, a probability of a word corresponding to the second valid text fragment, and the text recognition result corresponding to the maximum probability of a word may be used as the text recognition result for the first text image.
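The two-direction decoding and selection described above can be sketched as follows. Several points are assumptions made for illustration: direction information is modeled as a marker prepended to the base encoded sequence, the decoder is any callable that returns a text together with its probability, and both decoding passes are assumed to return text already in reading order. The disclosure does not fix any of these details.

```python
from typing import Callable, List, Sequence, Tuple

L2R = "<l2r>"  # hypothetical marker for the distribution direction
R2L = "<r2l>"  # hypothetical marker for the opposite direction

Decoder = Callable[[List[object]], Tuple[str, float]]  # returns (text, probability)


def with_direction(base_sequence: Sequence[float], direction: str) -> List[object]:
    """Attach direction information by prepending a marker (an assumed mechanism)."""
    return [direction, *base_sequence]


def recognize_bidirectional(base_sequence: Sequence[float], decode: Decoder) -> str:
    """Decode both directed sequences and keep the more probable result."""
    first = decode(with_direction(base_sequence, L2R))
    second = decode(with_direction(base_sequence, R2L))
    return max(first, second, key=lambda result: result[1])[0]
```

In a real model the direction information would more plausibly be an embedding added inside the network; the marker form is used here only to keep the sketch runnable without a trained model.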

When the image sequence includes at least two text images, the text recognition results associated with the at least two text images may be combined according to a time sequence relationship between the at least two text images in the image sequence to obtain a text recognition result for the image sequence.

The repeated first valid text fragment is deleted from the candidate text fragments to obtain the second valid text fragment, and the second valid text fragment is recognized to obtain the text recognition result for the first text image. This can effectively improve the accuracy of text image recognition, reduce the interference of repeated text regions with the text recognition result, and reduce the computing resources consumed by repeated text regions. It can also reduce the imaging coverage required of the imaging device and the computing capability required of the recognition hardware, helps realize a lightweight, low-power, real-time text recognition scheme, and helps enable diverse product forms in scenarios such as mobile terminals and intelligent hardware.

FIG. 4 is a schematic diagram of a text recognition process according to an embodiment of the present disclosure.

As shown in FIG. 4, a first text image 401 in the image sequence is processed to obtain a text line to be recognized in the first text image 401 (the text line to be recognized may be, for example, "accompanied by security control" in the image 401). The text line to be recognized is segmented to obtain candidate text segments 402, each of which may be represented by a dashed rectangle in 402.

The second text image 403 adjacent to the first text image 401 may be, for example, the text image of the frame preceding the first text image 401. According to the first valid text segments 404 in the second text image 403, the first valid text segments that repeatedly appear among the candidate text segments 402 are determined, resulting in third valid text segments (which may, for example, comprise the first 3 candidate text segments in 402). The third valid text segments are deleted from the candidate text segments 402, resulting in second valid text segments 405.

By way of example, a first image feature of the first valid text segment and a second image feature of the candidate text segment may be determined. Feature matching calculation is performed based on the first image feature and the second image feature to obtain a matching calculation result. A candidate text segment whose feature matching degree, as indicated by the matching calculation result, is higher than a preset threshold is taken as a third valid text segment. Each text segment may, for example, correspond to a text image region in which a single character or a part of a character is located.
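The threshold-based matching and the subsequent deletion might be sketched as follows. This is an illustration only: the disclosure does not name a similarity measure, so cosine similarity, the default threshold of 0.9, and the function names `cosine_similarity`, `repeated_candidate_indices`, and `remove_repeated` are all assumptions of this sketch.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def repeated_candidate_indices(
    first_features: Sequence[Sequence[float]],
    candidate_features: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value for the preset threshold
) -> List[int]:
    """Indices of candidates whose best match to any first valid segment
    exceeds the threshold (these play the role of third valid segments)."""
    repeated = []
    for idx, cand in enumerate(candidate_features):
        best = max((cosine_similarity(cand, ref) for ref in first_features),
                   default=0.0)
        if best > threshold:
            repeated.append(idx)
    return repeated


def remove_repeated(candidates: Sequence, repeated: Sequence[int]) -> list:
    """Drop the repeated candidates; the remainder are the second valid segments."""
    skip = set(repeated)
    return [c for i, c in enumerate(candidates) if i not in skip]
```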

The second valid text segments 405 are recognized, resulting in a text recognition result 406 for the first text image 401 (the text recognition result may, for example, be "fully controllable").

By screening out the second valid text segments in the first text image and recognizing them, the accuracy and efficiency of text image recognition can be effectively improved, the text image recognition performance can be effectively improved, and diversified product form requirements in scenarios such as mobile terminals and intelligent hardware can be better met.

FIG. 5 schematically shows a block diagram of a text recognition apparatus according to an embodiment of the present disclosure.

As shown in FIG. 5, the text recognition apparatus 500 of the embodiment of the present disclosure includes, for example, a to-be-recognized text line determination module 510, a candidate text segment determination module 520, a second valid text segment determination module 530, and a text recognition module 540.

The to-be-recognized text line determination module 510 is configured to determine a text line to be recognized of a first text image in an image sequence; the candidate text segment determination module 520 is configured to segment the text line to be recognized to obtain candidate text segments; the second valid text segment determination module 530 is configured to determine second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence; and the text recognition module 540 is configured to recognize the second valid text segments to obtain a text recognition result of the first text image.

The second valid text segments in the first text image are screened out and recognized to obtain the text recognition result for the first text image. This can effectively improve the efficiency and accuracy of text image recognition, improve its real-time performance, and reduce the computing capability required of the recognition hardware.

According to an embodiment of the present disclosure, the second valid text segment determination module includes: a third valid text segment determination sub-module configured to determine the first valid text segments that repeatedly appear among the candidate text segments, to obtain third valid text segments; and a third valid text segment deletion sub-module configured to delete the third valid text segments from the candidate text segments to obtain the second valid text segments.

According to an embodiment of the present disclosure, the third valid text segment determination sub-module includes: an image feature determination unit configured to determine a first image feature of the first valid text segment and a second image feature of the candidate text segment; a feature matching degree calculation unit configured to obtain a matching calculation result between the first image feature and the second image feature; and a third valid text segment determination unit configured to determine the third valid text segment according to the matching calculation result, wherein the feature matching degree indicated by the matching calculation result corresponding to the third valid text segment is higher than a preset threshold.

According to an embodiment of the present disclosure, the feature matching degree calculation unit is configured to: in a case where the second text image comprises M first valid text segments sorted by region coordinates and the first text image comprises N candidate text segments sorted by region coordinates, obtain a matching calculation result between a first image feature of the Mth first valid text segment and second image features of the first N candidate text segments;

the third valid text segment determination unit includes: a feature matching degree calculation subunit configured to, in a case where the matching calculation result indicates that the feature matching degree between the Mth first valid text segment and the nth candidate text segment is higher than the preset threshold, determine whether the feature matching degrees between the first n-1 candidate text segments and the corresponding mth first valid text segments are higher than the preset threshold; and a third valid text segment determination subunit configured to, in a case where the corresponding feature matching degrees are higher than the preset threshold, take the first n candidate text segments as the third valid text segments, where M and N are integers greater than 1, n is an integer and n ∈ [1, N], and m is a positive integer and m ∈ {M-(n-1), ..., M-1}.
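The index relationship above can be illustrated with the following sketch, which treats the Mth (last) first valid text segment as an anchor and searches for the candidate it matches. Several points are assumptions of this sketch rather than statements of the disclosure: cosine similarity as the matching degree, a threshold of 0.9, a one-to-one pairing of the jth candidate (j = 1, ..., n-1) with the first valid text segment of index m = M-(n-1)+(j-1), and returning the smallest n that passes all checks. The names `cosine` and `repeated_prefix_length` are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def repeated_prefix_length(
    first_feats: Sequence[Sequence[float]],  # M first valid segments, sorted
    cand_feats: Sequence[Sequence[float]],   # N candidate segments, sorted
    threshold: float = 0.9,                  # assumed preset threshold
) -> int:
    """Return n such that the first n candidates repeat the last n first
    valid segments, or 0 if no such n is found."""
    M, N = len(first_feats), len(cand_feats)
    if M == 0:
        return 0
    anchor = first_feats[M - 1]  # the Mth first valid segment
    for n in range(1, min(N, M) + 1):
        # Step 1: the anchor must match the nth candidate.
        if cosine(anchor, cand_feats[n - 1]) <= threshold:
            continue
        # Step 2: candidate j pairs with segment m = M - n + j (1-based),
        # so m runs over M-(n-1), ..., M-1 for j = 1, ..., n-1.
        if all(
            cosine(first_feats[M - n + j - 1], cand_feats[j - 1]) > threshold
            for j in range(1, n)
        ):
            return n
    return 0
```

With the returned n, the second valid text segments would be the candidates from position n+1 onward, for example `cand_feats[n:]` in this sketch.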

According to an embodiment of the present disclosure, the text recognition module includes: a serialization encoding sub-module configured to perform serialization encoding on third image features of the second valid text segments to obtain a base encoded sequence; a first encoded sequence determination sub-module configured to add first direction information to the base encoded sequence to obtain a first encoded sequence; a second encoded sequence determination sub-module configured to add second direction information to the base encoded sequence to obtain a second encoded sequence; and a text recognition sub-module configured to perform text recognition based on the first encoded sequence and the second encoded sequence to obtain the text recognition result, wherein the first direction information indicates a direction that is the same as the distribution direction of the second valid text segments, and the second direction information indicates a direction opposite to the distribution direction.
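One possible way of adding direction information is sketched below. The disclosure does not state how the direction information is represented or combined, so this sketch assumes a fixed direction embedding that is added element-wise to every step of the base encoded sequence. The name `add_direction` and the constants `FORWARD` and `BACKWARD` are hypothetical.

```python
from typing import List

Vector = List[float]

# Hypothetical direction embeddings; in practice these might be learned.
FORWARD: Vector = [1.0, 0.0]   # same as the distribution direction
BACKWARD: Vector = [0.0, 1.0]  # opposite to the distribution direction


def add_direction(base_sequence: List[Vector], direction: Vector) -> List[Vector]:
    """Add one direction embedding to every step of the base sequence."""
    for step in base_sequence:
        if len(step) != len(direction):
            raise ValueError("feature and direction dimensions differ")
    return [[v + d for v, d in zip(step, direction)] for step in base_sequence]


base = [[0.5, 0.25], [0.0, 1.0]]               # toy base encoded sequence
first_sequence = add_direction(base, FORWARD)    # first encoded sequence
second_sequence = add_direction(base, BACKWARD)  # second encoded sequence
```

Both sequences share the same base encoding, so the serialization encoding is computed once and only the lightweight direction step is repeated.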

According to an embodiment of the present disclosure, the to-be-recognized text line determination module includes: a text detection sub-module configured to perform text detection on the first text image to obtain a text detection result, the text detection result including bounding box coordinate information used to frame a text image region in the first text image; and a to-be-recognized text line determination sub-module configured to determine the text line to be recognized according to the bounding box coordinate information.
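As a simple illustration, extracting the text line from bounding box coordinates might look like the sketch below. It assumes an axis-aligned box given as `(x_min, y_min, x_max, y_max)` with exclusive maxima and an image stored as a list of pixel rows; the function name `crop_text_line` and this box layout are assumptions, and the detector that produces the box is outside the scope of the sketch.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values (toy grayscale image)


def crop_text_line(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image framed by an axis-aligned bounding box."""
    x_min, y_min, x_max, y_max = box
    if x_min >= x_max or y_min >= y_max:
        raise ValueError("bounding box must have positive width and height")
    return [row[x_min:x_max] for row in image[y_min:y_max]]
```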

According to an embodiment of the present disclosure, the candidate text segment determination module is configured to: segment the text line to be recognized based on a preset pixel scale to obtain the candidate text segments, wherein each candidate text segment corresponds to a text image region in which at least part of a character is located.
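The segmentation might be sketched as follows, under the assumption that the preset pixel scale is a fixed segment width in pixels and that the text line is cut into consecutive, non-overlapping column ranges. The function name `segment_text_line` is hypothetical.

```python
from typing import List, Tuple


def segment_text_line(line_width: int, segment_width: int) -> List[Tuple[int, int]]:
    """Split a text line of the given pixel width into [start, end) column
    ranges of at most segment_width pixels; the last range may be narrower."""
    if segment_width <= 0:
        raise ValueError("segment_width must be positive")
    return [
        (start, min(start + segment_width, line_width))
        for start in range(0, line_width, segment_width)
    ]
```

With a narrow segment width, a single character may span several segments, which is consistent with each candidate text segment covering at least part of a character.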

According to an embodiment of the present disclosure, the apparatus further includes a text recognition result combination module configured to: combine the text recognition results associated with at least two text images according to a temporal relationship between the at least two text images in the image sequence, to obtain a text recognition result for the image sequence.

It should be noted that the collection, storage, use, processing, transmission, provision, and disclosure of the information involved in the technical solutions of the present disclosure all comply with relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. The electronic device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. Various programs and data required for the operation of the device 600 can also be stored in the RAM 603. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 601 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running deep learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the methods and processes described above, such as the text recognition method. For example, in some embodiments, the text recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the text recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the text recognition method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A text recognition method, comprising:

determining a text line to be recognized of a first text image in an image sequence;

segmenting the text line to be recognized to obtain candidate text segments;

determining second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence; and

recognizing the second valid text segments to obtain a text recognition result of the first text image.

2. The method of claim 1, wherein the determining second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence comprises:

determining the first valid text segments that repeatedly appear among the candidate text segments, to obtain third valid text segments; and

deleting the third valid text segments from the candidate text segments to obtain the second valid text segments.

3. The method of claim 2, wherein the determining the first valid text segments that repeatedly appear among the candidate text segments, to obtain third valid text segments, comprises:

determining a first image feature of the first valid text segment and a second image feature of the candidate text segment;

obtaining a matching calculation result between the first image feature and the second image feature; and

determining the third valid text segment according to the matching calculation result,

wherein the feature matching degree indicated by the matching calculation result corresponding to the third valid text segment is higher than a preset threshold.

4. The method of claim 3, wherein the obtaining a matching calculation result between the first image feature and the second image feature comprises:

in a case where the second text image comprises M first valid text segments sorted by region coordinates and the first text image comprises N candidate text segments sorted by region coordinates, obtaining a matching calculation result between a first image feature of the Mth first valid text segment and second image features of the first N candidate text segments;

and the determining the third valid text segment according to the matching calculation result comprises:

in a case where the matching calculation result indicates that the feature matching degree between the Mth first valid text segment and the nth candidate text segment is higher than the preset threshold, determining whether the feature matching degrees between the first n-1 candidate text segments and the corresponding mth first valid text segments are higher than the preset threshold; and

in a case where the corresponding feature matching degrees are higher than the preset threshold, taking the first n candidate text segments as the third valid text segments,

wherein M and N are integers greater than 1, n is an integer and n ∈ [1, N], and m is a positive integer and m ∈ {M-(n-1), ..., M-1}.

5. The method of claim 3, wherein the recognizing the second valid text segments to obtain a text recognition result of the first text image comprises:

performing serialization encoding on third image features of the second valid text segments to obtain a base encoded sequence;

adding first direction information to the base encoded sequence to obtain a first encoded sequence;

adding second direction information to the base encoded sequence to obtain a second encoded sequence; and

performing text recognition based on the first encoded sequence and the second encoded sequence to obtain the text recognition result,

wherein the first direction information indicates a direction that is the same as a distribution direction of the second valid text segments, and the second direction information indicates a direction opposite to the distribution direction.

6. The method of claim 1, wherein the determining a text line to be recognized of a first text image in an image sequence comprises:

performing text detection on the first text image to obtain a text detection result, wherein the text detection result comprises bounding box coordinate information used to frame a text image region in the first text image; and

determining the text line to be recognized according to the bounding box coordinate information.

7. The method of claim 1, wherein the segmenting the text line to be recognized to obtain candidate text segments comprises:

segmenting the text line to be recognized based on a preset pixel scale to obtain the candidate text segments, wherein each candidate text segment corresponds to a text image region in which at least part of a character is located.

8. The method of any one of claims 1 to 7, further comprising:

combining the text recognition results associated with at least two text images according to a temporal relationship between the at least two text images in the image sequence, to obtain a text recognition result for the image sequence.

9. A text recognition apparatus, comprising:

a to-be-recognized text line determination module configured to determine a text line to be recognized of a first text image in an image sequence;

a candidate text segment determination module configured to segment the text line to be recognized to obtain candidate text segments;

a second valid text segment determination module configured to determine second valid text segments among the candidate text segments according to first valid text segments in a second text image adjacent to the first text image in the image sequence; and

a text recognition module configured to recognize the second valid text segments to obtain a text recognition result of the first text image.

10. The apparatus of claim 9, wherein the second valid text segment determination module comprises:

a third valid text segment determination sub-module configured to determine the first valid text segments that repeatedly appear among the candidate text segments, to obtain third valid text segments; and

a third valid text segment deletion sub-module configured to delete the third valid text segments from the candidate text segments to obtain the second valid text segments.

11. The apparatus of claim 10, wherein the third valid text segment determination sub-module comprises:

an image feature determination unit configured to determine a first image feature of the first valid text segment and a second image feature of the candidate text segment;

a feature matching degree calculation unit configured to obtain a matching calculation result between the first image feature and the second image feature; and

a third valid text segment determination unit configured to determine the third valid text segment according to the matching calculation result,

wherein the feature matching degree indicated by the matching calculation result corresponding to the third valid text segment is higher than a preset threshold.

12. The apparatus of claim 11, wherein the feature matching degree calculation unit is configured to:

in a case where the second text image comprises M first valid text segments sorted by region coordinates and the first text image comprises N candidate text segments sorted by region coordinates, obtain a matching calculation result between a first image feature of the Mth first valid text segment and second image features of the first N candidate text segments;

and the third valid text segment determination unit comprises:

a feature matching degree calculation subunit configured to, in a case where the matching calculation result indicates that the feature matching degree between the Mth first valid text segment and the nth candidate text segment is higher than the preset threshold, determine whether the feature matching degrees between the first n-1 candidate text segments and the corresponding mth first valid text segments are higher than the preset threshold; and

a third valid text segment determination subunit configured to, in a case where the corresponding feature matching degrees are higher than the preset threshold, take the first n candidate text segments as the third valid text segments,

wherein M and N are integers greater than 1, n is an integer and n ∈ [1, N], and m is a positive integer and m ∈ {M-(n-1), ..., M-1}.

13. The apparatus of claim 11, wherein the text recognition module comprises:

a serialization encoding sub-module configured to perform serialization encoding on third image features of the second valid text segments to obtain a base encoded sequence;

a first encoded sequence determination sub-module configured to add first direction information to the base encoded sequence to obtain a first encoded sequence;

a second encoded sequence determination sub-module configured to add second direction information to the base encoded sequence to obtain a second encoded sequence; and

a text recognition sub-module configured to perform text recognition based on the first encoded sequence and the second encoded sequence to obtain the text recognition result,

wherein the first direction information indicates a direction that is the same as a distribution direction of the second valid text segments, and the second direction information indicates a direction opposite to the distribution direction.

14. The apparatus of claim 9, wherein the to-be-recognized text line determination module comprises:

a text detection sub-module configured to perform text detection on the first text image to obtain a text detection result, wherein the text detection result comprises bounding box coordinate information used to frame a text image region in the first text image; and

a to-be-recognized text line determination sub-module configured to determine the text line to be recognized according to the bounding box coordinate information.

15. The apparatus of claim 9, wherein the candidate text segment determination module is configured to:

segment the text line to be recognized based on a preset pixel scale to obtain the candidate text segments, wherein each candidate text segment corresponds to a text image region in which at least part of a character is located.

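16. The apparatus according to any one of claims 9 to 15, further comprising a text recognition result combination module configured to:

combine the text recognition results associated with at least two text images according to a temporal relationship between the at least two text images in the image sequence, to obtain a text recognition result for the image sequence.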

17. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text recognition method of any one of claims 1-8.

18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text recognition method according to any one of claims 1 to 8.

19. A computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program when executed by a processor implementing a text recognition method according to any one of claims 1 to 8.