WO2021189889A1 - Text detection method, apparatus, computer device and storage medium for scene images - Google Patents

Text detection method, apparatus, computer device and storage medium for scene images

Info

Publication number
WO2021189889A1
WO2021189889A1 (PCT/CN2020/131604)
Authority
WO
WIPO (PCT)
Prior art keywords
text
confidence
text prediction
pixels
prediction box
Prior art date
Application number
PCT/CN2020/131604
Other languages
English (en)
French (fr)
Inventor
高远
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021189889A1 publication Critical patent/WO2021189889A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Definitions

  • This application relates to the field of image processing technology, and in particular to a text detection method, apparatus and computer device for scene images.
  • Character recognition based on computer vision is of great significance in the current era of big data. It is the basis for the realization of many intelligent functions (such as recommendation systems, machine translation, etc.).
  • Text detection is a prerequisite of the text recognition process, and its detection accuracy has a significant impact on the effect of text recognition.
  • The inventor has realized that in complex natural scenes, text appears in many different positions, takes diverse layout forms, runs in inconsistent directions and may mix multiple languages; the task of text detection is therefore extremely challenging.
  • Traditional technology includes a text detection algorithm known as CTPN, which first segments the complete text, detects the segments and then merges them; this is both inaccurate and time-consuming. On this basis, a method known as EAST (an efficient and accurate scene text detector) was proposed, which uses an FCN architecture for feature extraction and learning and performs end-to-end training and optimization directly, eliminating unnecessary intermediate steps. However, in practice the width of the text prediction box finally obtained by EAST may not match the actual text in the scene.
  • This application aims to solve the technical problem that the recognition accuracy of the existing EAST algorithm cannot meet actual use requirements.
  • In a first aspect, an embodiment of the present application provides a text detection method for scene images, including: training and optimizing a fully convolutional network model; detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model; screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box; calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box; calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle; when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle; cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and recognizing the text in the text image to be recognized.
  • In a second aspect, an embodiment of the present application provides a text detection apparatus for scene images, including:
  • a training unit for training and optimizing the fully convolutional network model;
  • a text prediction box detection unit for detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
  • a screening unit for screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
  • a minimum bounding rectangle determination unit for calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box; an overlap calculation unit for calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle; an adjustment unit for adjusting the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold; a cutting unit for cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and a text recognition unit for recognizing the text information in the text image to be recognized.
  • In a third aspect, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the steps of the above text detection method when executing the computer program.
  • In a fourth aspect, the embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the operations of the above text detection method.
  • On the basis of text detection with the EAST method, the text detection method provided by the embodiments of the present application can correct and adjust the width of the text prediction box through high-confidence regions, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
  • FIG. 1 is a schematic structural diagram of a computer device provided by an embodiment of this application.
  • FIG. 2 is a schematic flowchart of a method for text detection of scene images according to an embodiment of the application
  • FIG. 3 is a schematic flowchart of step 20 in FIG. 2;
  • FIG. 4 is a schematic diagram of the process of screening the smallest bounding rectangle provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of a text detection device for scene images provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of a text detection device for scene images provided by another embodiment of the application.
  • The embodiments of this application first provide a text detection method for scene images.
  • On the basis of text detection with the EAST method, the text detection method for scene images provided by this application can adjust the width of the text detection box through high-confidence regions, achieving more accurate text recognition.
  • FIG. 1 is a schematic structural diagram of a computer device 100 according to an embodiment of the present application.
  • The computer device 100 may be a computer, a computer cluster, a mainframe, a computing device dedicated to providing online content, or a computer network including a group of computers operating in a centralized or distributed manner.
  • the computer device 100 includes: a processor 102, a memory, and a network interface 105 connected through a system bus 101; wherein, the memory may include a non-volatile storage medium 103 and an internal memory 104.
  • The processor 102 may be a central processing unit (CPU); the processor 102 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • The number of processors 102 may be one or more, and the one or more processors 102 can execute sequences of computer program instructions to perform the text detection method for scene images described in more detail below.
  • The computer program instructions are stored in, accessed at, and read from the non-volatile storage medium 103 for execution by the processor 102, thereby implementing the text detection method disclosed in the following embodiments of the present application.
  • The non-volatile storage medium 103 stores a software application that executes the method described below.
  • the nonvolatile storage medium 103 may store the entire software application or only a part of the software application executable by the processor 102. It should be noted that although only one block is shown in FIG. 1, the non-volatile storage medium 103 may include multiple physical devices installed on a central processing device or different computing devices.
  • the network interface 105 is used for network communication, such as providing data information transmission.
  • the structure shown in FIG. 1 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 100 to which the solution of the present application is applied.
  • the specific computer device 100 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the text detection method of the scene image disclosed in the embodiments of the present application.
  • the computer program product is embodied on one or more computer-readable storage media (including but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer program code.
  • In the case where the computer device 100 is implemented in software, FIG. 2 shows a schematic flowchart of the text detection method for scene images of one embodiment; the method in FIG. 2 is described in detail below. Please refer to FIG. 2.
  • the method includes the following steps:
  • Step 20 Train and optimize the full convolutional network model.
  • The fully convolutional network model is a kind of neural network model. Before use, it must be trained offline on training data to determine the connection weight parameters between its neurons.
  • the step 20 specifically includes the following steps:
  • Step 200 Construct a fully convolutional network model.
  • The network structure of the fully convolutional network model can be decomposed into three parts: a feature extraction layer, feature merging, and an output layer.
  • The feature extraction layer uses a general convolutional network as the base network. During training, the parameters of the convolutional network are initialized and features are extracted; after training, optimized convolutional network parameters are obtained. In practice, a base network such as PVANet (a network trading off performance vs. accuracy) or VGG16 (Visual Geometry Group 16) can be selected according to actual needs.
  • Four levels of feature maps can be obtained through the convolutional network, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image data, respectively.
  • A large receptive field is required to locate large text, while a small receptive field is required to locate small text regions; using the above feature maps of different levels therefore meets the requirements of natural scenes, where text-region sizes differ greatly.
  • The feature maps of the above four levels are then merged layer by layer following a U-shaped scheme, reducing later computational overhead; the layer-by-layer merge formulas are given in the detailed description below.
  • Finally, the output layer outputs a text score feature map and a geometric feature map whose size is 1/4 of the original image; the text score feature map has 1 channel and the geometric feature map has 5 channels. The text score feature map represents the confidence that each pixel belongs to a text prediction box.
  • Step 202 Annotate training tags, and construct a training data set.
  • any existing suitable method can be used to complete the labeling of the training labels, which is used as a training data set to train the full convolutional network model.
  • Step 204 Train and optimize the full convolutional network model through the training data set and the preset loss function.
  • Training optimization is the process of learning and optimizing the parameters of the full convolutional network model. When the parameter optimization is completed, the fully trained fully convolutional network model can be applied to text detection in actual scenes.
  • In addition to the labeled training data, the optimization process also needs a suitable loss function to evaluate the effect of the fully convolutional network model, achieving parameter optimization by minimizing the loss.
  • The loss function can be expressed by the following formula: L = Ls + λg·Lg, where L is the loss function, Ls is the loss of the text score feature map, Lg is the loss of the geometric feature map, and λg represents the relative importance of the two losses and can be set to 1.
  • The loss of the text score feature map can be calculated using class-balanced cross-entropy, while the loss of the geometric feature map can be calculated using an intersection-over-union (IoU) loss function.
  • Step 22 Detect and determine several text prediction boxes in the scene image through the trained full convolutional network model.
  • Through the trained fully convolutional network model, the text prediction boxes in the scene image to be detected, i.e. the regions of the scene image that contain text, can be determined.
  • the output layer of the full convolutional network model may include a text score feature map and a geometric figure feature map.
  • The text score feature map records, for each pixel mapped to the image to be detected, the probability that the pixel belongs to a text prediction box.
  • The geometric feature map records, for each pixel mapped to the image to be detected, the distance between the pixel and the text prediction box.
  • The fully convolutional network model usually outputs a large number of candidate text prediction boxes. Therefore, in a preferred embodiment, a non-maximum suppression algorithm can also be applied to eliminate redundant text prediction boxes and determine the positions of the best text prediction boxes; these best boxes are the text prediction boxes referred to in the embodiments of this application.
  • In this embodiment, a scene picture can be interpreted as a picture taken in a real scene, for example a picture captured through the viewfinder of any suitable camera-equipped terminal.
  • Step 24 Filter pixels with a confidence greater than a preset confidence threshold in the text prediction box as high-confidence pixels.
  • The confidence is the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box. That is, the text score feature map gives the confidence of each pixel, reflecting how likely text prediction boxes are to exist at different positions. In this step, a suitable screening method is used to select high-confidence pixels for the further adjustment and optimization of the text prediction box.
  • For example, the confidence threshold may be set to 0.7; it is then determined in turn whether each pixel in the text score feature map exceeds the confidence threshold. If so, the pixel is determined to be a high-confidence pixel; if not, the pixel is discarded.
  • Step 26 Calculate the minimum bounding rectangle corresponding to the text prediction box according to the high-confidence pixel points.
  • The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixels of one text prediction box. It represents the rectangular region given by the high-confidence pixels of the same text prediction box, and is the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box.
  • any suitable algorithm can be used to calculate and determine the minimum bounding rectangle of each text prediction box.
  • In some embodiments, this may specifically include the following steps: first, the two high-confidence pixels farthest apart are determined as the length calibration pixels; then, taking the line connecting the length calibration pixels as a first direction, the two high-confidence pixels farthest apart along a second direction perpendicular to the first direction are determined as the width calibration pixels; finally, the first line segments passing through the length calibration pixels and perpendicular to the line connecting them, together with the second line segments passing through the width calibration pixels and perpendicular to the line connecting them, enclose the minimum bounding rectangle.
  • Step 28 Calculate the degree of overlap between the text prediction box and the corresponding minimum enclosing rectangle.
  • The degree of overlap, also called the intersection-over-union, characterizes how well the text prediction box and the corresponding minimum bounding rectangle coincide. It is calculated as the ratio of the area of the intersection of the two boxes to the area of their union; the higher the overlap, the better the two boxes match.
  • The degree of overlap between the text prediction box and the corresponding minimum bounding rectangle can be calculated through the following steps: the pixels inside both the text prediction box and the minimum bounding rectangle are taken as first pixels, and the pixels belonging only to the text prediction box or only to the minimum bounding rectangle are taken as second pixels; the ratio of the number of first pixels to the sum of the numbers of first and second pixels is then calculated as the degree of overlap.
  • Step 30 When the overlap degree is greater than a preset overlap degree threshold, adjust the width of the text prediction frame through the minimum enclosing rectangle.
  • The overlap threshold is an empirical value that can be set by a technician according to the actual situation.
  • The width of the minimum bounding rectangle is usually smaller than the width of the text prediction box, which means that the area inside the minimum bounding rectangle is more likely to belong to the text region. Therefore, the text prediction box can be adjusted appropriately through the minimum bounding rectangle, reducing its width accordingly.
  • Specifically, the text prediction box is adjusted by the following formula: P1 = w*p + (1-w)*d, where P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
  • With a suitable value of w, the width of the text prediction box can be corrected toward the narrower effective minimum bounding rectangle, so that the width is reliably reduced and more accurate text recognition is achieved.
  • Step 32 Cut the adjusted text prediction box in the scene image to obtain a text image to be recognized.
  • The adjusted text prediction boxes indicate where the text is located in the scene image. These text prediction boxes can therefore be cut out of the scene image and used as the text images to be recognized.
  • Step 34 Recognize the text information in the text image to be recognized.
  • Applying the text detection method provided by the embodiments of the present application can reliably reduce the width of the text prediction box, realize more accurate text recognition, reduce the difficulty of subsequent processing, and improve the accuracy of text detection.
  • Since the minimum bounding rectangle serves as the standard for the final adjustment of the width of the text detection box, it is necessary to ensure that the minimum bounding rectangle is reliable; otherwise the subsequent adjustment may have undesirable consequences.
  • In some embodiments, before step 28, the method may further include the step of screening the minimum bounding rectangles shown in FIG. 4:
  • Step 401 Calculate the confidence average value of the high-confidence pixel points in the minimum bounding rectangle.
  • The confidence average refers to the mean confidence of these high-confidence pixels and represents the probability that the minimum bounding rectangle as a whole belongs to a text region.
  • Step 402 Determine whether the average confidence level is less than a preset screening threshold. If yes, go to step 403. If not, go to step 404.
  • Step 403: Eliminate the minimum bounding rectangle. Minimum bounding rectangles with a low average confidence do not actually have a high reliability or probability of belonging to text and are not sufficient as a correction standard, so they are discarded.
  • Step 404: Retain the minimum bounding rectangle as an effective minimum bounding rectangle. These effective minimum bounding rectangles can be used in the next processing step as a reference for adjusting the text detection box.
  • The embodiment of the present application also provides a text detection apparatus corresponding to the text detection method for scene images of the above embodiments.
  • As shown in FIG. 5, the text detection apparatus 500 includes: a training unit 50, a text prediction box detection unit 52, a screening unit 54, a minimum bounding rectangle determination unit 56, an overlap calculation unit 58, an adjustment unit 60, a cutting unit 62, and a text recognition unit 64.
  • the training unit 50 is used to train and optimize the full convolutional network model.
  • The text prediction box detection unit 52 is configured to detect and determine several text prediction boxes in the scene image through the trained fully convolutional network model. The screening unit 54 is configured to screen, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box. The minimum bounding rectangle determination unit 56 is configured to calculate, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box. The overlap calculation unit 58 is configured to calculate the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle.
  • The adjustment unit 60 is configured to adjust the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold.
  • the cutting unit 62 is configured to cut the adjusted text prediction box in the scene image to obtain a text image to be recognized.
  • the text recognition unit 64 is used to recognize the text information in the text image to be recognized.
  • On the basis of text detection with the EAST method, the text detection apparatus for scene images provided by the embodiments of the present application can correct and adjust the width of the text prediction box through high-confidence regions, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
  • the text detection apparatus 500 may further include a confidence calculation unit 66 and a minimum bounding rectangle screening unit 68.
  • the confidence calculation unit 66 is used to calculate the confidence average value of the high-confidence pixel points within the minimum bounding rectangle.
  • the minimum circumscribed rectangle screening unit 68 is configured to eliminate the minimum circumscribed rectangle when the average confidence value is less than a preset screening threshold.
  • the minimum bounding rectangle (MBR, minimum bounding rectangle) is expressed in two-dimensional coordinates and is the maximum range of high-confidence pixels in the same text prediction box. It represents a rectangular area given by high-confidence pixels of the same text prediction box.
  • The minimum bounding rectangle can be determined or calculated in any suitable manner. Calculating the minimum bounding rectangle of a known set of pixels is well known to those skilled in the art and is not repeated here.
  • Applying the text detection device for scene images provided by the embodiments of the present application can reliably reduce the width of the text prediction box, and achieve more accurate text recognition.
  • the disclosed equipment, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other divisions, and units with the same function may be combined into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

A text detection method, apparatus and computer device for scene images. The method includes: detecting and determining several text prediction boxes in the scene image through a trained fully convolutional network model (22); screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels (24); calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box (26); when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle (30); cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized (32); and recognizing the text information in the text image to be recognized (34). On the basis of text detection with the EAST method, the method can correct and adjust the width of the text prediction box through high-confidence regions, so that the width is reliably reduced and more accurate text recognition is achieved.

Description

Text detection method, apparatus, computer device and storage medium for scene images
This application claims priority to the Chinese patent application with application number 202010223195.1, entitled "Text detection method, apparatus and computer device for scene images", filed with the Chinese Patent Office on March 26, 2020, the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and in particular to a text detection method, apparatus and computer device for scene images.
Background
Character recognition based on computer vision is of great significance in the current era of big data. It is the foundation of many intelligent functions (such as recommendation systems and machine translation). Text detection, as a prerequisite of the character recognition process, has a significant impact on the effect of character recognition through its detection accuracy.
The inventor has realized that in complex natural scenes, text is distributed in many different positions, takes diverse layout forms, runs in inconsistent directions and may mix multiple languages, which makes the text detection task extremely challenging.
The traditional technology includes a text detection algorithm known as CTPN, which realizes text detection in natural scenes based on the idea of first segmenting the complete text, detecting the segments and then merging them. Detecting text by segmentation and re-merging is, on the one hand, inaccurate and, on the other hand, excessively time-consuming, leading to a poor user experience. On this basis, a text detection method known as EAST (an efficient and accurate scene text detector) has been proposed. It uses an FCN architecture for feature extraction and learning, and performs end-to-end training and optimization directly, eliminating unnecessary intermediate steps.
However, many limitations remain in the practical application of EAST, and actual use requirements cannot be well satisfied. For example, the width of the finally obtained text prediction box does not match the actual text in the scene, so the traditional technology needs further improvement on the basis of the practical application of EAST.
Summary
This application aims to solve the technical problem that the recognition accuracy of the existing EAST algorithm cannot meet actual use requirements.
To solve the above technical problem, in a first aspect, an embodiment of this application provides a text detection method for scene images, including: training and optimizing a fully convolutional network model;
detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model; screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box; calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box; calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle; when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle; cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and recognizing the text in the text image to be recognized.
In a second aspect, an embodiment of this application provides a text detection apparatus for scene images, including:
a training unit for training and optimizing a fully convolutional network model; a text prediction box detection unit for detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model; a screening unit for screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box; a minimum bounding rectangle determination unit for calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box; an overlap calculation unit for calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle; an adjustment unit for adjusting the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold; a cutting unit for cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and a text recognition unit for recognizing the text information in the text image to be recognized.
In a third aspect, an embodiment of this application further provides a computer device, including a memory, a processor and a computer program stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer program:
training and optimizing a fully convolutional network model;
detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box;
calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle;
when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle;
cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized;
recognizing the text information in the text image to be recognized.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following operations:
training and optimizing a fully convolutional network model;
detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box;
calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle;
when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle;
cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized;
recognizing the text information in the text image to be recognized.
On the basis of text detection with the EAST method, the text detection method provided by the embodiments of this application can correct and adjust the width of the text prediction box through high-confidence regions, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a computer device provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of a text detection method for scene images provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of step 20 in FIG. 2;
FIG. 4 is a schematic flowchart of screening minimum bounding rectangles provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a text detection apparatus for scene images provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a text detection apparatus for scene images provided by another embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
It should be understood that when used in this specification and the appended claims, the terms "including" and "comprising" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of this application first provide a text detection method for scene images. On the basis of text detection with the EAST method, the text detection method for scene images provided by this application can adjust the width of the text detection box through high-confidence regions, achieving more accurate text recognition.
The hardware environment of the method is introduced first. Please refer to FIG. 1, which is a schematic structural diagram of a computer device 100 provided by an embodiment of this application. The computer device 100 may be a computer, a computer cluster, a mainframe, a computing device dedicated to providing online content, or a computer network including a group of computers operating in a centralized or distributed manner.
As shown in FIG. 1, the computer device 100 includes a processor 102, a memory and a network interface 105 connected through a system bus 101, where the memory may include a non-volatile storage medium 103 and an internal memory 104.
In the embodiments of this application, depending on the type of hardware used, the processor 102 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The number of processors 102 may be one or more, and the one or more processors 102 can execute sequences of computer program instructions to perform the text detection method for scene images described in more detail below.
The computer program instructions are stored in, accessed at, and read from the non-volatile storage medium 103 for execution by the processor 102, thereby implementing the text detection method disclosed in the following embodiments of this application. For example, the non-volatile storage medium 103 stores a software application that executes the method described below. Furthermore, the non-volatile storage medium 103 may store the entire software application or only the part of the software application executable by the processor 102. It should be noted that although only one block is shown in FIG. 1, the non-volatile storage medium 103 may include multiple physical devices installed on a central processing device or on different computing devices.
The network interface 105 is used for network communication, such as the transmission of data information. Those skilled in the art can understand that the structure shown in FIG. 1 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device 100 to which the solution of this application is applied; the specific computer device 100 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The embodiments of this application also provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the text detection method for scene images disclosed in the embodiments of this application. The computer program product is embodied on one or more computer-readable storage media containing computer program code (including but not limited to disk storage, CD-ROM, optical storage, etc.).
In the case where the computer device 100 is implemented in software, FIG. 2 shows a schematic diagram of the text detection method for scene images of one embodiment; the method in FIG. 2 is described in detail below. Please refer to FIG. 2. The method includes the following steps:
Step 20: Train and optimize the fully convolutional network model.
The fully convolutional network model is a kind of neural network model. Before use, it must be trained offline on training data to determine the connection weight parameters between its neurons.
In some embodiments, as shown in FIG. 3, step 20 specifically includes the following steps:
Step 200: Construct a fully convolutional network model.
In this step, feature extraction is performed on the image data of the input scene picture through the fully convolutional network model, finally generating a single-channel pixel-level text score feature map and a multi-channel geometric feature map. Specifically, the network structure of the fully convolutional network model can be decomposed into three parts: a feature extraction layer, feature merging, and an output layer.
First, the feature extraction layer uses a general convolutional network as the base network. During training, the parameters of the convolutional network are initialized and features are extracted; after training, optimized convolutional network parameters are obtained. In practical applications, a base network such as PVANet (a network trading off performance vs. accuracy) or VGG16 (Visual Geometry Group 16) can be chosen according to actual needs. In the embodiments of this application, four levels of feature maps can be obtained through this convolutional network, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image data, respectively. Since locating large text requires a large receptive field while locating small text regions requires a correspondingly small receptive field, using the above feature maps of different levels satisfies the requirements of natural scenes, where text-region sizes vary greatly.
Secondly, the feature maps of the above four levels are merged layer by layer following a U-shaped scheme, which reduces later computational overhead. The layer-by-layer merging can be expressed by the following formulas:
g_i = unpool(h_i), for i ≤ 3;  g_i = conv3×3(h_i), for i = 4
h_i = f_i, for i = 1;  h_i = conv3×3(conv1×1([g_(i-1); f_i])), for i > 1
where f_i is the i-th level feature map from the feature extraction layer, h_i is the merged feature map of stage i, and [·; ·] denotes channel-wise concatenation.
The specific process of the above formulas is as follows: in each merging stage, the feature map from the previous stage is first fed into an unpooling layer (unpool) to enlarge its size, and is then concatenated with the feature map of the current level. Finally, convolution layers are applied: a conv1×1 layer reduces the number of channels and the amount of computation, and a conv3×3 layer fuses the local information to produce the output of the merging stage. After the last merging stage (i.e., i = 4), a conv3×3 layer generates the final feature map of the merging branch and feeds it to the output layer.
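For illustration only (this sketch is not part of the application): one merging stage of the U-shaped scheme above might look as follows in PyTorch, where the module name, the bilinear interpolation used as the unpooling, and the ReLU activations are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeStage(nn.Module):
    """One merging stage: unpool the previous stage's map, concatenate it
    with the current-level feature map, then conv1x1 + conv3x3."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # in_channels = channels of the unpooled map + channels of the current map
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.conv3x3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, prev: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        g = F.interpolate(prev, scale_factor=2, mode="bilinear",
                          align_corners=False)   # unpool: enlarge the previous map
        h = torch.cat([g, feat], dim=1)          # merge with the current feature map
        h = F.relu(self.conv1x1(h))              # reduce channels and computation
        h = F.relu(self.conv3x3(h))              # fuse local information
        return h
```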
Finally, the output layer outputs a text score feature map and a geometric feature map whose size is 1/4 of the original image; the text score feature map has 1 channel and the geometric feature map has 5 channels. The text score feature map represents the confidence that each pixel belongs to a text prediction box.
Step 202: Annotate training labels and construct a training data set.
In this step, any existing suitable method can be used to annotate the training labels, which serve as the training data set for training the fully convolutional network model. In some cases, an existing training data set can also be used directly for training or testing.
Step 204: Train and optimize the fully convolutional network model through the training data set and a preset loss function.
Training optimization is the process of learning and optimizing the parameters of the fully convolutional network model. Once parameter optimization is complete, the trained fully convolutional network model can be applied to text detection in actual scenes.
In addition to the annotated training data, the optimization process also needs a suitable loss function to evaluate the effect of the fully convolutional network model; parameter optimization is achieved by minimizing the loss.
In this application, the loss function can be expressed by the following formula:
L = Ls + λg·Lg
where L is the loss function, Ls is the loss of the text score feature map, Lg is the loss of the geometric feature map, and λg represents the relative importance of the two losses and can be set to 1.
Specifically, the loss of the text score feature map can be calculated using class-balanced cross-entropy, while the loss of the geometric feature map can be calculated using an intersection-over-union (IoU) loss function.
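Purely as a hedged sketch of how such a composite loss might be assembled (assuming an EAST-style geometry map with four distance channels and one angle channel, PyTorch tensors, and hypothetical function names):

```python
import torch

def balanced_bce(score_pred, score_gt, eps=1e-6):
    """Class-balanced cross-entropy for the 1-channel text score map."""
    beta = 1.0 - score_gt.mean()  # weight of the (rarer) positive class
    loss = -(beta * score_gt * torch.log(score_pred + eps)
             + (1.0 - beta) * (1.0 - score_gt) * torch.log(1.0 - score_pred + eps))
    return loss.mean()

def iou_geometry_loss(geo_pred, geo_gt, score_gt, lam_theta=10.0, eps=1e-6):
    """IoU loss on the four distance channels plus a cosine loss on the
    angle channel, averaged over text pixels only (score_gt == 1)."""
    d1p, d2p, d3p, d4p, thp = torch.split(geo_pred, 1, dim=1)  # top/bottom/left/right/angle
    d1g, d2g, d3g, d4g, thg = torch.split(geo_gt, 1, dim=1)
    area_p = (d1p + d2p) * (d3p + d4p)
    area_g = (d1g + d2g) * (d3g + d4g)
    h_i = torch.min(d1p, d1g) + torch.min(d2p, d2g)
    w_i = torch.min(d3p, d3g) + torch.min(d4p, d4g)
    inter = h_i * w_i
    union = area_p + area_g - inter
    loss_iou = -torch.log((inter + 1.0) / (union + 1.0))
    loss_theta = 1.0 - torch.cos(thp - thg)
    loss = loss_iou + lam_theta * loss_theta
    return (loss * score_gt).sum() / (score_gt.sum() + eps)

def total_loss(score_pred, score_gt, geo_pred, geo_gt, lam_g=1.0):
    """L = Ls + lambda_g * Lg, with lambda_g set to 1 as in the text."""
    return balanced_bce(score_pred, score_gt) + \
           lam_g * iou_geometry_loss(geo_pred, geo_gt, score_gt)
```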
Step 22: Detect and determine several text prediction boxes in the scene image through the trained fully convolutional network model.
Through the trained fully convolutional network model, the text prediction boxes in the scene image to be detected, i.e. the regions of the scene image that contain text, can be determined.
As described above, the output layer of the fully convolutional network model can include a text score feature map and a geometric feature map. The text score feature map records, for each pixel mapped to the image to be detected, the probability that the pixel belongs to a text prediction box; the geometric feature map records, for each pixel mapped to the image to be detected, the distance between the pixel and the text prediction box.
The fully convolutional network model usually outputs a large number of candidate text prediction boxes. Therefore, in a preferred embodiment, a non-maximum suppression algorithm can also be applied to eliminate redundant text prediction boxes and determine the positions of the best text prediction boxes; these best boxes are the text prediction boxes of the embodiments of this application.
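As an illustration of the elimination principle only, a plain non-maximum suppression over axis-aligned candidate boxes might be sketched in NumPy as below; note that EAST itself is usually paired with a locality-aware variant, which this sketch does not implement:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.3) -> list:
    """Plain non-maximum suppression over (x1, y1, x2, y2) boxes: keep the
    highest-scoring box and drop candidates that overlap it too much."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # keep only weakly overlapping candidates
    return keep
```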
In this embodiment, a scene picture can be interpreted as a picture taken in a real scene, for example a picture captured through the viewfinder of any suitable camera-equipped terminal.
Step 24: Screen, within the text prediction box, the pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels.
Here, the confidence is the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box. That is, the text score feature map gives the confidence of each pixel, reflecting how likely text prediction boxes are to exist at different positions. In this step, a suitable screening method is used to select higher-confidence pixels, which can be used for the further adjustment and optimization of the text prediction box.
Specifically, high-confidence pixels can be screened from the text score feature map by setting a suitable confidence threshold. For example, the confidence threshold can be set to 0.7; it is then determined in turn whether each pixel of the text score feature map exceeds the confidence threshold. If so, the pixel is determined to be a high-confidence pixel; if not, the pixel is discarded.
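A minimal NumPy sketch of this screening step (the function name is hypothetical; 0.7 is the example threshold mentioned above):

```python
import numpy as np

def high_confidence_pixels(score_map: np.ndarray, conf_thresh: float = 0.7) -> np.ndarray:
    """Return the (row, col) coordinates of all pixels in the text score
    map whose confidence exceeds the preset threshold."""
    ys, xs = np.nonzero(score_map > conf_thresh)
    return np.stack([ys, xs], axis=1)
```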
A single image to be detected may contain several different text prediction boxes, so these high-confidence pixels may belong to different text boxes in the scene. Accordingly, to avoid adjustment or correction errors, the high-confidence pixels need to be marked and distinguished. Specifically, the text prediction box to which a pixel belongs can be determined from the pixel's position, so that the high-confidence pixels are classified into their corresponding text prediction boxes.
Step 26: Calculate, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixels of one text prediction box. It represents the rectangular region given by the high-confidence pixels of the same text prediction box, and is the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box.
Any suitable algorithm can be used to calculate and determine the minimum bounding rectangle of each text prediction box.
In some embodiments, this can specifically include the following steps:
First, the two high-confidence pixels farthest apart are determined as the length calibration pixels.
Then, taking the line connecting the length calibration pixels as a first direction, the two high-confidence pixels farthest apart along a second direction perpendicular to the first direction are determined as the width calibration pixels.
Finally, the first line segments passing through the length calibration pixels and perpendicular to the line connecting the length calibration pixels, together with the second line segments passing through the width calibration pixels and perpendicular to the line connecting the width calibration pixels, enclose the minimum bounding rectangle.
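A NumPy sketch of this construction, assuming the high-confidence pixels of one prediction box are given as an N×2 coordinate array with N ≥ 2 (the brute-force farthest-pair search and the function name are illustrative):

```python
import numpy as np
from itertools import combinations

def min_bounding_rectangle(points: np.ndarray) -> np.ndarray:
    """Oriented bounding rectangle per the construction above: the farthest
    pair of pixels fixes the length direction, and the extreme pixels along
    the perpendicular direction fix the width. Returns 4 corner coordinates."""
    # 1. farthest pair -> length calibration pixels (brute force is fine
    #    for the modest number of pixels in one prediction box)
    best, best_d = (0, 1), -1.0
    for i, j in combinations(range(len(points)), 2):
        d = np.sum((points[i] - points[j]) ** 2)
        if d > best_d:
            best_d, best = d, (i, j)
    a, b = points[best[0]].astype(float), points[best[1]].astype(float)
    u = (b - a) / np.linalg.norm(b - a)   # first (length) direction
    v = np.array([-u[1], u[0]])           # second (width) direction
    # 2. extreme projections give the rectangle's extent in (u, v) coordinates
    pu, pv = points @ u, points @ v
    lo_u, hi_u = pu.min(), pu.max()
    lo_v, hi_v = pv.min(), pv.max()
    # 3. enclose the rectangle from the four bounding lines
    corners = [lo_u * u + lo_v * v, hi_u * u + lo_v * v,
               hi_u * u + hi_v * v, lo_u * u + hi_v * v]
    return np.array(corners)
```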
Step 28: Calculate the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle.
The degree of overlap (IoU), also called the intersection-over-union, characterizes how well the text prediction box and the corresponding minimum bounding rectangle coincide. It is calculated as the ratio of the area of the intersection of the two boxes to the area of their union; the higher the overlap, the better the two boxes match.
In some embodiments, the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle can specifically be calculated through the following steps:
First, the pixels inside both the text prediction box and the minimum bounding rectangle are determined as first pixels, and the pixels belonging only to the text prediction box or only to the minimum bounding rectangle are determined as second pixels;
then, the sum of the numbers of the first pixels and the second pixels is calculated;
finally, the ratio of the number of the first pixels to the sum of the numbers of the first and second pixels is calculated as the degree of overlap.
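With the two regions given as boolean pixel masks of the same shape, this pixel-counting computation might be sketched as follows (names are illustrative):

```python
import numpy as np

def overlap_degree(box_mask: np.ndarray, mbr_mask: np.ndarray) -> float:
    """Overlap degree by pixel counting: first pixels lie in both regions,
    second pixels lie in exactly one of them, and the ratio
    first / (first + second) equals intersection over union."""
    first = int(np.logical_and(box_mask, mbr_mask).sum())   # intersection pixels
    second = int(np.logical_xor(box_mask, mbr_mask).sum())  # exclusive pixels
    total = first + second                                  # = union pixel count
    return first / total if total else 0.0
```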
Step 30: When the degree of overlap is greater than a preset overlap threshold, adjust the width of the text prediction box through the minimum bounding rectangle.
The overlap threshold is an empirical value that can be set by a technician according to the actual situation. Usually, the width of the minimum bounding rectangle is smaller than the width of the text prediction box, which indicates that the region inside the minimum bounding rectangle is more likely to belong to the text region. The text prediction box can therefore be adjusted appropriately through the minimum bounding rectangle, reducing its width accordingly.
Specifically, when the degree of overlap is greater than the preset overlap threshold, the text prediction box is adjusted by the following formula:
P1 = w*p + (1-w)*d,
where P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
With a suitable value of w in the above formula, the width of the text prediction box can be corrected and adjusted according to the smaller effective minimum bounding rectangle, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
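The adjustment rule itself is a one-line interpolation; a sketch with illustrative values for the weight coefficient and the overlap threshold (neither value is fixed by the application):

```python
def adjust_width(p: float, d: float, overlap: float,
                 w: float = 0.5, overlap_thresh: float = 0.8) -> float:
    """Shrink the prediction box width toward the narrower minimum bounding
    rectangle when the two regions agree well enough."""
    if overlap > overlap_thresh:
        return w * p + (1.0 - w) * d  # P1 = w*p + (1-w)*d
    return p                          # otherwise keep the original width
```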
Step 32: Cut the adjusted text prediction box out of the scene image to obtain a text image to be recognized.
The adjusted text prediction boxes indicate where text is located in the scene image. These text prediction boxes can therefore be cut out of the scene image and used as the text images to be recognized.
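A simple sketch of the cutting step, assuming axis-aligned (x1, y1, x2, y2) boxes for brevity; an oriented prediction box would first be rectified, for example with a perspective transform:

```python
import numpy as np

def cut_text_regions(image: np.ndarray, boxes) -> list:
    """Cut each adjusted prediction box out of the scene image; the crops
    are the text images handed to the recognizer."""
    crops = []
    for x1, y1, x2, y2 in boxes:
        crops.append(image[int(y1):int(y2), int(x1):int(x2)].copy())
    return crops
```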
Step 34: Recognize the text information in the text image to be recognized.
Any suitable type of algorithm or approach can be chosen to recognize and obtain the text information in the text image, yielding the final text detection result of the scene image. This is well known to those skilled in the art and is not described in detail here.
Applying the text detection method provided by the embodiments of this application allows the width of the text prediction box to be reliably reduced, achieving more accurate text recognition, reducing the difficulty of subsequent processing and improving the accuracy of text detection.
Since the minimum bounding rectangle serves as the standard for the final adjustment of the width of the text detection box, it is necessary to ensure that the minimum bounding rectangle is reliable; otherwise the subsequent adjustment may instead have undesirable consequences.
In some embodiments, before step 28 is executed, the method may further include the step of screening the minimum bounding rectangles shown in FIG. 4:
Step 401: Calculate the average confidence of the high-confidence pixels within the minimum bounding rectangle.
The average confidence is the mean of the confidences of these high-confidence pixels and represents the probability that the minimum bounding rectangle as a whole belongs to a text region.
Step 402: Determine whether the average confidence is less than a preset screening threshold. If yes, execute step 403; if not, execute step 404.
Step 403: Eliminate the minimum bounding rectangle.
It can be understood that minimum bounding rectangles with a low average confidence do not actually have a high reliability or probability of belonging to text, and are not sufficient as a correction standard. These minimum bounding rectangles can therefore be discarded and are not used for correcting the width of the text prediction box.
Step 404: Retain the minimum bounding rectangle as an effective minimum bounding rectangle. These effective minimum bounding rectangles can be used in the next processing step as a reference for adjusting the text detection box.
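A sketch of this screening pass, assuming each candidate rectangle carries the coordinates of its member high-confidence pixels (the data layout and the 0.8 threshold are illustrative):

```python
import numpy as np

def screen_rectangles(rects: list, score_map: np.ndarray,
                      screen_thresh: float = 0.8) -> list:
    """Keep only 'effective' minimum bounding rectangles whose member
    high-confidence pixels have a mean confidence above the threshold."""
    effective = []
    for rect in rects:
        ys, xs = rect["pixels"][:, 0], rect["pixels"][:, 1]
        if score_map[ys, xs].mean() >= screen_thresh:
            effective.append(rect)  # step 404: retain as effective
        # else: step 403: eliminate the rectangle
    return effective
```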
The embodiments of this application also provide a text detection apparatus corresponding to the text detection method for scene images of the above embodiments. Please refer to FIG. 5, which is a structural block diagram of a text detection apparatus for scene images provided by an embodiment of this application. As shown in FIG. 5, the text detection apparatus 500 includes: a training unit 50, a text prediction box detection unit 52, a screening unit 54, a minimum bounding rectangle determination unit 56, an overlap calculation unit 58, an adjustment unit 60, a cutting unit 62 and a text recognition unit 64.
The training unit 50 is used to train and optimize the fully convolutional network model.
The text prediction box detection unit 52 is used to detect and determine several text prediction boxes in the scene image through the trained fully convolutional network model. The screening unit 54 is used to screen, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box. The minimum bounding rectangle determination unit 56 is used to calculate, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box. The overlap calculation unit 58 is used to calculate the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle. The adjustment unit 60 is used to adjust the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold. The cutting unit 62 is used to cut the adjusted text prediction box out of the scene image to obtain a text image to be recognized. The text recognition unit 64 is used to recognize the text information in the text image to be recognized.
On the basis of text detection with the EAST method, the text detection apparatus for scene images provided by the embodiments of this application can correct and adjust the width of the text prediction box through high-confidence regions, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
In some embodiments, as shown in FIG. 6, in addition to the functional modules shown in FIG. 5, the text detection apparatus 500 may further include a confidence calculation unit 66 and a minimum bounding rectangle screening unit 68.
The confidence calculation unit 66 is used to calculate the average confidence of the high-confidence pixels within the minimum bounding rectangle. The minimum bounding rectangle screening unit 68 is used to eliminate the minimum bounding rectangle when the average confidence is less than a preset screening threshold.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixels of one text prediction box. It represents the rectangular region given by the high-confidence pixels of the same text prediction box. The minimum bounding rectangle can be determined or calculated in any suitable manner; calculating the minimum bounding rectangle of a known set of pixels is well known to those skilled in the art and is not described here.
Applying the text detection apparatus for scene images provided by the embodiments of this application allows the width of the text prediction box to be reliably reduced, achieving more accurate text recognition.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of functions. Whether these functions are executed in hardware or software depends on the particular application and design constraints of the technical solution. Professionals may use different methods for each specific application to implement the described functions, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a logical functional division, and in actual implementation there may be other divisions; units with the same function may be combined into one unit; multiple units or components may be combined or integrated into another system; or some features may be omitted or not executed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may also be electrical, mechanical or other forms of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments of this application.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily think of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A text detection method for scene images, comprising:
    training and optimizing a fully convolutional network model;
    detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
    screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
    calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box;
    calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle;
    when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle;
    cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized;
    recognizing the text information in the text image to be recognized.
  2. The method according to claim 1, wherein before the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
    calculating the average confidence of the high-confidence pixels within the minimum bounding rectangle;
    eliminating the minimum bounding rectangle when the average confidence is less than a preset screening threshold.
  3. The method according to claim 2, wherein the training and optimizing a fully convolutional network model comprises:
    constructing a fully convolutional network model;
    annotating training labels and constructing a training data set;
    training and optimizing the fully convolutional network model through the training data set and a preset loss function.
  4. The method according to claim 1, wherein the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle comprises:
    determining the pixels inside both the text prediction box and the minimum bounding rectangle as first pixels;
    determining the pixels belonging only to the text prediction box or only to the minimum bounding rectangle as second pixels;
    calculating the sum of the numbers of the first pixels and the second pixels;
    calculating the ratio of the number of the first pixels to the sum of the numbers of the first pixels and the second pixels as the degree of overlap.
  5. The method according to claim 1, wherein when the degree of overlap is greater than a preset overlap threshold, the text prediction box is adjusted by the following formula:
    P1 = w*p + (1-w)*d,
    where P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
  6. The method according to claim 1, wherein the calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box comprises:
    determining the two high-confidence pixels farthest apart among the high-confidence pixels as length calibration pixels;
    taking the line connecting the length calibration pixels as a first direction, and determining the two high-confidence pixels farthest apart along a second direction perpendicular to the first direction as width calibration pixels;
    enclosing the minimum bounding rectangle with first line segments that pass through the length calibration pixels and are perpendicular to the line connecting the length calibration pixels, and second line segments that pass through the width calibration pixels and are perpendicular to the line connecting the width calibration pixels.
  7. The method according to claim 1, wherein the network structure of the fully convolutional network model comprises a feature extraction layer, feature merging and an output layer, the feature extraction layer being configured to obtain feature maps whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image data, respectively.
  8. The method according to claim 7, wherein the output layer of the fully convolutional network model comprises a text score feature map and a geometric feature map, the text score feature map recording, for each pixel mapped to the image to be detected, the probability that the pixel belongs to a text prediction box, and the geometric feature map recording, for each pixel mapped to the image to be detected, the distance between the pixel and the text prediction box.
  9. A text detection apparatus for scene images, comprising:
    a training unit for training and optimizing a fully convolutional network model;
    a text prediction box detection unit for detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
    a screening unit for screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
    a minimum bounding rectangle determination unit for calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box;
    an overlap calculation unit for calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle;
    an adjustment unit for adjusting the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold;
    a cutting unit for cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized;
    a text recognition unit for recognizing the text information in the text image to be recognized.
  10. A computer device, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor implements the following steps when executing the computer program:
    training and optimizing a fully convolutional network model;
    detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
    screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
    calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box;
    calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle;
    when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle;
    cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized;
    recognizing the text information in the text image to be recognized.
  11. The computer device according to claim 10, wherein before the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
    calculating the average confidence of the high-confidence pixels within the minimum bounding rectangle;
    eliminating the minimum bounding rectangle when the average confidence is less than a preset screening threshold.
  12. The computer device according to claim 11, wherein the training and optimizing a fully convolutional network model comprises:
    constructing a fully convolutional network model;
    annotating training labels and constructing a training data set;
    training and optimizing the fully convolutional network model through the training data set and a preset loss function.
  13. The computer device according to claim 10, wherein the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle comprises:
    determining the pixels inside both the text prediction box and the minimum bounding rectangle as first pixels;
    determining the pixels belonging only to the text prediction box or only to the minimum bounding rectangle as second pixels;
    calculating the sum of the numbers of the first pixels and the second pixels;
    calculating the ratio of the number of the first pixels to the sum of the numbers of the first pixels and the second pixels as the degree of overlap.
  14. The computer device according to claim 10, wherein when the degree of overlap is greater than a preset overlap threshold, the text prediction box is adjusted by the following formula:
    P1 = w*p + (1-w)*d,
    where P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the width of the text prediction box, and d is the width of the corresponding minimum bounding rectangle.
  15. The computer device according to claim 10, wherein the calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box comprises:
    determining the two high-confidence pixels farthest apart among the high-confidence pixels as length calibration pixels;
    taking the line connecting the length calibration pixels as a first direction, and determining the two high-confidence pixels farthest apart along a second direction perpendicular to the first direction as width calibration pixels;
    enclosing the minimum bounding rectangle with first line segments that pass through the length calibration pixels and are perpendicular to the line connecting the length calibration pixels, and second line segments that pass through the width calibration pixels and are perpendicular to the line connecting the width calibration pixels.
  16. The computer device according to claim 10, wherein the network structure of the fully convolutional network model comprises a feature extraction layer, feature merging and an output layer, the feature extraction layer being configured to obtain feature maps whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image data, respectively.
  17. The computer device according to claim 16, wherein the output layer of the fully convolutional network model comprises a text score feature map and a geometric feature map, the text score feature map recording, for each pixel mapped to the image to be detected, the probability that the pixel belongs to a text prediction box, and the geometric feature map recording, for each pixel mapped to the image to be detected, the distance between the pixel and the text prediction box.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    training and optimizing a fully convolutional network model;
    detecting and determining several text prediction boxes in the scene image through the trained fully convolutional network model;
    screening, within the text prediction box, pixels whose confidence is greater than a preset confidence threshold as high-confidence pixels, the confidence being the probability, output by the fully convolutional network model, that a pixel belongs to a text prediction box;
    calculating, according to the high-confidence pixels, the minimum bounding rectangle corresponding to the text prediction box, the minimum bounding rectangle being the rectangle of smallest area that contains all the high-confidence pixels in the text prediction box;
    calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle;
    when the degree of overlap is greater than a preset overlap threshold, adjusting the width of the text prediction box through the minimum bounding rectangle;
    cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized;
    recognizing the text information in the text image to be recognized.
  19. The computer-readable storage medium according to claim 18, wherein before the calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
    calculating the average confidence of the high-confidence pixels within the minimum bounding rectangle;
    eliminating the minimum bounding rectangle when the average confidence is less than a preset screening threshold.
  20. The computer-readable storage medium according to claim 19, wherein the training and optimizing a fully convolutional network model comprises:
    constructing a fully convolutional network model;
    annotating training labels and constructing a training data set;
    training and optimizing the fully convolutional network model through the training data set and a preset loss function.
PCT/CN2020/131604 2020-03-26 2020-11-26 Text detection method, apparatus, computer device and storage medium for scene images WO2021189889A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010223195.1 2020-03-26
CN202010223195.1A 2020-03-26 2020-03-26 Text detection method, apparatus and computer device for scene images

Publications (1)

Publication Number Publication Date
WO2021189889A1 true WO2021189889A1 (zh) 2021-09-30

Family

ID=72124246

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131604 Text detection method, apparatus, computer device and storage medium for scene images 2020-03-26 2020-11-26

Country Status (2)

Country Link
CN (1) CN111582021B (zh)
WO (1) WO2021189889A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067237A (zh) * 2021-10-28 2022-02-18 清华大学 Video data processing method, apparatus and device
CN114495103A (zh) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, apparatus, electronic device and medium
CN115375987A (zh) * 2022-08-05 2022-11-22 北京百度网讯科技有限公司 Data annotation method, apparatus, electronic device and storage medium
CN117649635A (zh) * 2024-01-30 2024-03-05 湖北经济学院 Vanishing point detection method, system and storage medium for narrow waterway scenes

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582021B (zh) * 2020-03-26 2024-07-05 平安科技(深圳)有限公司 Text detection method, apparatus and computer device for scene images
CN111932577B (zh) * 2020-09-16 2021-01-08 北京易真学思教育科技有限公司 Text detection method, electronic device and computer-readable medium
CN111931784B (zh) * 2020-09-17 2021-01-01 深圳壹账通智能科技有限公司 Bill recognition method, system, computer device and computer-readable storage medium
CN112329765B (zh) * 2020-10-09 2024-05-24 中保车服科技服务股份有限公司 Text detection method and apparatus, storage medium and computer device
CN112232340A (zh) * 2020-10-15 2021-01-15 马婧 Method and apparatus for recognizing information printed on object surfaces
CN112613561B (zh) * 2020-12-24 2022-06-03 哈尔滨理工大学 EAST algorithm optimization method
CN112819937B (zh) * 2021-04-19 2021-07-06 清华大学 Adaptive multi-object light-field three-dimensional reconstruction method, apparatus and device
CN113298079B (zh) * 2021-06-28 2023-10-27 北京奇艺世纪科技有限公司 Image processing method and apparatus, electronic device and storage medium
CN114037826A (zh) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, apparatus, device and medium based on multi-scale enhanced features

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886997A (zh) * 2019-01-23 2019-06-14 平安科技(深圳)有限公司 Recognition frame determination method and apparatus based on object detection, and terminal device
CN109886174A (zh) * 2019-02-13 2019-06-14 东北大学 Natural scene text recognition method for warehouse shelf signboard text recognition
CN109977943A (zh) * 2019-02-14 2019-07-05 平安科技(深圳)有限公司 YOLO-based image object recognition method, system and storage medium
CN110414499A (zh) * 2019-07-26 2019-11-05 第四范式(北京)技术有限公司 Text position locating method and system, and model training method and system
CN110443140A (zh) * 2019-07-05 2019-11-12 平安科技(深圳)有限公司 Text locating method and apparatus, computer device and storage medium
US20200012876A1 (en) * 2017-09-25 2020-01-09 Tencent Technology (Shenzhen) Company Limited Text detection method, storage medium, and computer device
CN111582021A (zh) 2020-03-26 2020-08-25 平安科技(深圳)有限公司 Text detection method, apparatus and computer device for scene images

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135424B (zh) * 2019-05-23 2021-06-11 阳光保险集团股份有限公司 Tilted text detection model training method and ticket image text detection method
CN110232713B (zh) * 2019-06-13 2022-09-20 腾讯数码(天津)有限公司 Image target localization correction method and related device
CN110796082B (zh) * 2019-10-29 2020-11-24 上海眼控科技股份有限公司 Nameplate text detection method and apparatus, computer device and storage medium
CN110874618B (zh) * 2020-01-19 2020-11-27 同盾控股有限公司 Small-sample-based OCR template learning method and apparatus, electronic device and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067237A (zh) * 2021-10-28 2022-02-18 清华大学 Video data processing method, apparatus and device
CN114495103A (zh) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, apparatus, electronic device and medium
CN115375987A (zh) * 2022-08-05 2022-11-22 北京百度网讯科技有限公司 Data annotation method, apparatus, electronic device and storage medium
CN115375987B (zh) * 2022-08-05 2023-09-05 北京百度网讯科技有限公司 Data annotation method, apparatus, electronic device and storage medium
CN117649635A (zh) * 2024-01-30 2024-03-05 湖北经济学院 Vanishing point detection method, system and storage medium for narrow waterway scenes
CN117649635B (zh) * 2024-01-30 2024-06-11 湖北经济学院 Vanishing point detection method, system and storage medium for narrow waterway scenes

Also Published As

Publication number Publication date
CN111582021B (zh) 2024-07-05
CN111582021A (zh) 2020-08-25

Similar Documents

Publication Publication Date Title
WO2021189889A1 (zh) Text detection method, apparatus, computer device and storage medium for scene images
CN109886997B (zh) Recognition frame determination method and apparatus based on object detection, and terminal device
CN108446698B (zh) Method, apparatus, medium and electronic device for detecting text in an image
US12014556B2 (en) Image recognition method, apparatus, terminal, and storage medium
JP6871314B2 (ja) Object detection method, apparatus and storage medium
WO2018108129A1 (zh) Method and apparatus for recognizing object category, and electronic device
US11037031B2 (en) Image recognition method, electronic apparatus and readable storage medium
WO2020253127A1 (zh) Facial feature extraction model training method, facial feature extraction method, apparatus, device and storage medium
CN112380981B (zh) Face keypoint detection method and apparatus, storage medium and electronic device
CN110348522B (zh) Image detection and recognition method and system, electronic device, and image classification network optimization method and system
US10783643B1 (en) Segmentation-based damage detection
CN111783767B (zh) Character recognition method and apparatus, electronic device and storage medium
US11900676B2 (en) Method and apparatus for detecting target in video, computing device, and storage medium
CN110443357B (zh) Convolutional neural network computation optimization method and apparatus, computer device and medium
CN113642584B (zh) Character recognition method, apparatus, device, storage medium and smart dictionary pen
US20220165064A1 (en) Method for acquiring traffic state, relevant apparatus, roadside device and cloud control platform
WO2022227218A1 (zh) Drug name recognition method and apparatus, computer device and storage medium
WO2023279890A1 (zh) Image processing method and apparatus, electronic device and storage medium
CN112417947B (zh) Keypoint detection model optimization and facial keypoint detection method and apparatus
CN115131748A (zh) Method and system for improving target tracking and recognition accuracy of an integrated radar-camera device
CN113516697B (zh) Image registration method and apparatus, electronic device and computer-readable storage medium
CN112966687B (zh) Image segmentation model training method, apparatus and communication device
US11393181B2 (en) Image recognition system and updating method thereof
CN109934185B (zh) Data processing method and apparatus, medium and computing device
CN111985471A (zh) License plate locating method, apparatus and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20926946

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20926946

Country of ref document: EP

Kind code of ref document: A1