CN110163202B - Text region positioning method and device, terminal equipment and medium - Google Patents


Info

Publication number
CN110163202B
Authority
CN
China
Prior art keywords
training
matrix
area
global
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910264868.5A
Other languages
Chinese (zh)
Other versions
CN110163202A (en)
Inventor
黄泽浩
王满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910264868.5A priority Critical patent/CN110163202B/en
Publication of CN110163202A publication Critical patent/CN110163202A/en
Application granted granted Critical
Publication of CN110163202B publication Critical patent/CN110163202B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention is applicable to the technical field of artificial intelligence and provides a text region positioning method, a device, terminal equipment and a medium. A preset number of region update operations are performed on a target image, and the selected region remaining after those operations is output as the text region of the target image. Each region update operation comprises: acquiring the selected region of the target image; calculating, from the features of each pixel point in the target image and a preset neural network model, three matrices that respectively characterize the global features of the target image, the features of the selected region, and the historical operations; generating a state matrix based on the three matrices; generating the operation category corresponding to the state matrix through a preset decision model; and updating the current selected region according to that operation category. One adjustment at a time, the range of the selected region is gradually narrowed until the text region of the target image is obtained, improving the degree of automation of text region positioning.

Description

Text region positioning method and device, terminal equipment and medium
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a text region positioning method, a text region positioning device, terminal equipment and a medium.
Background
Companies often need to extract data from various types of images during operation, for example: extracting account information from the scanned image of an invoice submitted by an employee, or extracting data from a scanned feedback letter sent by a customer. A scanned image usually contains both non-text areas and text areas, and text recognition on the text areas may be disturbed by the content of the non-text areas. In the prior art, therefore, a text area must first be manually located and segmented from the scanned image before a machine can automatically recognize its text. Clearly, the low degree of automation in locating and segmenting text areas in the existing approach is detrimental to the processing efficiency of the overall text recognition work.
Disclosure of Invention
In view of the above, embodiments of the invention provide a text region positioning method and terminal equipment, so as to solve the prior-art problem that text regions cannot be automatically determined from images.
A first aspect of an embodiment of the present invention provides a method for positioning a text region, including:
Performing area updating operation for a preset number of times on a target image, and outputting a selected area after the area updating operation for the preset number of times as the text area of the target image; the area update operation includes: acquiring a selected area of a target image; according to the characteristics of each pixel point in the target image, determining a global characteristic matrix of the target image and a local characteristic matrix of the selected area; respectively inputting the global feature matrix and the local feature matrix into a preset neural network to generate a global convolution feature matrix corresponding to the target image and a local convolution feature matrix corresponding to the selected area; acquiring a historical operation matrix, wherein the historical operation matrix is used for representing operation categories which are made by a plurality of endpoints of the selected area in time sequence, and combining the global convolution feature matrix, the local convolution feature matrix and the historical operation matrix into a state matrix; and inputting the state matrix into a preset decision model, outputting an operation category, updating the historical operation matrix according to the operation category, and adjusting a plurality of endpoints of the selected area according to the operation category so as to update the selected area.
A second aspect of an embodiment of the present invention provides a text region positioning device, including:
The execution module is used for executing a preset number of area update operations on a target image and outputting the selected area after the preset number of area update operations as the text area of the target image; the execution module comprises: the acquisition sub-module is used for acquiring a selected area of the target image; the first matrix generation sub-module is used for determining a global feature matrix of the target image and a local feature matrix of the selected area according to the features of each pixel point in the target image; the second matrix generation sub-module is used for respectively inputting the global feature matrix and the local feature matrix into a preset neural network to generate a global convolution feature matrix corresponding to the target image and a local convolution feature matrix corresponding to the selected area; a combination sub-module, configured to obtain a historical operation matrix, where the historical operation matrix is used to characterize, in time order, the operation categories that have been made for a plurality of endpoints of the selected area, and combine the global convolution feature matrix, the local convolution feature matrix, and the historical operation matrix into a state matrix; and the updating sub-module is used for inputting the state matrix into a preset decision model, outputting an operation category, adjusting a plurality of endpoints of the selected area according to the operation category so as to update the selected area, and updating the historical operation matrix according to the operation category.
A third aspect of an embodiment of the present invention provides terminal equipment, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method provided by the first aspect of the embodiments of the invention.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method provided by the first aspect of the embodiments of the present invention.
In the embodiment of the invention, a preset number of area update operations are performed on the target image, and the selected area after those operations is output as the text area of the target image. Each area update operation comprises the steps of: acquiring the selected area of the target image; calculating, from the features of the pixel points in the target image and a preset neural network model, three matrices that respectively characterize the global features of the target image, the features of the selected area, and the historical operations; generating a state matrix based on the three matrices; generating the operation category corresponding to the state matrix through a preset decision model; and updating and adjusting the current selected area according to that operation category. One adjustment at a time, the range of the selected area is gradually narrowed until the text area in the target image is obtained automatically, improving the degree of automation of text area positioning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an implementation of a region update operation provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a specific implementation of step S103 of the area update operation provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a specific implementation of generating a training data set provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a text region positioning device provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a terminal device provided by an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Notably, in the embodiments of the present invention the target image to be recognized contains only one text region; that is, all text on the target image is concentrated in that text region, and no text appears outside it. For example, a letter sent by a customer often comprises a background image and a text area: the text area carries the information the letter is actually meant to convey, while the background image serves decorative and anti-counterfeiting purposes. In the embodiment of the invention, an image of such a letter can be used as the target image, and its text area is located after several rounds of the area update operation.
Compared with other technologies, the core of the embodiment of the invention is that, because interference factors may exist in the background of the target image, the text area is not determined in a single calculation. Instead, starting from an initial selected area in the target image (which may cover the entire target image), round after round of area update operations is performed so that the selected area shrinks gradually; after the preset number of area update operations, the selected area output by the last round is taken as the text area of the target image.
It will be appreciated that each round of the area update operation changes the boundaries of the selected area only slightly, and the final text area is obtained only after many such small changes. The advantage is that if one update goes wrong because of interference factors in the target image, the subsequent area update operations can correct it; gradually shrinking the selected area through the preset number of area update operations therefore gives the embodiment of the invention a higher fault tolerance.
Since the embodiment of the present invention outputs the text region of the target image by repeatedly performing the region update operation, the specific flow of one region update operation is described in detail below. FIG. 1 shows the implementation flow of the region update operation provided by an embodiment of the present invention; the flow includes steps S101 to S105, whose specific implementation principles are as follows.
In S101, a selected region of the target image is acquired.
In the embodiment of the invention, the target image may be a directly received electronic image, a photograph captured by a camera, or an image produced by a scanner.
It will be appreciated that each round of the area update operation must first acquire the selected area for the current round; the task of each round is then to adjust the borders of that selected area so as to update it. In the embodiment of the invention, the selected area is rectangular and has four borders.
Notably, when the region update operation is performed for the first time, the acquired selected region of the target image is a preset initial region whose four borders may completely coincide with the four borders of the target image; that is, the initial region may cover the entire target image. In every other round, the acquired selected area is the updated selected area produced by the previous round's area update operation. It will be appreciated that the area update operations iterate correctly precisely because the selected area updated in one round is taken as the selected area acquired in the next.
In S102, a global feature matrix of the target image and a local feature matrix of the selected region are determined according to the features of each pixel point in the target image.
It may be understood that the feature of each pixel point may be its RGB value, so the global feature matrix of the target image may be constructed as follows. First, three matrices are built from the three RGB layers of the target image: one matrix for the R layer, one for the G layer, and one for the B layer, with each element taking a value from 0 to 255. Second, the per-layer matrices are fused: the rows of the R-layer matrix are expanded by inserting two blank rows after each row, and the rows of the matrices of the other two layers are imported, in row-number order, into the blank rows of the expanded R-layer matrix. This forms a 3M×N matrix, where M is the number of rows of pixel points of the target image and N is the number of columns; this 3M×N matrix is used as the global feature matrix of the target image.
It will be appreciated that since the selected region is a local part of the target image (the initial selected region may be global to the target image in the first round of region updating operations), the local feature matrix corresponding to the selected region may be constructed according to the same method as described in the previous paragraph for constructing the global feature matrix.
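As an illustration, the row-interleaving fusion described above can be written in a few lines of numpy. This is a minimal sketch, assuming the G-layer rows fill the first inserted blank row and the B-layer rows the second; the function name and the slice used for the selected region are illustrative, not part of the invention:

```python
import numpy as np

def build_feature_matrix(image_rgb: np.ndarray) -> np.ndarray:
    """Interleave the rows of the R, G and B layers of an M x N x 3 image
    into a single 3M x N feature matrix: R row k is followed by G row k
    and then B row k, matching the blank-row filling described above."""
    m, n, _ = image_rgb.shape
    feature = np.empty((3 * m, n), dtype=image_rgb.dtype)
    feature[0::3] = image_rgb[:, :, 0]  # R-layer rows
    feature[1::3] = image_rgb[:, :, 1]  # G rows fill the first blank row
    feature[2::3] = image_rgb[:, :, 2]  # B rows fill the second blank row
    return feature

# the local feature matrix is built the same way from the selected region,
# e.g. build_feature_matrix(image_rgb[y1:y2, x1:x2]) with hypothetical coordinates
```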
It will be appreciated that the feature of each pixel may also be another parameter corresponding to that pixel, such as its gray value.
In S103, the global feature matrix and the local feature matrix are respectively input into a preset neural network, so as to generate a global convolution feature matrix corresponding to the target image and a local convolution feature matrix corresponding to the selected area.
In the embodiment of the invention, a neural network needs to be trained in advance, the neural network can be a convolutional neural network model, and a specific training process can be as follows:
First, a number of preset training feature matrices and the training convolution feature matrices corresponding to them are obtained. Second, the following steps are repeated until the cross-entropy loss function value of the updated neural network is smaller than a preset loss threshold: take a training feature matrix as the input of the neural network and the corresponding training convolution feature matrix as its expected output, update the parameters of each layer of the neural network by stochastic gradient descent, and calculate the cross-entropy loss function value of the updated network.
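A PyTorch-style sketch of this threshold-driven pre-training loop is given below. It is an illustration only: it assumes the loader yields (training feature matrix, target) pairs in a form accepted by cross-entropy loss, and the learning rate and loss threshold are placeholder values:

```python
import torch
import torch.nn as nn

def pretrain(net: nn.Module, loader, loss_threshold: float = 0.05) -> nn.Module:
    """Repeat SGD updates until the cross-entropy loss of the updated
    network falls below the preset loss threshold."""
    optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    loss = float("inf")
    while loss > loss_threshold:
        for features, target in loader:   # training feature matrix and its label
            optimizer.zero_grad()
            batch_loss = loss_fn(net(features), target)
            batch_loss.backward()         # gradients for every layer's parameters
            optimizer.step()              # stochastic gradient descent update
            loss = batch_loss.item()
    return net
```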
It can be understood that after the global feature matrix and the local feature matrix are obtained in the above manner, the two matrices need only be input into the preset neural network to obtain the global convolution feature matrix and the local convolution feature matrix, which reflect the global features of the target image and the local features of the selected area respectively.
Optionally, the neural network in the embodiment of the present invention includes a plurality of 3×3 convolutional layers, a pooling layer, and a full-connection layer, where each convolutional layer corresponds to a convolutional layer number according to its order from front to back in the neural network.
As an embodiment of the present invention, as shown in fig. 2, S103 includes:
S1031, respectively importing the global feature matrix and the local feature matrix into a preset convolutional neural network; then, starting from the convolutional layer with the largest layer number and stepping back by a first preset number of layers at each interval, extracting the data output by one convolutional layer, to serve as global selected data and local selected data respectively.
Optionally, the preset neural network may include 11 convolution layers: layers 1 and 2 of the model are 3×3 convolution layers with stride 1 and 32 feature channels; layer 3 is a 3×3 convolution layer with stride 2 and 64 feature channels; layers 4 and 5 are 3×3 convolution layers with stride 1 and 64 feature channels; layer 6 is a 3×3 convolution layer with stride 2 and 128 feature channels; layers 7 and 8 are 3×3 convolution layers with stride 1 and 128 feature channels; layer 9 is a 3×3 convolution layer with stride 2 and 256 feature channels; and layers 10 and 11 are 3×3 convolution layers with stride 1 and 256 feature channels.
Optionally, after the global feature matrix is input, the data output by one convolution layer is extracted every 3 convolution layers, counting back from the 11th convolution layer; the data output by the 11th, 8th, 5th and 2nd layers are thus extracted as the global selected data.
Similarly, after the local feature matrix is input, the data output by one convolution layer is extracted every 3 convolution layers, counting back from the 11th convolution layer; the data output by the 11th, 8th, 5th and 2nd layers are thus extracted as the local selected data.
S1032, global average pooling is carried out on the second preset number of global selected data and the local selected data respectively, and the second preset number of global pooling vectors and local pooling vectors are generated.
For example, assuming that the second preset number is 3, global average pooling is performed on the data output by the 11th, 8th and 5th layers respectively, yielding a 256-dimensional, a 128-dimensional and a 64-dimensional pooling vector.
It will be appreciated that if a global feature matrix is input to the neural network, 3 global pooling vectors are obtained in this step, and if a local feature matrix is input to the neural network, 3 local pooling vectors are obtained in this step.
S1033, splicing the second preset number of global pooling vectors to generate a total global pooling vector, and splicing the second preset number of local pooling vectors to generate a total local pooling vector.
Illustratively, the 64-dimensional global pooling vector, the 128-dimensional global pooling vector, and the 256-dimensional global pooling vector are spliced to generate a 448-dimensional total global pooling vector.
S1034, respectively inputting the total global pooling vector and the total local pooling vector into a full connection layer of the convolutional neural network, and outputting a global convolutional feature matrix corresponding to the target image and a local convolutional feature matrix corresponding to the selected area.
It can be appreciated that the 448-dimensional total global pooling vector or total local pooling vector input to the fully-connected layer of the preset neural network is converted into the global convolution feature matrix or the local convolution feature matrix; since the calculation principle of the fully-connected layer of a convolutional neural network is prior art, it is not described in detail here.
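The pipeline of S1031-S1034 can be sketched in PyTorch as follows, combining the 11-layer stack described above with pooling taps at layers 5, 8 and 11 (the second preset number being 3; layer 2 is also extracted as selected data in S1031 but is not pooled in this example). The ReLU activations, the single-channel input, and the output size of the fully-connected head are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class BackboneSketch(nn.Module):
    """11 conv layers with channel plan 32,32,64,64,64,128,128,128,256,256,256
    and stride 2 at layers 3, 6 and 9; layers 5, 8 and 11 are globally
    average-pooled and concatenated into a 448-dim vector (64+128+256)."""
    def __init__(self, in_channels: int = 1):
        super().__init__()
        chans   = [32, 32, 64, 64, 64, 128, 128, 128, 256, 256, 256]
        strides = [ 1,  1,  2,  1,  1,   2,   1,   1,   2,   1,   1]
        blocks, prev = [], in_channels
        for c, s in zip(chans, strides):
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, c, kernel_size=3, stride=s, padding=1),
                nn.ReLU(inplace=True)))
            prev = c
        self.convs = nn.ModuleList(blocks)
        self.taps = {4, 7, 10}                    # 0-based indices of layers 5, 8, 11
        self.fc = nn.Linear(64 + 128 + 256, 448)  # output size is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = []
        for i, conv in enumerate(self.convs):
            x = conv(x)
            if i in self.taps:
                pooled.append(x.mean(dim=(2, 3)))  # global average pooling
        return self.fc(torch.cat(pooled, dim=1))   # total pooling vector -> FC

# e.g. a 3M x N global feature matrix fed as a single-channel image:
# feat = BackboneSketch()(torch.randn(1, 1, 96, 64))
```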
It will be appreciated that global features of the target image and local features of the selected region may be enhanced by the neural network described above.
In S104, a history operation matrix is obtained, where the history operation matrix is used to characterize operation categories that have been made by a plurality of endpoints of the selected area in time sequence, and the global convolution feature matrix, the local convolution feature matrix, and the history operation matrix are combined into a state matrix.
Notably, since the embodiments of the present invention determine the text area by updating the selected area round after round, each update of the selected area is actually accomplished by moving its borders. Assuming the selected area always remains rectangular, it has 4 borders and 4 endpoints; clearly, controlling the movement of the two diagonally opposite endpoints suffices to control the movement of all 4 borders. Each endpoint has 5 operation categories: move a preset number of pixels right, move a preset number of pixels left, move a preset number of pixels up, move a preset number of pixels down, or do not move. The operation categories of the two endpoints combine pairwise into 5×5=25 combinations, so the operation category calculated in each round is one of these 25, as enumerated in the sketch below.
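The 25 operation categories can be enumerated compactly; a minimal sketch, where the pixel step value of 5 is an assumption:

```python
from itertools import product

STEP = 5  # "preset number of pixels" per move; the value 5 is an assumption

# five operation categories per endpoint: right, left, up, down, stay
MOVES = {"R": (STEP, 0), "L": (-STEP, 0), "U": (0, -STEP), "D": (0, STEP), "S": (0, 0)}

# pairwise combinations for the two diagonal endpoints: 5 x 5 = 25 classes
OPERATION_CLASSES = list(product(MOVES, MOVES))
assert len(OPERATION_CLASSES) == 25
```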
Since each round yields one of the 25 operation categories, the embodiment of the present invention records, in time order, the historical operation categories produced before the current round's area update operation and represents them as a matrix; the matrix that characterizes, in time order, the operation categories already made for the endpoints of the selected area is the historical operation matrix.
Alternatively, since the embodiment of the present invention repeats the area update operation a preset number of times and each round yields exactly one operation category, the total number of operation categories produced is limited and equals the preset number of rounds, while the number of operation categories available in each round is also limited, namely 25. Therefore, the historical operation matrix in the embodiment of the present invention may have 25 rows and a number of columns equal to the preset number of rounds: each round corresponds to a column of the historical operation matrix, and each operation category corresponds to a row. It will be appreciated that after each round of area update, the column corresponding to that round is updated as follows: the element of the row corresponding to the operation category obtained in that round is set to 1, and the other elements of the column remain 0. All elements of columns corresponding to rounds not yet executed are 0.
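This one-hot column update can be sketched as follows; the number of rounds is illustrative:

```python
import numpy as np

def update_history(history: np.ndarray, round_idx: int, op_class: int) -> np.ndarray:
    """In the column for the current round, set the row of the chosen
    operation category to 1; columns of rounds not yet executed stay zero."""
    history[:, round_idx] = 0
    history[op_class, round_idx] = 1
    return history

history = np.zeros((25, 10))  # 25 categories x preset number of rounds (10 assumed)
history = update_history(history, round_idx=0, op_class=7)
```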
It will be appreciated that the global convolution feature matrix, the local convolution feature matrix and the history operation matrix may be combined into a state matrix according to existing matrix combinations.
In S105, the state matrix is input into a preset decision model, an operation class is output, the history operation matrix is updated according to the operation class, and a plurality of endpoints of the selected area are adjusted according to the operation class, so as to update the selected area.
Alternatively, the decision model calculates a probability matrix corresponding to the state matrix by the formula

σ_j = e^{z_j · x_j} / Σ_{i=1}^{M} e^{z_i · x_i}

where σ_j is the probability value corresponding to the j-th element in the probability matrix; z_j is the parameter corresponding to the j-th element in a preset parameter matrix; M is the number of elements in the parameter matrix; x_i is the i-th element in the state matrix; and e is the natural constant.
In the embodiment of the invention, the category corresponding to the element with the largest value in the probability matrix is taken as the operation category output by the current round, and the selected area and the historical operation matrix are updated through this operation category (the update of the historical operation matrix is described in detail in S104 above). It will be appreciated that the updated selected area, together with the historical operation matrix, is carried into the next round of the area update operation. Obviously, if the current round is the last of the preset number of area update operations, the updated selected area is the text area of the target image that is finally output.
It will be appreciated that the individual elements of the parameter matrix described above are in fact parameters of the decision model obtained by training on the training data.
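The selection rule of S105 is, in essence, an argmax over a softmax distribution; a minimal sketch, with random placeholder scores standing in for the decision model's output:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax; subtracting the maximum does not change
    the result but prevents overflow in exp."""
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.random.randn(25)                       # placeholder model scores
operation_class = int(np.argmax(softmax(scores)))  # category with largest probability
```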
Optionally, the decision model in the embodiment of the present invention may be a Long Short-Term Memory (LSTM) neural network, so the training process for the decision model includes:
Firstly, acquiring multiple sets of training data, wherein each set comprises a training global convolution feature matrix for characterizing the global features of a training image, a training local convolution feature matrix for characterizing the features of a local area in the training image, a training operation record matrix for characterizing the operation categories that need to be executed in sequence when shrinking from one area of the training image to another, and the training operation category that needs to be executed next;
Secondly, generating a training state matrix according to the training global convolution feature matrix, the training local convolution feature matrix and the training operation record matrix, taking the training state matrix as the input of the long short-term memory network, taking the training operation category as the output of the long short-term memory network, and adjusting each learning parameter in the long short-term memory network so that the long short-term memory network meets the convergence condition:

θ* = arg max_θ Σ_{(sta, atc)} log p(atc | sta; θ)

wherein θ* is the learning parameter after adjustment; sta is a training state matrix; atc is the corresponding training operation category; p(atc|sta; θ) is the probability that, when the learning parameter takes the value θ, importing the training state matrix into the LSTM neural network outputs the training operation category; and arg max_θ Σ log p(atc|sta; θ) is the value of the learning parameter at which this summed log-probability is maximal;
and finally, taking the adjusted long short-term memory network as the decision model.
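Because maximizing Σ log p(atc|sta; θ) is equivalent to minimizing the cross-entropy between the model's output distribution and the training operation categories, the training step can be sketched in PyTorch as follows. The LSTM size, the treatment of the state matrix as a sequence of row vectors, and all tensor shapes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """LSTM decision model sketch: reads the state matrix as a sequence of
    row vectors and scores the 25 operation categories."""
    def __init__(self, row_dim: int, hidden: int = 128, n_ops: int = 25):
        super().__init__()
        self.lstm = nn.LSTM(row_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_ops)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(state)        # state: (batch, rows, row_dim)
        return self.head(out[:, -1])     # logits over the 25 categories

model = DecisionModel(row_dim=448)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
sta = torch.randn(8, 12, 448)            # dummy batch of training state matrices
atc = torch.randint(0, 25, (8,))         # dummy training operation categories
loss = nn.CrossEntropyLoss()(model(sta), atc)  # = -mean log p(atc | sta; theta)
loss.backward()
optimizer.step()
```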
It can be appreciated that in the foregoing training process, the key to training the decision model is collecting an accurate training data set. The embodiment of the present invention therefore further provides a method for generating a training data set. The method requires obtaining a training image and a candidate operation category set, where the training image is an image whose text area is known, and the candidate operation category set includes the operation categories that can be used to adjust a selected area in the training image. A cyclic operation is executed repeatedly a preset number of times, outputting multiple sets of training data; the steps of each cycle, shown in FIG. 3, include S301 to S305, described in detail as follows:
In S301, an initial training selected area is selected from the training image, and the current training selected area is adjusted according to each operation category in the candidate operation category set, so as to generate an adjusted training selected area corresponding to each operation category.
It will be appreciated that since the decision model trained with the training data is ultimately intended to be able to derive the corresponding class of operation for the input state matrix, steps similar to those described above for the region update operation need to be taken during the collection of the training data.
In the embodiment of the present invention, the candidate operation category set includes 25 operation categories, and the description of these 25 operation categories is detailed above and will not be repeated here.
In S302, according to the overlapping area of the text region and the adjusted training selected region corresponding to each operation category, an overlapping parameter corresponding to each operation category is calculated.
Optionally, the coincidence parameter corresponding to each operation category is calculated by the formula

Co_i = Area(Select_i ∩ Text) / Area(Select_i ∪ Text)

wherein Co_i is the coincidence parameter corresponding to the i-th operation category in the candidate operation category set; Select_i is the adjusted training selected area corresponding to that operation category; Text is the text area; ∩ computes the intersection of the two areas; and ∪ computes their union.
It will be appreciated that, because the text region of the training image is known, after the training selected area is adjusted according to each operation category in the candidate operation category set, it can be determined whether the overlap between the text region and the adjusted training selected area has increased or decreased compared with before the adjustment, and by how much.
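For axis-aligned rectangles the coincidence parameter is the familiar intersection-over-union; a minimal sketch, with rectangles encoded as (x1, y1, x2, y2):

```python
def coincidence(select: tuple, text: tuple) -> float:
    """Co_i between two axis-aligned rectangles: intersection area / union area."""
    ix1, iy1 = max(select[0], text[0]), max(select[1], text[1])
    ix2, iy2 = min(select[2], text[2]), min(select[3], text[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(select) + area(text) - inter
    return inter / union if union else 0.0

print(coincidence((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```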
In S303, the operation type with the largest corresponding coincidence parameter is used as the selected operation type, and the training selected area adjusted by the selected operation type is used as the selected training area.
In S304, a training operation record matrix representing the operation types that need to be sequentially executed when the initial training selected area is reduced to the selected training area is generated, and a training local convolution feature matrix corresponding to the selected training area and a training global convolution feature matrix corresponding to the training image are generated.
In S305, the training global convolution feature matrix, the training local convolution feature matrix, the training operation record matrix, and the selected operation class are combined into a set of training data sets.
It will be appreciated that training data for training the decision model may be collected in the manner described above.
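The greedy choice of S301-S303 can then be sketched as follows, reusing coincidence() from the sketch above; the pixel step and the rectangle encoding are assumptions:

```python
from itertools import product

STEP = 5  # assumed pixel step, as in the enumeration sketch above
DELTAS = [(STEP, 0), (-STEP, 0), (0, -STEP), (0, STEP), (0, 0)]

def apply_op(region: tuple, op: tuple) -> tuple:
    """Apply one of the 25 operation classes to a rectangle (x1, y1, x2, y2);
    op = (move index for the top-left endpoint, move index for the bottom-right)."""
    (dx1, dy1), (dx2, dy2) = DELTAS[op[0]], DELTAS[op[1]]
    return (region[0] + dx1, region[1] + dy1, region[2] + dx2, region[3] + dy2)

def best_op(region: tuple, text: tuple) -> tuple:
    """S301-S303: adjust the current training selected area with every candidate
    operation and keep the one with the largest coincidence parameter against
    the known text area (uses coincidence() defined in the earlier sketch)."""
    candidates = list(product(range(5), range(5)))
    return max(candidates, key=lambda op: coincidence(apply_op(region, op), text))
```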
Corresponding to the text region positioning method described in the above embodiment, fig. 4 shows a block diagram of the text region positioning device provided in the embodiment of the present invention, and for convenience of explanation, only the portion relevant to the embodiment of the present invention is shown.
Referring to fig. 4, the apparatus includes:
The execution module 401 is configured to execute a preset number of area update operations on a target image, and output a selected area after the preset number of area update operations as the text area of the target image.
The execution module comprises:
An acquisition submodule 4011 for acquiring a selected region of the target image;
an extraction submodule 4012, configured to determine a global feature matrix of the target image and a local feature matrix of the selected area according to features of each pixel point in the target image;
A calculation submodule 4013, configured to input the global feature matrix and the local feature matrix into a preset neural network respectively, and generate a global convolution feature matrix corresponding to the target image and a local convolution feature matrix corresponding to the selected area;
A combining submodule 4014, configured to obtain a historical operation matrix, where the historical operation matrix is used to characterize operation categories that have been made by a plurality of endpoints of the selected area in time sequence, and combine the global convolution feature matrix, the local convolution feature matrix, and the historical operation matrix into a state matrix;
The updating submodule 4015 is configured to input the state matrix into a preset decision model, output an operation class, update the historical operation matrix according to the operation class, and adjust a plurality of endpoints of the selected area according to the operation class so as to update the selected area.
The calculating submodule is specifically configured to:
Respectively importing the global feature matrix and the local feature matrix into a preset convolutional neural network; then, starting from the convolutional layer with the largest layer number and stepping back by a first preset number of layers at each interval, extracting the data output by one convolutional layer, to serve as global selected data and local selected data respectively;
respectively carrying out global average pooling on a second preset number of global selected data and local selected data to generate a second preset number of global pooling vectors and local pooling vectors;
splicing the second preset number of global pooling vectors to generate a total global pooling vector, and splicing the second preset number of local pooling vectors to generate a total local pooling vector;
and respectively inputting the total global pooling vector and the total local pooling vector into a full-connection layer of the convolutional neural network, and outputting a global convolutional feature matrix corresponding to the target image and a local convolutional feature matrix corresponding to the selected area.
The apparatus further comprises a training module, configured to:
obtaining multiple sets of training data, wherein each set comprises a training global convolution feature matrix for characterizing the global features of a training image, a training local convolution feature matrix for characterizing the features of a local area in the training image, a training operation record matrix for characterizing the operation categories that need to be executed in sequence when shrinking from one area of the training image to another, and the training operation category that needs to be executed next;
generating a training state matrix according to the training global convolution feature matrix, the training local convolution feature matrix and the training operation record matrix, taking the training state matrix as the input of the long short-term memory network, taking the training operation category as the output of the long short-term memory network, and adjusting each learning parameter in the long short-term memory network so that the long short-term memory network meets the convergence condition:

θ* = arg max_θ Σ_{(sta, atc)} log p(atc | sta; θ)

wherein θ* is the learning parameter after adjustment; sta is a training state matrix; atc is the corresponding training operation category; p(atc|sta; θ) is the probability that, when the learning parameter takes the value θ, importing the training state matrix into the LSTM neural network outputs the training operation category; and arg max_θ Σ log p(atc|sta; θ) is the value of the learning parameter at which this summed log-probability is maximal;
and taking the adjusted long short-term memory network as the decision model.
The apparatus further comprises a training data collection module, configured to:
Acquiring a training image and a candidate operation class set, wherein the training image is an image with a known text area, and the candidate operation class set comprises a plurality of operation classes which can be used for adjusting a selected area in the training image;
Repeatedly executing the cyclic operation a preset number of times, and outputting multiple sets of training data, wherein each round of the cyclic operation comprises the following steps:
Selecting an initial training selected area from the training image, adjusting the current training selected area according to each operation category in the candidate operation category set, and generating an adjusted training selected area corresponding to each operation category;
Calculating the coincidence parameter corresponding to each operation category according to the overlapping area of the text area and the adjusted training selected area corresponding to each operation category;
taking the operation type with the corresponding maximum coincidence parameter as a selected operation type, and taking the training selected area adjusted by the selected operation type as a selected training area;
Generating a training operation record matrix representing operation types which need to be sequentially executed from the initial training selected area to the selected training area, and generating a training local convolution feature matrix corresponding to the selected training area and a training global convolution feature matrix corresponding to the training image; and combining the training global convolution feature matrix, the training local convolution feature matrix, the training operation record matrix and the selected operation category into a set of training data sets.
It can be understood that, in the embodiment of the present invention, a preset number of area update operations are performed on the target image, and the selected area after those operations is output as the text area of the target image. Each area update operation comprises the steps of: acquiring the selected area of the target image; calculating, from the features of each pixel point in the target image and a preset neural network model, three matrices that respectively characterize the global features of the target image, the features of the selected area, and the historical operations; generating a state matrix based on the three matrices; generating the operation category corresponding to the state matrix through a preset decision model; and updating the current selected area according to that operation category. One adjustment at a time, the range of the selected area is gradually narrowed until the text area of the target image is obtained, improving the degree of automation of text area positioning.
FIG. 5 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in FIG. 5, the terminal device 5 of this embodiment includes: a processor 50, a memory 51, and a computer program 52, such as a text region positioning program, stored in the memory 51 and executable on the processor 50. The processor 50, when executing the computer program 52, implements the steps of the above-described embodiments of the text region positioning method, such as steps S101 to S105 shown in FIG. 1. Alternatively, the processor 50, when executing the computer program 52, performs the functions of the modules/units of the apparatus embodiments described above, for example the functions of the module 401 shown in FIG. 4.
By way of example, the computer program 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 52 in the terminal device 5.
The terminal device 5 may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device may include, but is not limited to, a processor 50, a memory 51. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the terminal device 5 and does not constitute a limitation of the terminal device 5, and may include more or less components than illustrated, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 50 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing the computer program as well as other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of the functional units and modules is illustrated; in practical application, the above functions may be distributed to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated units may be implemented in the form of hardware or in the form of a software functional unit. The specific names of the functional units and modules are only for convenience of distinction and are not intended to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not described here again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not detailed or recorded in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium.
The above embodiments are only for illustrating the technical solution of the present invention, not for limiting it. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included in the protection scope of the present invention.

Claims (10)

1. A text region positioning method, characterized by comprising the following steps:
performing area updating operation for a preset number of times on a target image, and outputting a selected area after the area updating operation for the preset number of times as the text area of the target image;
The area update operation includes:
Acquiring a selected area of a target image;
according to the characteristics of each pixel point in the target image, determining a global characteristic matrix of the target image and a local characteristic matrix of the selected area;
Respectively inputting the global feature matrix and the local feature matrix into a preset neural network to generate a global convolution feature matrix corresponding to the target image and a local convolution feature matrix corresponding to the selected area;
Acquiring a historical operation matrix, wherein the historical operation matrix is used for representing operation categories which are made by a plurality of endpoints of the selected area in time sequence, and combining the global convolution feature matrix, the local convolution feature matrix and the historical operation matrix into a state matrix;
And inputting the state matrix into a preset decision model, outputting an operation category, updating the historical operation matrix according to the operation category, and adjusting a plurality of endpoints of the selected area according to the operation category so as to update the selected area.
2. The text region positioning method according to claim 1, wherein inputting the global feature matrix and the local feature matrix into the preset neural network to generate the global convolution feature matrix corresponding to the target image and the local convolution feature matrix corresponding to the selected region comprises:
Respectively importing the global feature matrix and the local feature matrix into a preset convolutional neural network; then, starting from the convolutional layer with the largest layer number and stepping back by a first preset number of layers at each interval, extracting the data output by one convolutional layer, to serve as global selected data and local selected data respectively;
respectively carrying out global average pooling on a second preset number of global selected data and local selected data to generate a second preset number of global pooling vectors and local pooling vectors;
splicing the second preset number of global pooling vectors to generate a total global pooling vector, and splicing the second preset number of local pooling vectors to generate a total local pooling vector;
and respectively inputting the total global pooling vector and the total local pooling vector into a full-connection layer of the convolutional neural network, and outputting a global convolutional feature matrix corresponding to the target image and a local convolutional feature matrix corresponding to the selected area.
3. The text region positioning method according to claim 1, further comprising, before acquiring the selected area of the target image:
obtaining multiple sets of training data sets, wherein each set of training data sets comprises a training global convolution feature matrix used for representing global features of a training image, a training local convolution feature matrix used for representing features of a local area in the training image, a training operation record matrix used for representing operation types required to be sequentially executed when one area is reduced to another area in the training image, and training operation types required to be continuously executed;
generating a training state matrix according to the training global convolution feature matrix, the training local convolution feature matrix and the training operation record matrix, taking the training state matrix as the input of a long short-term memory network, taking the training operation category as the output of the long short-term memory network, and adjusting each learning parameter in the long short-term memory network so that the long short-term memory network meets the convergence condition:

θ* = arg max_θ Σ_{(sta, atc)} log p(atc | sta; θ)

wherein θ* is the learning parameter after adjustment; sta is a training state matrix; atc is the corresponding training operation category; p(atc|sta; θ) is the probability that, when the learning parameter takes the value θ, importing the training state matrix into the LSTM neural network outputs the training operation category; and arg max_θ Σ log p(atc|sta; θ) is the value of the learning parameter at which this summed log-probability is maximal;
And taking the adjusted long short-term memory network as the decision model.
4. The text region positioning method according to claim 3, further comprising:
Acquiring a training image and a candidate operation class set, wherein the training image is an image with a known text area, and the candidate operation class set comprises a plurality of operation classes which can be used for adjusting a selected area in the training image;
Repeatedly executing the cyclic operation a preset number of times, and outputting multiple sets of training data, wherein each round of the cyclic operation comprises the following steps:
Selecting an initial training selected area from the training image, adjusting the current training selected area according to each operation category in the candidate operation category set, and generating an adjusted training selected area corresponding to each operation category;
Calculating the coincidence parameter corresponding to each operation category according to the overlapping area of the text area and the adjusted training selected area corresponding to each operation category;
taking the operation type with the corresponding maximum coincidence parameter as a selected operation type, and taking the training selected area adjusted by the selected operation type as a selected training area;
Generating a training operation record matrix representing operation types which need to be sequentially executed from the initial training selected area to the selected training area, and generating a training local convolution feature matrix corresponding to the selected training area and a training global convolution feature matrix corresponding to the training image; and combining the training global convolution feature matrix, the training local convolution feature matrix, the training operation record matrix and the selected operation category into a set of training data sets.
5. The text region positioning method according to claim 4, wherein calculating the coincidence parameter corresponding to each operation category according to the overlapping area of the text area and the adjusted training selected area corresponding to each operation category comprises:
Calculating the coincidence parameter corresponding to each operation category by the formula

Co_i = Area(Select_i ∩ Text) / Area(Select_i ∪ Text)

wherein Co_i is the coincidence parameter corresponding to the i-th operation category in the candidate operation category set; Select_i is the adjusted training selected area corresponding to that operation category; Text is the text area; ∩ computes the intersection of the two areas; and ∪ computes their union.
6. A text region locating device, the device comprising:

an execution module, configured to execute an area-updating operation on a target image a preset number of times and to output the selected area after the preset number of area-updating operations as the text area of the target image;

wherein the execution module comprises:

an acquisition sub-module, configured to acquire a selected area of the target image;

a first matrix generation sub-module, configured to determine a global feature matrix of the target image and a local feature matrix of the selected area according to the features of each pixel point in the target image;

a second matrix generation sub-module, configured to input the global feature matrix and the local feature matrix respectively into a preset neural network to generate a global convolution feature matrix corresponding to the target image and a local convolution feature matrix corresponding to the selected area;

a combination sub-module, configured to obtain an initial historical operation matrix, wherein the historical operation matrix characterizes, in time sequence, the operation categories that have been applied to the plurality of endpoints of the selected area, and to combine the global convolution feature matrix, the local convolution feature matrix and the historical operation matrix into a state matrix; and

an update sub-module, configured to input the state matrix into a preset decision model to output an operation category, to adjust the plurality of endpoints of the selected area according to the operation category so as to update the selected area, and to update the historical operation matrix according to the operation category.
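Functionally, the execution module runs a fixed-length decision loop. The following self-contained sketch shows the shape of that loop with trivial stand-ins for the CNN feature extractor and the LSTM decision model; every name, the history encoding, and the three operations are illustrative assumptions, not the patent's modules.

```python
import numpy as np

def crop(image, r):
    x1, y1, x2, y2 = r
    return image[y1:y2, x1:x2]

def locate_text_region(image, region, featurize, decide, apply_op, n_rounds=20):
    """Execute the area-updating operation a preset number of times."""
    history = []  # operation categories already applied, in time order
    for _ in range(n_rounds):
        g = featurize(image)                 # global convolution features
        l = featurize(crop(image, region))   # local convolution features
        ops_so_far = np.bincount(np.asarray(history, dtype=int), minlength=3)
        state = np.concatenate([g, l, ops_so_far])  # state matrix (flattened)
        op = int(decide(state))              # decision model -> operation category
        region = apply_op(region, op)        # adjust the area's endpoints
        history.append(op)
    return region  # selected area after the preset number of updates

# Demo wiring with placeholder models (assumptions, not the patent's):
ops = [lambda r: (r[0] + 2, r[1], r[2] + 2, r[3]),          # shift right
       lambda r: (r[0], r[1] + 2, r[2], r[3] + 2),          # shift down
       lambda r: (r[0] + 1, r[1] + 1, r[2] - 1, r[3] - 1)]  # shrink
found = locate_text_region(
    np.random.rand(100, 100), (10, 10, 60, 60),
    featurize=lambda im: np.array([im.mean(), im.std()]),
    decide=lambda s: s.sum() % 3,            # placeholder decision policy
    apply_op=lambda r, i: ops[i](r))
```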
7. The text region locating device of claim 6, wherein the second matrix generation sub-module comprises:

an extraction sub-module, configured to import the global feature matrix and the local feature matrix respectively into a preset convolutional neural network and, starting from the convolutional layer with the largest layer number in the convolutional neural network and stepping back a first preset number of layers at a time, to extract the data output by each such convolutional layer as global selected data and local selected data respectively;

a pooling sub-module, configured to perform global average pooling on a second preset number of items of the global selected data and the local selected data to generate a second preset number of global pooling vectors and local pooling vectors;

a splicing sub-module, configured to splice the second preset number of global pooling vectors into a total global pooling vector and to splice the second preset number of local pooling vectors into a total local pooling vector; and

an output sub-module, configured to input the total global pooling vector and the total local pooling vector respectively into the fully connected layer of the convolutional neural network and to output the global convolution feature matrix corresponding to the target image and the local convolution feature matrix corresponding to the selected area.
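The feature path of claim 7 amounts to: tap intermediate outputs of the CNN at a fixed layer interval counting back from the deepest layer, global-average-pool each tapped output into one vector per channel, splice the pooled vectors, and pass the result through a fully connected layer. A minimal PyTorch sketch, with made-up layer sizes and tap positions:

```python
import torch
import torch.nn as nn

class TapPoolSplice(nn.Module):
    """Tap every other conv layer (an assumed interval), pool, splice, project."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(3, 32, 3, padding=1),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.Conv2d(64, 128, 3, padding=1),
        ])
        self.taps = {1, 3}                 # tapped layer indices (assumption)
        self.fc = nn.Linear(64 + 128, out_dim)

    def forward(self, x):
        pooled = []
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            if i in self.taps:
                pooled.append(x.mean(dim=(2, 3)))  # global average pooling
        total = torch.cat(pooled, dim=1)           # splice the pooled vectors
        return self.fc(total)                      # convolution feature matrix

# The same network serves both paths: the whole image yields the global
# convolution feature matrix, the crop of the selected area yields the local one.
net = TapPoolSplice()
global_feat = net(torch.rand(1, 3, 224, 224))
local_feat = net(torch.rand(1, 3, 64, 64))
```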
8. The text region locating device of claim 7, further comprising:

a training module, configured to acquire a plurality of groups of training data sets, wherein each group of training data sets comprises a training global convolution feature matrix representing the global features of a training image, a training local convolution feature matrix representing the features of a local area in the training image, a training operation record matrix representing the operation categories that need to be sequentially executed to adjust one area of the training image into another, and a training operation category that needs to be executed next;

to generate a training state matrix according to the training global convolution feature matrix, the training local convolution feature matrix and the training operation record matrix, to take the training state matrix as the input of a long short-term memory (LSTM) network, to take the training operation category as the output of the LSTM network, and to adjust each learning parameter in the LSTM network so that the LSTM network satisfies a convergence condition; the convergence condition being:

$$\theta^{*} = \arg\max_{\theta} \sum \log p(s_{ta} \mid a_{tc}; \theta)$$

wherein $\theta^{*}$ is the learning parameter after adjustment; $s_{ta}$ is the training state matrix; $a_{tc}$ is the training operation category; $p(s_{ta} \mid a_{tc}; \theta)$ is the probability, when the learning parameter takes the value $\theta$, that importing the training state matrix into the LSTM network yields the training operation category as the output; and $\arg\max_{\theta} \sum \log p(s_{ta} \mid a_{tc}; \theta)$ is the value of the learning parameter at which that probability is maximized;

and to take the adjusted LSTM network as the decision model.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN201910264868.5A 2019-04-03 2019-04-03 Text region positioning method and device, terminal equipment and medium Active CN110163202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910264868.5A CN110163202B (en) 2019-04-03 2019-04-03 Text region positioning method and device, terminal equipment and medium

Publications (2)

Publication Number Publication Date
CN110163202A (en) 2019-08-23
CN110163202B (en) 2024-06-04

Family

ID=67638922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910264868.5A Active CN110163202B (en) 2019-04-03 2019-04-03 Text region positioning method and device, terminal equipment and medium

Country Status (1)

Country Link
CN (1) CN110163202B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446899A (en) * 2016-09-22 2017-02-22 北京市商汤科技开发有限公司 Text detection method and device and text detection training method and device
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
US10032072B1 (en) * 2016-06-21 2018-07-24 A9.Com, Inc. Text recognition and localization with deep learning
CN109034152A (en) * 2018-07-17 2018-12-18 广东工业大学 License plate locating method and device based on LSTM-CNN built-up pattern
CN109492630A (en) * 2018-10-26 2019-03-19 信雅达系统工程股份有限公司 A method of the word area detection positioning in the financial industry image based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8031940B2 (en) * 2006-06-29 2011-10-04 Google Inc. Recognizing text in images using ranging data

Also Published As

Publication number Publication date
CN110163202A (en) 2019-08-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant