WO2021146937A1 - Text recognition method, text recognition device and storage medium - Google Patents
Text recognition method, text recognition device and storage medium
- Publication number
- WO2021146937A1 (PCT/CN2020/073576)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- group
- feature map
- convolution
- text box
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
Definitions
- ss1 is 5 and ss2 is 2.
- the first angle threshold is 80 degrees
- the second angle threshold is 10 degrees
- performing text detection on each intermediate input image to obtain the intermediate text box group corresponding to each intermediate input image includes: using a text detection neural network to perform text detection on each intermediate input image to determine the text detection area group corresponding to each intermediate input image; and using the minimum bounding rectangle algorithm to process the text detection area group to determine the intermediate text box group, wherein the text detection area group includes at least one text detection area, the at least one text detection area corresponds one-to-one with the at least one intermediate text box, and each intermediate text box covers a corresponding text detection area.
- the text recognition neural network is a multi-object rectified attention network (MORAN).
- the character recognition method provided by at least one embodiment of the present disclosure further includes: translating the target text to obtain and output the translation result of the target text.
- FIG. 11B is a schematic diagram of a model result of a text detection neural network based on a focus loss function provided by at least one embodiment of the present disclosure
- FIG. 13 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
- S104 Recognize the final target text box to obtain the target text.
- in step S1011, to handle the case where the pixel-connection algorithm does not adapt to changes in text scale in the input image, the input image can be transformed at different scales to construct an image pyramid (i.e., multiple intermediate input images), so that texts of various scales are covered and the accuracy of text detection is improved.
- the plurality of intermediate input images may include the input image, and the sizes of the plurality of intermediate input images are different from each other.
- the size of the input image is W*H, that is, the width of the input image is W and the height of the input image is H; the input image is scaled to adjust its size to 1.5*(W*H), 0.8*(W*H), 0.6*(W*H), and 0.4*(W*H) to obtain multiple intermediate input images, as sketched below.
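As a rough illustration of this scaling step, a minimal sketch in Python/OpenCV follows. Applying each factor to the width and height directly is an assumed reading of how the sizes above map to side lengths; the factor list mirrors the example sizes.

```python
import cv2

# Example scale factors taken from the sizes listed above; applying each
# factor to both width and height is an assumed reading of "0.8*(W*H)".
SCALES = (0.4, 0.6, 0.8, 1.0, 1.5)

def build_image_pyramid(image):
    """Return the intermediate input images used for multi-scale text detection."""
    h, w = image.shape[:2]
    pyramid = []
    for s in SCALES:
        size = (max(1, int(w * s)), max(1, int(h * s)))  # cv2.resize takes (width, height)
        pyramid.append(cv2.resize(image, size, interpolation=cv2.INTER_LINEAR))
    return pyramid
```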
- the plurality of intermediate input images may include a first intermediate input image, a second intermediate input image, a third intermediate input image, a fourth intermediate input image, and a fifth intermediate input image.
- FIG. 2A shows the first intermediate input image
- Fig. 2B shows the second intermediate input image
- the size of the second intermediate input image is 0.6*(W*H)
- FIG. 2C shows the third intermediate input image
- the size of the third intermediate input image is 0.8*(W*H)
- Figure 2D shows the fourth intermediate input image
- the size of the fourth intermediate input image is W*H, that is, the fourth intermediate input image is the input image itself.
- the number of intermediate text boxes in the intermediate text box group corresponding to the fifth intermediate input image may be 8.
- the text group contained in the intermediate text boxes of the intermediate text box group corresponding to the first intermediate input image includes the texts: "ur", "of", "French", "Spring's", "studio", "to", "view", and "desig";
- the text group contained in the intermediate text boxes of the intermediate text box group corresponding to the fifth intermediate input image also includes the texts: "ur", "of", "French", "Spring's", "studio", "to", "view", and "desig".
- the intermediate text box including "ur" corresponding to the first intermediate input image and the intermediate text box including "ur" corresponding to the fifth intermediate input image correspond to each other
- the intermediate text box including "French" corresponding to the first intermediate input image and the intermediate text box including "French" corresponding to the fifth intermediate input image correspond to each other, and so on.
- performing text detection on each intermediate input image to obtain the intermediate text box group corresponding to each intermediate input image includes: performing text detection on each intermediate input image using a text detection neural network to determine the text detection area group corresponding to that intermediate input image; the text detection area group is then processed with the minimum enclosing rectangle algorithm to determine the intermediate text box group.
- a text detection neural network can use a pixel link (PixelLink) algorithm for text detection.
- the text detection area group includes at least one text detection area, the at least one text detection area corresponds one-to-one with the at least one intermediate text box, and each intermediate text box includes, i.e. covers, the corresponding text detection area.
- the minimum bounding rectangle can be obtained, for example, with OpenCV-based contour detection (findContours), as sketched below.
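A minimal sketch of this post-processing step, assuming OpenCV: contours of a binary text-detection mask are extracted with findContours, and cv2.minAreaRect gives the minimum bounding rectangle that becomes an intermediate text box. The helper name and the mask input format are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def detection_areas_to_boxes(text_mask):
    """Turn a binary text-detection mask into minimum-bounding-rectangle boxes.

    Each returned box is a 4x2 float array of vertices covering one detected
    text region, so every intermediate text box covers its detection area.
    """
    contours, _ = cv2.findContours(
        text_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    boxes = []
    for contour in contours:
        rect = cv2.minAreaRect(contour)    # ((cx, cy), (w, h), angle)
        boxes.append(cv2.boxPoints(rect))  # the four rectangle vertices
    return boxes
```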
- the text detection neural network can use the VGG16 network as the feature extractor, and replace the fully connected layer in the VGG16 network with a convolutional layer.
- the method of feature fusion and pixel prediction is based on the idea of the FPN (feature pyramid network): from one convolution module of the text detection neural network to the next, the feature-map size is halved while the number of convolution kernels in the convolutional layers is doubled in turn.
- the text detection neural network may include a first convolution module 301 to a fifth convolution module 305, a first down-sampling module 306 to a fifth down-sampling module 310, and a fully connected module 311, together with up-sampling modules, dimensionality reduction modules (for example, the fourth dimensionality reduction module 318 mentioned below), and a classifier.
- the first convolution module 301 may include two convolution layers conv1_1 and conv1_2, and each convolution layer in the first convolution module 301 includes 8 convolution kernels;
- the second convolution module 302 may include two convolution layers conv2_1 and conv2_2, and each convolution layer in the second convolution module 302 includes 16 convolution kernels;
- the third convolution module 303 may include three convolution layers conv3_1 to conv3_3, and each convolution layer in the third convolution module 303 includes 32 convolution kernels;
- the fourth convolution module 304 may include three convolution layers conv4_1 to conv4_3, and each convolution layer in the fourth convolution module 304 includes 64 convolution kernels;
- the fifth convolution module 305 may include three convolution layers conv5_1 to conv5_3, and each convolution layer in the fifth convolution module 305 includes 128 convolution kernels.
- each convolutional layer includes an activation function; the activation function may be, for example, a ReLU function, the usual choice in a VGG16-style backbone.
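For concreteness, the slimmed-down VGG16-style backbone described above (kernel counts 8/16/32/64/128) might be sketched in PyTorch as below. The block builder, the ReLU activation, and 2x2 max-pooling as the down-sampling step are assumptions; the patent text only fixes the layer and kernel counts.

```python
import torch.nn as nn

def conv_module(in_ch, out_ch, n_layers):
    """A VGG-style block: n_layers 3x3 convolutions, each followed by ReLU."""
    layers, ch = [], in_ch
    for _ in range(n_layers):
        layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = out_ch
    return nn.Sequential(*layers)

# Kernel counts per module as described above: 8, 16, 32, 64, 128.
backbone = nn.ModuleList([
    conv_module(3, 8, 2),     # first convolution module (conv1_1, conv1_2)
    conv_module(8, 16, 2),    # second convolution module (conv2_1, conv2_2)
    conv_module(16, 32, 3),   # third convolution module (conv3_1..conv3_3)
    conv_module(32, 64, 3),   # fourth convolution module (conv4_1..conv4_3)
    conv_module(64, 128, 3),  # fifth convolution module (conv5_1..conv5_3)
])
# Each module would be followed by a 2x2 down-sampling step (e.g. nn.MaxPool2d(2)),
# which halves the feature-map size while the kernel count doubles.
```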
- using a text detection neural network to perform text detection on each intermediate input image to determine the text detection area group corresponding to each intermediate input image includes: using the first convolution module to perform convolution processing on each intermediate input image to obtain the first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain the first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain the second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain the second down-sampled feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain the third convolution feature map group; using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain the third down-sampled feature map group, and using the first dimensionality reduction module to perform dimensionality reduction processing on the third convolution feature map group to obtain the first dimensionality reduction feature map group; the processing continues analogously through the remaining modules (the full sequence is recited in claim 20).
- the size of each intermediate input image may be 512*512, the number of channels is 3, and the 3 channels are respectively a red channel, a blue channel, and a green channel.
- the number of feature maps in the first convolution feature map group CN1 is 8, and the size of each feature map in the first convolution feature map group CN1 may be 512*512;
- the number of feature maps in the second convolution feature map group CN2 is 16, and the size of each feature map in the second convolution feature map group CN2 may be 256*256;
- the number of feature maps in the third convolution feature map group CN3 is 32, and the size of each feature map in the third convolution feature map group CN3 may be 128*128;
- the number of feature maps in the fourth convolution feature map group CN4 is 64, and the size of each feature map in the fourth convolution feature map group CN4 may be 64*64;
- the number of feature maps in the fifth convolution feature map group CN5 is 128, and the size of each feature map in the fifth convolution feature map group CN5 may be 32*32.
- the fourth convolution feature map group CN4 is the input of the fourth down-sampling module 309, and the fourth down-sampling module 309 performs down-sampling processing on the fourth convolution feature map group CN4 to obtain the fourth down-sampled feature map group DP4.
- the number of feature maps in the fourth down-sampled feature map group DP4 is 64, and the size of each feature map in the fourth down-sampled feature map group DP4 is 32*32.
- the fourth down-sampling feature map group DP4 is the input of the fifth convolution module 305.
- the fifth convolution feature map group CN5 is the input of the fifth down-sampling module 310, and the fifth down-sampling module 310 performs down-sampling processing on the fifth convolution feature map group CN5 to obtain the fifth down-sampled feature map group DP5.
- the number of feature maps in the fifth down-sampled feature map group DP5 is 128, and the size of each feature map in the fifth down-sampled feature map group DP5 is 16*16.
- the fifth down-sampled feature map group DP5 is the input of the fully connected module 311.
- the sixth convolution feature map group CN6 is also the input of the fourth dimensionality reduction module 318, and the fourth dimensionality reduction module 318 performs dimensionality reduction processing on the sixth convolution feature map group CN6 to obtain the fourth dimensionality reduction feature map group DR4.
- the number of feature maps in the fourth dimensionality reduction feature map group DR4 is 10, and the size of each feature map in the fourth dimensionality reduction feature map group DR4 is 16*16.
- FIG. 4 is a schematic diagram of a pixel in a feature map and neighboring pixels of the pixel according to at least one embodiment of the present disclosure.
- a classification probability threshold may be set, for example, 0.7.
- if the connection prediction probability of a pixel is greater than or equal to the classification probability threshold, it means that the pixel can be connected to the adjacent pixel.
- the value of the pixel PX1 in the first classification feature map is 0.8, that is, the connection prediction probability (0.8) between the pixel PX1 and the pixel PX2 is greater than the classification probability threshold (0.7).
- the text detection area group can be determined with a union-find (disjoint-set) search. For example, each intermediate input image passes through the text detection neural network shown in Figure 3 to obtain the text/non-text (positive/negative) classification prediction probability of each pixel, as well as the link prediction probabilities between each pixel and its adjacent pixels in the four neighborhood directions. A sketch of the grouping step follows.
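Reading the "combined search" above as union-find grouping of positive pixels (the mechanism PixelLink-style detectors use to form text regions) gives the following minimal sketch; the (H, W, 4) layout of the link probabilities is also an assumption.

```python
import numpy as np

def link_pixels(text_prob, link_prob, thresh=0.7):
    """Group positive pixels into connected text regions with union-find.

    text_prob: (H, W) text classification probabilities.
    link_prob: (H, W, 4) link probabilities toward the 4 neighbors
               (left, right, up, down) -- an assumed layout.
    Returns an (H, W) array of region labels (0 = background).
    """
    h, w = text_prob.shape
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    def union(p, q):
        parent[find(p)] = find(q)

    positive = text_prob >= thresh
    for y in range(h):
        for x in range(w):
            if positive[y, x]:
                parent[(y, x)] = (y, x)
    offsets = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # left, right, up, down
    for y in range(h):
        for x in range(w):
            if not positive[y, x]:
                continue
            for d, (dy, dx) in enumerate(offsets):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and positive[ny, nx] \
                        and link_prob[y, x, d] >= thresh:
                    union((y, x), (ny, nx))
    labels = np.zeros((h, w), dtype=np.int32)
    ids = {}
    for p in parent:
        root = find(p)
        labels[p] = ids.setdefault(root, len(ids) + 1)
    return labels
```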
- the text detection neural network includes a first convolution module 501 to a fifth convolution module 505, a first down-sampling module 506 to a fifth down-sampling module 510, a fully connected module 511, a first up-sampling module 512 to a third up-sampling module 514, a first dimensionality reduction module 515 to a fifth dimensionality reduction module 519, and a classifier 520.
- using a text detection neural network to perform text detection on each intermediate input image to determine the text detection area group corresponding to each intermediate input image includes: using the first convolution module to perform convolution processing on the input image to obtain the first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain the first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain the second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain the second down-sampled feature map group, and using the first dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group to obtain the first dimensionality reduction feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain the third convolution feature map group; using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain the third down-sampled feature map group, and using the second dimensionality reduction module to perform dimensionality reduction processing on the third convolution feature map group to obtain the second dimensionality reduction feature map group; the processing continues analogously through the fourth and fifth convolution, down-sampling, and dimensionality reduction modules, the fully connected module, the up-sampling and fusion steps, and the classifier (the full sequence is recited in claim 22).
- the number of feature maps in each of the first fusion feature map group FU51 to the fourth fusion feature map group FU54 is 18.
- the size of each feature map in the first fusion feature map group FU51 is 32*32; the size of each feature map in the second fusion feature map group FU52 is 64*64; each feature map in the third fusion feature map group FU53
- the size of each feature map is 128*128; the size of each feature map in the fourth fusion feature map group FU54 is 256*256.
- the classifier 520 performs classification processing on the fourth fusion feature map group FU54 to obtain a text classification prediction map and a connection classification prediction map.
- the text classification prediction map includes 2 feature maps
- the connection classification prediction map includes 16 feature maps. It should be noted that each value in the feature maps of the text classification prediction map and the connection classification prediction map is greater than or equal to 0 and less than or equal to 1, and represents a text prediction probability or a connection prediction probability.
- the feature map in the text classification prediction map is a probability map indicating whether each pixel is text
- the feature maps in the connection classification prediction map are probability maps indicating whether each pixel is connected to each of its eight neighboring pixels.
- the text detection neural network shown in Figure 5 fuses the features extracted by the second to fifth convolution modules, while the text detection neural network shown in Figure 3 only fuses the features extracted by the third to fifth convolution modules. Therefore, compared with the text detection neural network shown in Figure 5, the text detection neural network shown in Figure 3 has a small network model and a small amount of calculation while maintaining detection accuracy: for example, the size of the network model is reduced by about 50 times and the calculation speed is increased by about 10 times, which reduces the computation of the text detection neural network, speeds up its calculation, reduces user waiting time, and improves user experience.
- FIG. 7A is the connection result based on the eight neighborhood directions of each pixel
- FIG. 7B is the connection result based on the four neighborhood directions of each pixel. It can be seen from FIG. 7A and FIG. 7B that in FIG. 7A, "any communications yet" is divided into the same text box and "subjects in" is also divided into the same text box, that is, text adhesion occurs.
- a text box can include multiple texts.
- step S1013 includes: for the i-th text box, determining the coordinate group of the i-th text box according to the coordinate groups corresponding to the i-th intermediate text boxes of the plurality of intermediate text box groups, thereby determining the coordinate groups of all text boxes in the text box group.
- the obtained text box group can be more accurate.
- the coordinate group corresponding to each i-th intermediate text box may be the coordinates of the four vertices of the rectangular i-th intermediate text box (for example, the upper-left, lower-left, upper-right, and lower-right vertices of the rectangle); the size and position of the i-th intermediate text box can be determined from the coordinates of these four vertices.
- the coordinate groups corresponding to the i-th intermediate text boxes of the plurality of intermediate text box groups may be weighted and summed to determine the coordinate group of the i-th text box.
- for example, the coordinate groups corresponding to the first i-th intermediate text box through the fifth i-th intermediate text box are weighted and averaged to determine the coordinate group of the i-th text box: the upper-left vertex coordinates of the first through fifth i-th intermediate text boxes are weighted and averaged to obtain the upper-left vertex coordinates of the i-th text box; the lower-left vertex coordinates are weighted and averaged to obtain the lower-left vertex coordinates of the i-th text box; the upper-right vertex coordinates are weighted and averaged to obtain the upper-right vertex coordinates of the i-th text box; and the lower-right vertex coordinates are weighted and averaged to obtain the lower-right vertex coordinates of the i-th text box.
- the method of determining the coordinate group of the i-th text box is not limited to the one described above; other suitable methods can also be used to determine the coordinate group of the i-th text box from the coordinate groups corresponding to the first through fifth i-th intermediate text boxes, which is not specifically limited in the present disclosure. A sketch of the averaging variant follows.
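One concrete reading of this per-vertex averaging, as a sketch: the i-th intermediate text boxes from all scales are merged by a weighted mean of corresponding vertices. Uniform weights are an assumption, since the text leaves the weights open.

```python
import numpy as np

def merge_boxes(boxes, weights=None):
    """Merge the i-th intermediate text boxes from all scales into one box.

    boxes: sequence of 4x2 vertex arrays, one per intermediate input image,
           with vertices in the same order (e.g. TL, BL, TR, BR).
    weights: optional per-scale weights; a uniform average by default.
    """
    boxes = np.asarray(boxes, dtype=np.float64)        # (n_scales, 4, 2)
    return np.average(boxes, axis=0, weights=weights)  # weighted mean per vertex
```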
- FIG. 8A is a schematic diagram of a text box group in an input image provided by another embodiment of the present disclosure
- FIG. 8B is a schematic diagram of a text box group in another input image provided by another embodiment of the present disclosure.
- the overlap between the area to be detected and each of the at least one text box in the input image is calculated separately, so that at least one overlap area can be determined.
- the text box corresponding to the largest overlap area in the at least one overlap area is used as the target text box.
- the text is the target text selected by the user.
- the third overlap area is the largest, that is, the overlap area between the text box containing the text "neural" and the area to be detected is the largest, so the text box containing the text "neural" is the target text box and the text "neural" is the target text. It should be noted that FIG. 8B only shows the target text box.
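A sketch of this selection step. Representing each box as an axis-aligned (x0, y0, x1, y1) tuple is an assumption for illustration; the overlap computation itself is the standard rectangle intersection.

```python
def pick_target_box(region, boxes):
    """Return the text box overlapping `region` the most, or None.

    region and each box are (x0, y0, x1, y1) axis-aligned rectangles with
    (x0, y0) the top-left corner and (x1, y1) the bottom-right corner.
    """
    def overlap(a, b):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(0, w) * max(0, h)

    if not boxes:
        return None
    areas = [overlap(region, box) for box in boxes]
    best = max(range(len(boxes)), key=lambda i: areas[i])
    return boxes[best] if areas[best] > 0 else None
```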
- determining the correction angle and correction direction for the target text box according to the deflection angles and coordinate groups of the at least one text box may include: determining the average deflection angle of the N text boxes according to the N deflection angles corresponding to the N text boxes; judging whether the average deflection angle is greater than the first angle threshold or less than the second angle threshold; in response to the average deflection angle being greater than the first angle threshold or less than the second angle threshold, determining that the correction angle for the target text box is 0 degrees; or, in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, determining the N aspect ratios corresponding to the N text boxes according to the N coordinate groups corresponding to the N text boxes, determining the correction direction for the target text box according to the N aspect ratios, and determining the correction angle according to the N deflection angles.
- the coordinate group of each text box in at least one text box includes the coordinates of at least three vertices of each text box.
- each text box has four vertices
- the coordinate group of each text box includes the coordinates of the three vertices or the coordinates of the four vertices of each text box.
- the target text box is the final target text box, and text recognition is performed directly on the final target text box (i.e., the target text box).
- the target text box needs to be rotated to obtain the final target text box, and then text recognition is performed on the final target text box.
- the vertex furthest from the X axis is taken as the first vertex T1, and the coordinates (x0, y0) of the first vertex T1 are determined; then, starting from the first vertex T1 and proceeding clockwise, the second vertex T2, the third vertex T3, and the fourth vertex T4 of the text box are obtained, and the coordinates (x1, y1) of the second vertex T2, the coordinates (x2, y2) of the third vertex T3, and the coordinates (x3, y3) of the fourth vertex T4 are determined.
- the width of the text box denotes the side of the text box that is reached first when rotating counterclockwise with the first vertex T1 as the origin
- the length of the text box denotes the side adjacent to the width.
- the width of the text box is expressed as Wd
- the length of the text box is expressed as Hg
- the aspect ratio of the text box is expressed as Hg/Wd.
- the width Wd of the text box is smaller than the length Hg of the text box.
- the width Wd of the text box may also be greater than or equal to the length Hg of the text box.
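Under the vertex convention just described (T1, then T2 to T4 clockwise), the width Wd, length Hg, and aspect ratio Hg/Wd might be computed as below. Which side adjacent to T1 counts as the width, and the deflection-angle formula, are assumptions based on the surrounding description.

```python
import math

def box_geometry(t1, t2, t4):
    """Compute width Wd, length Hg, aspect ratio Hg/Wd, and a deflection angle.

    t1, t2, t4: (x, y) vertices, t1 as defined above and t2, t4 its two
    clockwise neighbors. Treating side t1-t4 as the width and t1-t2 as the
    length is an assumed mapping of the description above.
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    wd = dist(t1, t4)  # assumed: side reached first when rotating counterclockwise
    hg = dist(t1, t2)  # assumed: the side adjacent to the width
    deflection = math.degrees(math.atan2(abs(t2[1] - t1[1]), abs(t2[0] - t1[0])))
    return wd, hg, hg / wd, deflection
```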
- the text box group is divided into a first text box subgroup and a second text box subgroup.
- the aspect ratio of each text box in the first text box subgroup is greater than or equal to 1, that is, the length of each text box in the first text box subgroup is greater than or equal to the width of the text box; for example, the text box shown in FIG. 9 is a text box in the first text box subgroup.
- the aspect ratio of each text box in the second text box subgroup is less than 1, that is, the length of each text box in the second text box subgroup is less than the width of the text box.
- r0 is 2, but the present disclosure is not limited to this, and the value of r0 can be set according to specific requirements.
- the character recognition method further includes: in response to the number of first text boxes and the number of second text boxes satisfying neither the first condition nor the second condition, determining that the correction angle used for the target text box is 0 degrees.
- the judgment rule for the correction direction is: if ra > rb + r0, the correction direction is counterclockwise; if ra + r0 < rb, the correction direction is clockwise; otherwise the correction direction is 0, that is, the direction is arbitrary and no correction is needed (see the sketch below).
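Put as code, the direction rule assembled from the conditions in the claims (ra > rb + r0 gives counterclockwise, ra + r0 < rb gives clockwise, anything else needs no correction) might read:

```python
def correction_direction(ra, rb, r0=2):
    """Decide the correction direction from the two subgroup sizes.

    ra: number of text boxes with aspect ratio >= 1 (first subgroup).
    rb: number of text boxes with aspect ratio < 1 (second subgroup).
    Returns 'counterclockwise', 'clockwise', or None when no correction is needed.
    """
    if ra > rb + r0:
        return "counterclockwise"
    if ra + r0 < rb:
        return "clockwise"
    return None
```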
- the correction angle can be determined according to N deflection angles.
- the target text box does not need to be corrected.
- the first angle quantity is the quantity of deflection angles in the first deflection angle group
- the second angle quantity is the quantity of deflection angles in the second deflection angle group
- the third angle quantity is the quantity of deflection angles in the third deflection angle group
- 1 ≤ i ≤ P, and ai represents the i-th deflection angle among the first deflection angle to the P-th deflection angle in the second deflection angle group.
- the correction angle used for the target text box is the deflection angle of the target text box. It should be noted that, in some embodiments, when the deflection angle of the target text box is greater than the first angle threshold or less than the second angle threshold, it can be determined that the correction angle is 0 degrees.
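A sketch of the N > 2 correction-angle rule assembled from the conditions above (s0 > ss1; s1 > s2 + ss2; s1 + ss2 < s2). Taking the first and second angle values to be the means of the second and third deflection-angle groups is an assumption, since the exact formulas are not reproduced in this text.

```python
def correction_angle(angles, ss1=5, ss2=2, gap=10):
    """Determine the correction angle from N deflection angles (N > 2).

    The sorted angles are split into: zero angles (first group), then the
    ascending run up to the first jump larger than `gap` degrees (second
    group), then the rest (third group).
    """
    angles = sorted(angles)
    zero = [a for a in angles if a == 0]
    rest = [a for a in angles if a != 0]
    split = len(rest)
    for i in range(len(rest) - 1):
        if rest[i + 1] - rest[i] > gap:
            split = i + 1
            break
    second, third = rest[:split], rest[split:]
    s0, s1, s2 = len(zero), len(second), len(third)
    if s0 > ss1:
        return 0.0                  # third condition: mostly-zero angles
    if s1 > s2 + ss2:
        return sum(second) / s1     # fourth condition: first angle value (assumed mean)
    if s1 + ss2 < s2:
        return sum(third) / s2      # fifth condition: second angle value (assumed mean)
    return 0.0                      # neither condition met
```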
- determining the correction direction for the target text box according to the aspect ratio of the target text box includes: in response to the aspect ratio of the target text box being greater than or equal to 1, determining that the correction direction is the counterclockwise direction; or, in response to the aspect ratio of the target text box being less than 1, determining that the correction direction is the clockwise direction.
- here, "in response to the correction angle" means in response to the correction angle not being 0 degrees.
- rotating the target text box according to the correction angle to obtain the final target text box includes: rotating the input image according to the correction angle and the correction direction so that the target text box is rotated to obtain the final target text box; or cutting out the target text box to obtain a cut target text box, and rotating the cut target text box according to the correction angle and the correction direction to obtain the final target text box.
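Either variant can be sketched with OpenCV's rotation utilities; the angle sign convention is an assumption (cv2 treats positive angles as counterclockwise):

```python
import cv2

def rotate_to_final_box(image, angle_deg, direction):
    """Rotate the whole input image so the target text box becomes horizontal.

    direction: 'counterclockwise' or 'clockwise'; since cv2 treats positive
    angles as counterclockwise, a clockwise correction negates the angle.
    The same call applies to a cut-out target text box instead of the image.
    """
    if direction == "clockwise":
        angle_deg = -angle_deg
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))
```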
- p_t represents the classification probability of the different categories (for example, the text prediction probability or the connection prediction probability)
- (1 - p_t) represents the adjustment coefficient
- γ represents the focus parameter and is a value greater than 0
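With these ingredients the focal loss takes its standard form, FL(p_t) = -(1 - p_t)^γ · log(p_t); a minimal PyTorch sketch, assuming p_t is already the predicted probability of the true class at each pixel:

```python
import torch

def focal_loss(p_t, gamma=2.0, eps=1e-7):
    """Focal loss FL(p_t) = -(1 - p_t)**gamma * log(p_t), averaged over pixels.

    p_t: tensor of probabilities assigned to the true class of each pixel
         (text/non-text or link/no-link); gamma > 0 is the focus parameter.
    gamma=2 is a common default, not a value specified in this text.
    """
    p_t = p_t.clamp(min=eps, max=1.0)
    return (-(1.0 - p_t) ** gamma * torch.log(p_t)).mean()
```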
- step S104 may include: using a text recognition neural network to perform recognition processing on the final target text box to obtain the intermediate text; and verifying the intermediate text to obtain the target text.
- the text recognition neural network is a multi-object rectified attention network (MORAN), and the multi-object rectified attention network may include a rectification sub-network (MORN) and a recognition sub-network (ASRN).
- the rectification sub-network decomposes the final target text box into multiple small images, regresses an offset for each small image, performs a smoothing operation on the offsets, and then performs a sampling operation on the final target text box to obtain a new horizontal text box with a more regular shape, which is the corrected final target text box.
- the recognition sub-network inputs the corrected final target text box into an attention-based convolutional recurrent neural network for text recognition, so as to obtain the recognized intermediate text.
- using a text detection neural network to perform text detection on the input image to determine the text box group includes: performing scale transformation processing on the input image to obtain multiple intermediate input images; for each intermediate input image of the multiple intermediate input images, using the text detection neural network to perform text detection on that intermediate input image to obtain the intermediate text box group corresponding to it, thereby obtaining multiple intermediate text box groups corresponding to the multiple intermediate input images, where each intermediate text box group includes at least one intermediate text box; and determining the text box group according to the multiple intermediate text box groups.
- the plurality of intermediate input images include the input image, and the sizes of the plurality of intermediate input images are different from each other. For the relevant description of the intermediate input images, reference can be made to the embodiment of the character recognition method above, which is not repeated here.
- the text recognition device 1200 further includes a translation pen 1250, and the translation pen 1250 is used to select the target text.
- the image acquisition device 1210 is arranged on the translation pen 1250.
- the image acquisition device 1210 may be a camera arranged on the translation pen 1250.
- the electronic device can receive the input image sent from the translation pen 1250 in a wired or wireless manner, and perform text recognition processing on the input image.
- the memory 1220 and the processor 1230 may also be integrated in a cloud server.
- the translation pen 1250 and the cloud server communicate in a wired or wireless manner.
- the cloud server receives the input image and performs text recognition processing on the input image.
- the text recognition device 1200 may further include an output device, and the output device is used to output the translation result of the target text.
- the output device may include a display, a speaker, a projector, etc.
- the display may be used to display the translation result of the target text
- the speaker may be used to output the translation result of the target text in the form of voice.
- the translation pen 1250 may further include a communication module, which is used to implement communication between the translation pen 1250 and the output device, for example, to transmit the translation result to the output device.
- the processor 1230 may control other components in the character recognition device 1200 to perform desired functions.
- the processor 1230 may be a central processing unit (CPU), a tensor processing unit (TPU), or another device with data processing capabilities and/or program execution capabilities.
- the central processing unit (CPU) can adopt an X86 or ARM architecture.
- a graphics processing unit (GPU) can be integrated directly on the motherboard alone or built into the north bridge chip of the motherboard; the GPU can also be built into the central processing unit (CPU).
- the memory 1220 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
- Volatile memory may include random access memory (RAM) and/or cache memory (cache), for example.
- Non-volatile memory may include, for example, read only memory (ROM), hard disk, erasable programmable read only memory (EPROM), portable compact disk read only memory (CD-ROM), USB memory, flash memory, etc.
- One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 1230 may run the computer-readable instructions to implement various functions of the text recognition apparatus 1200.
- the network may include a wireless network, a wired network, and/or any combination of a wireless network and a wired network.
- the network may include a local area network, the Internet, a telecommunications network, an Internet of Things based on the Internet and/or a telecommunications network, and/or any combination of the above networks, and so on.
- the wired network may, for example, use twisted pair, coaxial cable, or optical fiber transmission for communication, and the wireless network may use, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi.
- the present disclosure does not limit the types and functions of the network here.
- FIG. 13 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
- one or more computer-readable instructions 1301 may be stored on the storage medium 1300 non-transitory.
- the computer-readable instructions 1301 are executed by a computer, one or more steps in the character recognition method described above can be executed.
Claims (40)
- A text recognition method, comprising: acquiring an input image; performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; acquiring a coordinate group of the at least one text box and a deflection angle relative to a reference direction, determining a correction angle and a correction direction for the target text box according to the deflection angle and the coordinate group of the at least one text box, and rotating the target text box according to the correction angle and the correction direction to obtain a final target text box; and recognizing the final target text box to obtain the target text.
- The text recognition method according to claim 1, wherein the at least one text box comprises N text boxes, N being a positive integer greater than 2, and determining the correction angle and the correction direction for the target text box according to the deflection angle and the coordinate group of the at least one text box comprises: determining an average deflection angle of the N text boxes according to N deflection angles corresponding to the N text boxes; judging whether the average deflection angle is greater than a first angle threshold or less than a second angle threshold; in response to the average deflection angle being greater than the first angle threshold or less than the second angle threshold, determining that the correction angle for the target text box is 0 degrees; or, in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, determining N aspect ratios respectively corresponding to the N text boxes according to N coordinate groups corresponding to the N text boxes, determining the correction direction for the target text box according to the N aspect ratios, and, in response to the correction direction, determining the correction angle according to the N deflection angles.
- The text recognition method according to claim 2, wherein determining the correction direction for the target text box according to the N aspect ratios comprises: dividing the N text boxes into a first text box subgroup and a second text box subgroup according to the N aspect ratios, wherein the aspect ratio of each text box in the first text box subgroup is greater than or equal to 1 and the aspect ratio of each text box in the second text box subgroup is less than 1; determining a first text box quantity and a second text box quantity according to the first text box subgroup and the second text box subgroup, wherein the first text box quantity is the number of text boxes in the first text box subgroup and the second text box quantity is the number of text boxes in the second text box subgroup; and determining the correction direction according to the first text box quantity and the second text box quantity.
- The text recognition method according to claim 3, wherein determining the correction direction according to the first text box quantity and the second text box quantity comprises: in response to the first text box quantity and the second text box quantity satisfying a first condition, determining that the correction direction is the counterclockwise direction; or, in response to the first text box quantity and the second text box quantity satisfying a second condition, determining that the correction direction is the clockwise direction, wherein the first condition is ra > rb + r0, the second condition is ra + r0 < rb, ra is the first text box quantity, rb is the second text box quantity, and r0 is a constant.
- The text recognition method according to claim 4, wherein, in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, the text recognition method further comprises: in response to the first text box quantity and the second text box quantity satisfying neither the first condition nor the second condition, determining that the correction angle for the target text box is 0 degrees.
- The text recognition method according to claim 4 or 5, wherein r0 is 2.
- The text recognition method according to any one of claims 2-6, wherein, in response to the correction direction, determining the correction angle according to the N deflection angles comprises: in response to the correction direction, sorting the N deflection angles in ascending order to obtain a first deflection angle to an N-th deflection angle, wherein the difference between the P-th deflection angle and the (P+1)-th deflection angle among the N deflection angles is greater than 10 degrees, P being a positive integer less than N; dividing the N deflection angles into a first deflection angle group, a second deflection angle group, and a third deflection angle group, wherein the deflection angles in the first deflection angle group are all 0 degrees, the second deflection angle group comprises the first deflection angle to the P-th deflection angle, and the third deflection angle group comprises the (P+1)-th deflection angle to the N-th deflection angle; determining a first angle quantity, a second angle quantity, and a third angle quantity according to the first deflection angle group, the second deflection angle group, and the third deflection angle group, wherein the first angle quantity is the number of deflection angles in the first deflection angle group, the second angle quantity is the number of deflection angles in the second deflection angle group, and the third angle quantity is the number of deflection angles in the third deflection angle group; and determining the correction angle according to the first angle quantity, the second angle quantity, and the third angle quantity.
- The text recognition method according to claim 7, wherein determining the correction angle according to the first angle quantity, the second angle quantity, and the third angle quantity comprises: in response to the first angle quantity satisfying a third condition, determining that the correction angle is 0 degrees; or, in response to the first angle quantity not satisfying the third condition and the second angle quantity and the third angle quantity satisfying a fourth condition, determining that the correction angle is a first angle value; or, in response to the first angle quantity not satisfying the third condition and the second angle quantity and the third angle quantity satisfying a fifth condition, determining that the correction angle is a second angle value; or, in response to the first angle quantity not satisfying the third condition and the second angle quantity and the third angle quantity satisfying neither the fourth condition nor the fifth condition, determining that the correction angle is 0 degrees; wherein the third condition is s0 > ss1, the fourth condition is s1 > s2 + ss2, and the fifth condition is s1 + ss2 < s2, s0 being the first angle quantity, s1 being the second angle quantity, s2 being the third angle quantity, and ss1 and ss2 being constants; the first angle value is expressed as: wherein 1 ≤ i ≤ P, and ai represents the i-th deflection angle among the first deflection angle to the P-th deflection angle in the second deflection angle group; and the second angle value is expressed as: wherein P+1 ≤ j ≤ N, and aj represents the j-th deflection angle among the (P+1)-th deflection angle to the N-th deflection angle in the third deflection angle group.
- The text recognition method according to claim 8, wherein ss1 is 5 and ss2 is 2.
- The text recognition method according to any one of claims 2-9, wherein the first angle threshold is 80 degrees and the second angle threshold is 10 degrees.
- The text recognition method according to any one of claims 2-10, wherein the deflection angle of the final target text box relative to the reference direction is greater than the first angle threshold or less than the second angle threshold.
- The text recognition method according to claim 1, wherein the at least one text box comprises N text boxes, N being 1 or 2, and determining the correction angle and the correction direction for the target text box according to the deflection angle and the coordinate group of the at least one text box comprises: determining the correction angle for the target text box according to the deflection angle of the target text box; in response to the correction angle, determining the aspect ratio of the target text box according to the coordinate group of the target text box; and determining the correction direction for the target text box according to the aspect ratio of the target text box.
- The text recognition method according to claim 12, wherein determining the correction direction for the target text box according to the aspect ratio of the target text box comprises: in response to the aspect ratio of the target text box being greater than or equal to 1, determining that the correction direction is the counterclockwise direction; or, in response to the aspect ratio of the target text box being less than 1, determining that the correction direction is the clockwise direction.
- The text recognition method according to any one of claims 1-13, wherein the at least one text box is a rectangular box, and the coordinate group of each text box in the at least one text box comprises the coordinates of at least three vertices of that text box.
- The text recognition method according to any one of claims 1-14, wherein the deflection angle of each text box in the at least one text box is greater than or equal to 0 degrees and less than or equal to 90 degrees.
- The text recognition method according to any one of claims 1-15, wherein rotating the target text box according to the correction angle and the correction direction to obtain the final target text box comprises: rotating the input image according to the correction angle and the correction direction so that the target text box is rotated to obtain the final target text box; or cutting the target text box to obtain a cut target text box, and rotating the cut target text box according to the correction angle and the correction direction to obtain the final target text box.
- The text recognition method according to any one of claims 1-16, wherein performing text detection on the input image to determine the text box group comprises: performing scale transformation processing on the input image to obtain a plurality of intermediate input images, wherein the plurality of intermediate input images comprise the input image and the sizes of the plurality of intermediate input images are different from each other; for each intermediate input image of the plurality of intermediate input images, performing text detection on that intermediate input image to obtain an intermediate text box group corresponding to it, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box; and determining the text box group according to the plurality of intermediate text box groups.
- The text recognition method according to claim 17, wherein the at least one intermediate text box corresponds one-to-one with the at least one text box, each intermediate text box group comprises an i-th intermediate text box, the text box group comprises an i-th text box, the i-th intermediate text box corresponds to the i-th text box, i is greater than or equal to 1 and less than or equal to the number of intermediate text boxes in each intermediate text box group, and determining the text box group according to the plurality of intermediate text box groups comprises: for the i-th text box, determining the coordinate group of the i-th text box according to the coordinate groups corresponding to the i-th intermediate text boxes of the plurality of intermediate text box groups, thereby determining the text box group.
- The text recognition method according to claim 17 or 18, wherein performing text detection on each intermediate input image to obtain the intermediate text box group corresponding to that intermediate input image comprises: performing text detection on each intermediate input image using a text detection neural network to determine a text detection area group corresponding to that intermediate input image; and processing the text detection area group using a minimum bounding rectangle algorithm to determine the intermediate text box group, wherein the text detection area group comprises at least one text detection area, the at least one text detection area corresponds one-to-one with the at least one intermediate text box, and each intermediate text box covers its corresponding text detection area.
- The text recognition method according to claim 19, wherein the text detection neural network comprises a first convolution module to a fifth convolution module, a first down-sampling module to a fifth down-sampling module, a fully connected module, a first up-sampling module to a third up-sampling module, a first dimensionality reduction module to a fourth dimensionality reduction module, and a classifier, and performing text detection on each intermediate input image using the text detection neural network to determine the text detection area group corresponding to each intermediate input image comprises: using the first convolution module to perform convolution processing on each intermediate input image to obtain a first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain a first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain a second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain a second down-sampled feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain a third convolution feature map group; using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain a third down-sampled feature map group, and using the first dimensionality reduction module to perform dimensionality reduction processing on the third convolution feature map group to obtain a first dimensionality reduction feature map group; using the fourth convolution module to perform convolution processing on the third down-sampled feature map group to obtain a fourth convolution feature map group; using the fourth down-sampling module to perform down-sampling processing on the fourth convolution feature map group to obtain a fourth down-sampled feature map group, and using the second dimensionality reduction module to perform dimensionality reduction processing on the fourth convolution feature map group to obtain a second dimensionality reduction feature map group; using the fifth convolution module to perform convolution processing on the fourth down-sampled feature map group to obtain a fifth convolution feature map group; using the fifth down-sampling module to perform down-sampling processing on the fifth convolution feature map group to obtain a fifth down-sampled feature map group, and using the third dimensionality reduction module to perform dimensionality reduction processing on the fifth convolution feature map group to obtain a third dimensionality reduction feature map group; using the fully connected module to perform convolution processing on the fifth down-sampled feature map group to obtain a sixth convolution feature map group; using the fourth dimensionality reduction module to perform dimensionality reduction processing on the sixth convolution feature map group to obtain a fourth dimensionality reduction feature map group; using the first up-sampling module to perform up-sampling processing on the fourth dimensionality reduction feature map group to obtain a first up-sampled feature map group; performing fusion processing on the first up-sampled feature map group and the third dimensionality reduction feature map group to obtain a first fusion feature map group; using the second up-sampling module to perform up-sampling processing on the first fusion feature map group to obtain a second up-sampled feature map group; performing fusion processing on the second up-sampled feature map group and the second dimensionality reduction feature map group to obtain a second fusion feature map group; using the third up-sampling module to perform up-sampling processing on the second fusion feature map group to obtain a third up-sampled feature map group; performing fusion processing on the third up-sampled feature map group and the first dimensionality reduction feature map group to obtain a third fusion feature map group; using the classifier to perform classification processing on the third fusion feature map group to obtain a text classification prediction map and a connection classification prediction map; and determining the text detection area group according to the connection classification prediction map and the text classification prediction map.
- The text recognition method according to claim 20, wherein the number of feature maps in the first convolution feature map group is 8, the number of feature maps in the second convolution feature map group is 16, the number of feature maps in the third convolution feature map group is 32, the number of feature maps in the fourth convolution feature map group is 64, the number of feature maps in the fifth convolution feature map group is 128, the number of feature maps in the sixth convolution feature map group is 256, the number of feature maps in the first dimensionality reduction feature map group is 10, the number of feature maps in the second dimensionality reduction feature map group is 10, the number of feature maps in the third dimensionality reduction feature map group is 10, and the number of feature maps in the fourth dimensionality reduction feature map group is 10.
- The text recognition method according to claim 19, wherein the text detection neural network comprises a first convolution module to a fifth convolution module, a first down-sampling module to a fifth down-sampling module, a fully connected module, a first up-sampling module to a third up-sampling module, a first dimensionality reduction module to a fifth dimensionality reduction module, and a classifier, and performing text detection on each intermediate input image using the text detection neural network to determine the text detection area group corresponding to each intermediate input image comprises: using the first convolution module to perform convolution processing on the input image to obtain a first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain a first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain a second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain a second down-sampled feature map group, and using the first dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group to obtain a first dimensionality reduction feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain a third convolution feature map group; using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain a third down-sampled feature map group, and using the second dimensionality reduction module to perform dimensionality reduction processing on the third convolution feature map group to obtain a second dimensionality reduction feature map group; using the fourth convolution module to perform convolution processing on the third down-sampled feature map group to obtain a fourth convolution feature map group; using the fourth down-sampling module to perform down-sampling processing on the fourth convolution feature map group to obtain a fourth down-sampled feature map group, and using the third dimensionality reduction module to perform dimensionality reduction processing on the fourth convolution feature map group to obtain a third dimensionality reduction feature map group; using the fifth convolution module to perform convolution processing on the fourth down-sampled feature map group to obtain a fifth convolution feature map group; using the fifth down-sampling module to perform down-sampling processing on the fifth convolution feature map group to obtain a fifth down-sampled feature map group, and using the fourth dimensionality reduction module to perform dimensionality reduction processing on the fifth convolution feature map group to obtain a fourth dimensionality reduction feature map group; using the fully connected module to perform convolution processing on the fifth down-sampled feature map group to obtain a sixth convolution feature map group; using the fifth dimensionality reduction module to perform dimensionality reduction processing on the sixth convolution feature map group to obtain a fifth dimensionality reduction feature map group; performing fusion processing on the fourth dimensionality reduction feature map group and the fifth dimensionality reduction feature map group to obtain a first fusion feature map group; using the first up-sampling module to perform up-sampling processing on the first fusion feature map group to obtain a first up-sampled feature map group; performing fusion processing on the first up-sampled feature map group and the third dimensionality reduction feature map group to obtain a second fusion feature map group; using the second up-sampling module to perform up-sampling processing on the second fusion feature map group to obtain a second up-sampled feature map group; performing fusion processing on the second up-sampled feature map group and the second dimensionality reduction feature map group to obtain a third fusion feature map group; using the third up-sampling module to perform up-sampling processing on the third fusion feature map group to obtain a third up-sampled feature map group; performing fusion processing on the third up-sampled feature map group and the first dimensionality reduction feature map group to obtain a fourth fusion feature map group; using the classifier to perform classification processing on the fourth fusion feature map group to obtain a text classification prediction map and a connection classification prediction map; and determining the text detection area group according to the connection classification prediction map and the text classification prediction map.
- The text recognition method according to claim 22, wherein the number of feature maps in the first convolution feature map group is 64, the number of feature maps in the second convolution feature map group is 128, the number of feature maps in the third convolution feature map group is 256, the number of feature maps in the fourth convolution feature map group is 512, the number of feature maps in the fifth convolution feature map group is 512, the number of feature maps in the sixth convolution feature map group is 512, and the number of feature maps in each of the first dimensionality reduction feature map group to the fifth dimensionality reduction feature map group is 18.
- The text recognition method according to any one of claims 19-23, wherein, before acquiring the input image, the text recognition method further comprises: training a text detection neural network to be trained to obtain the text detection neural network, wherein training the text detection neural network to be trained to obtain the text detection neural network comprises: acquiring a training input image and a target text detection area group; processing the training input image using the text detection neural network to be trained to obtain a training text detection area group; calculating a loss value of the text detection neural network to be trained through a loss function according to the target text detection area group and the training text detection area group; and correcting parameters of the text detection neural network to be trained according to the loss value, obtaining the trained text detection neural network when the loss function satisfies a predetermined condition, and continuing to input the training input image and the target text detection area group to repeat the above training process when the loss function does not satisfy the predetermined condition.
- The text recognition method according to claim 24, wherein the loss function comprises a focal loss function.
- The text recognition method according to any one of claims 1-25, wherein determining the target text box from the at least one text box comprises: determining the position of the tip of a translation pen; marking an area to be detected in the input image based on the position of the pen tip; determining at least one overlap area between the area to be detected and the at least one text box respectively; and determining the text box corresponding to the largest overlap area among the at least one overlap area as the target text box.
- The text recognition method according to any one of claims 1-26, wherein recognizing the final target text box to obtain the target text comprises: performing recognition processing on the final target text box using the text recognition neural network to obtain intermediate text; and verifying the intermediate text to obtain the target text.
- The text recognition method according to claim 27, wherein the text recognition neural network is a multi-object rectified attention network.
- The text recognition method according to any one of claims 1-28, further comprising: translating the target text to obtain and output a translation result of the target text.
- A text recognition method, comprising: acquiring an input image; performing text detection on the input image using a text detection neural network to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; rotating the target text box to obtain a final target text box; and recognizing the final target text box to obtain the target text, wherein the text detection neural network comprises a first convolution module to a fifth convolution module and a first dimensionality reduction module to a fourth dimensionality reduction module, the number of convolution kernels in each convolution layer of the first convolution module is 8, the number of convolution kernels in each convolution layer of the second convolution module is 16, the number of convolution kernels in each convolution layer of the third convolution module is 32, the number of convolution kernels in each convolution layer of the fourth convolution module is 64, the number of convolution kernels in each convolution layer of the fifth convolution module is 128, the number of convolution kernels in each convolution layer of the first dimensionality reduction module is 10, the number of convolution kernels in each convolution layer of the second dimensionality reduction module is 10, the number of convolution kernels in each convolution layer of the third dimensionality reduction module is 10, and the number of convolution kernels in each convolution layer of the fourth dimensionality reduction module is 10.
- The text recognition method according to claim 30, wherein performing text detection on the input image using the text detection neural network to determine the text box group comprises: performing scale transformation processing on the input image to obtain a plurality of intermediate input images, wherein the plurality of intermediate input images comprise the input image and the sizes of the plurality of intermediate input images are different from each other; for each intermediate input image of the plurality of intermediate input images, performing text detection on that intermediate input image using the text detection neural network to obtain an intermediate text box group corresponding to it, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box; and determining the text box group according to the plurality of intermediate text box groups.
- The text recognition method according to claim 31, wherein the at least one intermediate text box corresponds one-to-one with the at least one text box, each intermediate text box group comprises an i-th intermediate text box, the text box group comprises an i-th text box, the i-th intermediate text box corresponds to the i-th text box, i is greater than or equal to 1 and less than or equal to the number of intermediate text boxes in each intermediate text box group, and determining the text box group according to the plurality of intermediate text box groups comprises: for the i-th text box, determining the coordinate group of the i-th text box according to the coordinate groups corresponding to the i-th intermediate text boxes of the plurality of intermediate text box groups, thereby determining the text box group.
- The text recognition method according to claim 31 or 32, wherein performing text detection on each intermediate input image using the text detection neural network to obtain the intermediate text box group corresponding to that intermediate input image comprises: performing text detection on each intermediate input image using the text detection neural network to determine a text detection area group corresponding to that intermediate input image; and processing the text detection area group using a minimum bounding rectangle algorithm to determine the intermediate text box group, wherein the text detection area group comprises at least one text detection area, the at least one text detection area corresponds one-to-one with the at least one intermediate text box, and each intermediate text box covers its corresponding text detection area.
- The text recognition method according to claim 33, wherein the text detection neural network further comprises a first down-sampling module to a fifth down-sampling module, a fully connected module, a first up-sampling module to a third up-sampling module, and a classifier, and performing text detection on each intermediate input image using the text detection neural network to determine the text detection area group corresponding to each intermediate input image comprises: using the first convolution module to perform convolution processing on each intermediate input image to obtain a first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain a first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain a second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain a second down-sampled feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain a third convolution feature map group; using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain a third down-sampled feature map group, and using the first dimensionality reduction module to perform dimensionality reduction processing on the third convolution feature map group to obtain a first dimensionality reduction feature map group; using the fourth convolution module to perform convolution processing on the third down-sampled feature map group to obtain a fourth convolution feature map group; using the fourth down-sampling module to perform down-sampling processing on the fourth convolution feature map group to obtain a fourth down-sampled feature map group, and using the second dimensionality reduction module to perform dimensionality reduction processing on the fourth convolution feature map group to obtain a second dimensionality reduction feature map group; using the fifth convolution module to perform convolution processing on the fourth down-sampled feature map group to obtain a fifth convolution feature map group; using the fifth down-sampling module to perform down-sampling processing on the fifth convolution feature map group to obtain a fifth down-sampled feature map group, and using the third dimensionality reduction module to perform dimensionality reduction processing on the fifth convolution feature map group to obtain a third dimensionality reduction feature map group; using the fully connected module to perform convolution processing on the fifth down-sampled feature map group to obtain a sixth convolution feature map group; using the fourth dimensionality reduction module to perform dimensionality reduction processing on the sixth convolution feature map group to obtain a fourth dimensionality reduction feature map group; using the first up-sampling module to perform up-sampling processing on the fourth dimensionality reduction feature map group to obtain a first up-sampled feature map group; performing fusion processing on the first up-sampled feature map group and the third dimensionality reduction feature map group to obtain a first fusion feature map group; using the second up-sampling module to perform up-sampling processing on the first fusion feature map group to obtain a second up-sampled feature map group; performing fusion processing on the second up-sampled feature map group and the second dimensionality reduction feature map group to obtain a second fusion feature map group; using the third up-sampling module to perform up-sampling processing on the second fusion feature map group to obtain a third up-sampled feature map group; performing fusion processing on the third up-sampled feature map group and the first dimensionality reduction feature map group to obtain a third fusion feature map group; using the classifier to perform classification processing on the third fusion feature map group to obtain a text classification prediction map and a connection classification prediction map; and determining the text detection area group according to the connection classification prediction map and the text classification prediction map.
- The text recognition method according to claim 34, wherein the number of feature maps in the first convolution feature map group is 8, the number of feature maps in the second convolution feature map group is 16, the number of feature maps in the third convolution feature map group is 32, the number of feature maps in the fourth convolution feature map group is 64, the number of feature maps in the fifth convolution feature map group is 128, the number of feature maps in the sixth convolution feature map group is 256, the number of feature maps in the first dimensionality reduction feature map group is 10, the number of feature maps in the second dimensionality reduction feature map group is 10, the number of feature maps in the third dimensionality reduction feature map group is 10, and the number of feature maps in the fourth dimensionality reduction feature map group is 10.
- The text recognition method according to any one of claims 30-35, wherein, before acquiring the input image, the text recognition method further comprises: training a text detection neural network to be trained to obtain the text detection neural network, wherein training the text detection neural network to be trained to obtain the text detection neural network comprises: acquiring a training input image and a target text detection area group; processing the training input image using the text detection neural network to be trained to obtain a training text detection area group; calculating a loss value of the text detection neural network to be trained through a loss function according to the target text detection area group and the training text detection area group; and correcting parameters of the text detection neural network to be trained according to the loss value, obtaining the trained text detection neural network when the loss function satisfies a predetermined condition, and continuing to input the training input image and the target text detection area group to repeat the above training process when the loss function does not satisfy the predetermined condition.
- The text recognition method according to claim 36, wherein the loss function comprises a focal loss function.
- A text recognition device, comprising: an image acquisition device configured to acquire an input image; a memory configured to store the input image and computer-readable instructions; and a processor configured to read the input image and run the computer-readable instructions, wherein the computer-readable instructions, when run by the processor, execute the text recognition method according to any one of claims 1-37.
- The text recognition device according to claim 38, further comprising a translation pen, wherein the image acquisition device is arranged on the translation pen, and the translation pen is used to select the target text.
- A storage medium storing computer-readable instructions non-transitorily, wherein, when the computer-readable instructions are executed by a computer, the text recognition method according to any one of claims 1-37 can be executed.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/073576 WO2021146937A1 (zh) | 2020-01-21 | 2020-01-21 | Text recognition method, text recognition device and storage medium |
CN202080000058.XA CN113498520B (zh) | 2020-01-21 | 2020-01-21 | Text recognition method, text recognition device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/073576 WO2021146937A1 (zh) | 2020-01-21 | 2020-01-21 | Text recognition method, text recognition device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021146937A1 (zh) | 2021-07-29 |
Family
ID=76992750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/073576 WO2021146937A1 (zh) | Text recognition method, text recognition device and storage medium | 2020-01-21 | 2020-01-21 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113498520B (zh) |
WO (1) | WO2021146937A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627427A (zh) * | 2021-08-04 | 2021-11-09 | 中国兵器装备集团自动化研究所有限公司 | Instrument and meter reading method and system based on image detection technology |
CN114757304A (zh) * | 2022-06-10 | 2022-07-15 | 北京芯盾时代科技有限公司 | Data recognition method, apparatus, device, and storage medium |
CN116740721A (zh) * | 2023-08-15 | 2023-09-12 | 深圳市玩瞳科技有限公司 | Finger-pointed sentence lookup method and apparatus, electronic device, and computer storage medium |
CN117809318A (zh) * | 2024-03-01 | 2024-04-02 | 微山同在电子信息科技有限公司 | Machine-vision-based oracle bone script recognition method and system |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116958981B (zh) * | 2023-05-31 | 2024-04-30 | 广东南方网络信息科技有限公司 | Text recognition method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720316B2 (en) * | 2006-09-05 | 2010-05-18 | Microsoft Corporation | Constraint-based correction of handwriting recognition errors |
CN110490198A (zh) * | 2019-08-12 | 2019-11-22 | 上海眼控科技股份有限公司 | Text direction correction method, apparatus, computer device and storage medium |
CN110659633A (zh) * | 2019-08-15 | 2020-01-07 | 坎德拉(深圳)科技创新有限公司 | Method, device and storage medium for recognizing image text information |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016004330A1 (en) * | 2014-07-03 | 2016-01-07 | Oim Squared Inc. | Interactive content generation |
CN109635805B (zh) * | 2018-12-11 | 2022-01-11 | 上海智臻智能网络科技股份有限公司 | Image text positioning method and device, and image text recognition method and device |
- 2020
- 2020-01-21 CN CN202080000058.XA patent/CN113498520B/zh active Active
- 2020-01-21 WO PCT/CN2020/073576 patent/WO2021146937A1/zh active Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720316B2 (en) * | 2006-09-05 | 2010-05-18 | Microsoft Corporation | Constraint-based correction of handwriting recognition errors |
CN110490198A (zh) * | 2019-08-12 | 2019-11-22 | 上海眼控科技股份有限公司 | Text direction correction method, apparatus, computer device and storage medium |
CN110659633A (zh) * | 2019-08-15 | 2020-01-07 | 坎德拉(深圳)科技创新有限公司 | Method, device and storage medium for recognizing image text information |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113627427A (zh) * | 2021-08-04 | 2021-11-09 | 中国兵器装备集团自动化研究所有限公司 | Instrument and meter reading method and system based on image detection technology |
CN113627427B (zh) * | 2021-08-04 | 2023-09-22 | 中国兵器装备集团自动化研究所有限公司 | Instrument and meter reading method and system based on image detection technology |
CN114757304A (zh) * | 2022-06-10 | 2022-07-15 | 北京芯盾时代科技有限公司 | Data recognition method, apparatus, device, and storage medium |
CN116740721A (zh) * | 2023-08-15 | 2023-09-12 | 深圳市玩瞳科技有限公司 | Finger-pointed sentence lookup method and apparatus, electronic device, and computer storage medium |
CN116740721B (zh) * | 2023-08-15 | 2023-11-17 | 深圳市玩瞳科技有限公司 | Finger-pointed sentence lookup method and apparatus, electronic device, and computer storage medium |
CN117809318A (zh) * | 2024-03-01 | 2024-04-02 | 微山同在电子信息科技有限公司 | Machine-vision-based oracle bone script recognition method and system |
CN117809318B (zh) * | 2024-03-01 | 2024-05-28 | 微山同在电子信息科技有限公司 | Machine-vision-based oracle bone script recognition method and system |
Also Published As
Publication number | Publication date |
---|---|
CN113498520A (zh) | 2021-10-12 |
CN113498520B (zh) | 2024-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021146937A1 (zh) | Text recognition method, text recognition device and storage medium | |
CN111210443B (zh) | Deformable convolution hybrid-task cascade semantic segmentation method based on embedding balance | |
CN108427924B (zh) | Text regression detection method based on rotation-sensitive features | |
WO2021073493A1 (zh) | Image processing method and device, neural network training method, image processing method of combined neural network model, construction method of combined neural network model, neural network processor, and storage medium | |
CN108830855B (zh) | Fully convolutional network semantic segmentation method based on multi-scale low-level feature fusion | |
WO2020200030A1 (zh) | Neural network training method, image processing method, image processing device, and storage medium | |
CN109241982B (zh) | Target detection method based on deep and shallow convolutional neural networks | |
WO2018145470A1 (zh) | Image detection method and device | |
CN107358260B (zh) | Multispectral image classification method based on surface-wave CNN | |
WO2020108009A1 (en) | Method, system, and computer-readable medium for improving quality of low-light images | |
CN109117846B (zh) | Image processing method and apparatus, electronic device, and computer-readable medium | |
WO2021146951A1 (zh) | Text detection method and apparatus, and storage medium | |
CN109948566B (zh) | Two-stream face anti-spoofing detection method based on weight fusion and feature selection | |
AU2020101435A4 (en) | A panoramic vision system based on the UAV platform | |
WO2020093782A1 (en) | Method, system, and computer-readable medium for improving quality of low-light images | |
CN110909724B (zh) | Thumbnail generation method for multi-target images | |
WO2020048359A1 (en) | Method, system, and computer-readable medium for improving quality of low-light images | |
CN110633640A (zh) | Recognition method optimizing PointNet for complex scenes | |
CN116385707A (zh) | Deep learning scene recognition method based on multi-scale features and feature enhancement | |
CN110517270A (zh) | Indoor scene semantic segmentation method based on a superpixel deep network | |
CN112348056A (zh) | Point cloud data classification method, apparatus, device, and readable storage medium | |
WO2022063321A1 (zh) | Image processing method and apparatus, device, and storage medium | |
CN114830168A (zh) | Image reconstruction method, electronic device, and computer-readable storage medium | |
CN115482529A (zh) | Close-range fruit image recognition method, device, storage medium, and apparatus | |
WO2019071476A1 (zh) | Express delivery information entry method and entry system based on an intelligent terminal | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20914951 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20914951 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.03.2023) |
|