WO2021017998A1 - Text position locating method and system, and model training method and system - Google Patents

Text position locating method and system, and model training method and system

Info

Publication number
WO2021017998A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
text box
candidate
box
level
Prior art date
Application number
PCT/CN2020/103799
Other languages
English (en)
French (fr)
Inventor
顾立新
韩锋
韩景涛
曾华荣
刘庆杰
Original Assignee
第四范式(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司
Publication of WO2021017998A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Definitions

  • the present disclosure generally relates to the field of artificial intelligence, and more specifically, to a method and system for locating a text position in an image, and a method and system for training a text position detection model.
  • The text in an image contains rich information, and the extraction of this information (i.e., text recognition) is of great significance to the understanding of the scene in the image.
  • Text recognition is divided into two steps: text detection (i.e., locating the text position) and text recognition (i.e., recognizing the content of the text). Both are indispensable, and text detection, as the prerequisite for text recognition, is particularly critical.
  • The text detection effect in complex or natural scenes is often poor due to the following difficulties: (1) different shooting angles may deform the text; (2) the text has multiple orientations, and there may be both horizontal text and rotated text; (3) text sizes vary and the degree of compactness differs, with both long and short text in the same image and layouts that may be tight or loose.
  • The present disclosure aims to at least solve the above difficulties in existing text detection methods, so as to improve the text position detection effect.
  • According to an exemplary embodiment of the present disclosure, a method for locating a text position in an image may include: obtaining a predicted image sample; and using a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample, where the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch.
  • The feature extraction layer is used to extract features of the predicted image sample to generate a feature map.
  • The candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map.
  • The cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map.
  • The mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map, and the final text box for locating the text position in the predicted image sample is determined according to the predicted mask information.
  • According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium storing instructions, wherein when the instructions are executed by at least one computing device, the at least one computing device is prompted to execute the above-described method of locating a text position in an image.
  • According to an exemplary embodiment of the present disclosure, there is provided a system including at least one computing device and at least one storage device storing instructions, wherein when the instructions are executed by the at least one computing device, the at least one computing device is prompted to execute the following steps: obtaining a predicted image sample; and using a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch. The feature extraction layer is used to extract features of the predicted image sample to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the predicted image sample.
  • According to an exemplary embodiment of the present disclosure, a system for locating a text position in an image may include: a predicted image sample acquisition device configured to acquire a predicted image sample; and a text position locating device configured to use a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample. The text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, where the feature extraction layer is used to extract features of the predicted image sample to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine the final text box according to the predicted mask information.
  • According to an exemplary embodiment of the present disclosure, a method for training a text position detection model may include: obtaining a training image sample set, wherein the text positions in the training image samples are marked with text boxes; and training a deep neural network-based text position detection model based on the training image sample set, where the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch. The feature extraction layer is used to extract features of the image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map, with the final text box for locating the text position determined according to the predicted mask information.
  • According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium storing instructions, wherein when the instructions are executed by at least one computing device, the at least one computing device is prompted to perform the above-described method of training a text position detection model.
  • According to an exemplary embodiment of the present disclosure, there is provided a system including at least one computing device and at least one storage device storing instructions, wherein when the instructions are executed by the at least one computing device, the at least one computing device is prompted to execute the following steps: obtaining a training image sample set, wherein the text positions in the training image samples are marked with text boxes; and training a deep neural network-based text position detection model based on the training image sample set, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch. The feature extraction layer is used to extract features of the image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes.
  • According to an exemplary embodiment of the present disclosure, a system for training a text position detection model may include: a training image sample set acquisition device configured to acquire a training image sample set, wherein the text positions in the training image samples are marked with text boxes; and a model training device configured to train a deep neural network-based text position detection model based on the training image sample set, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch. The feature extraction layer is used to extract features of the image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map.
  • The text position detection model according to an exemplary embodiment of the present disclosure includes cascaded multi-level text box branches. In addition, the method and system for training a text position detection model according to the exemplary embodiment apply size and/or rotation changes to the training sample set before training, redesign the anchor boxes, and add a difficult sample learning mechanism to the training process. Therefore, the trained text position detection model can provide a better text position detection effect.
  • The method and system for locating a text position in an image according to an exemplary embodiment of the present disclosure can improve text detection performance by using a text position detection model that includes cascaded multi-level text box branches. Moreover, the introduction of two levels of non-maximum value suppression operations can effectively prevent missed detections and overlapping text boxes, making it possible to locate not only horizontal text but also rotated text.
  • In addition, multi-scale transformation is performed on the acquired image to obtain predicted image samples of different sizes for the same image. Predicting on these samples and combining the text boxes determined for the different sizes can further improve the effect of text position detection in the image.
  • FIG. 1 is a block diagram showing a system for training a text position detection model according to an exemplary embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a text position detection model according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a flowchart showing a method of training a text detection model according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a block diagram showing a system for locating a text position in an image according to an exemplary embodiment of the present disclosure
  • FIG. 5 is a flowchart illustrating a method of locating a text position in an image according to an exemplary embodiment of the present disclosure.
  • In the present disclosure, "perform at least one of step one and step two" covers the following three parallel situations: (1) perform step one; (2) perform step two; (3) perform step one and step two. That is, "A and/or B" can also be expressed as "at least one of A and B", and "perform step one and/or step two" can also be expressed as "perform at least one of step one and step two".
  • FIG. 1 is a block diagram showing a system 100 for training a text position detection model according to an exemplary embodiment of the present disclosure (hereinafter, for convenience of description, simply referred to as the "model training system 100").
  • the model training system 100 may include a training image sample set acquisition device 110 and a model training device 120.
  • the training image sample set obtaining device 110 may obtain a training image sample set.
  • Here, the text positions in the training image samples of the training image sample set are marked with text boxes, that is, each text position in an image is marked with a text box.
  • the training image sample set obtaining device 110 may directly obtain training image sample sets generated by other devices from the outside, or the training image sample set obtaining device 110 may perform operations by itself to construct a training image sample set.
  • the training image sample set acquisition device 110 can acquire the training image sample set manually, semi-automatically or fully automatically, and process the acquired training image sample into an appropriate format or form.
  • For example, the training image sample set obtaining device 110 may receive the training image sample set manually imported by the user through an input device (for example, a workstation), or the training image sample set obtaining device 110 may obtain the training image sample set from a data source in a fully automatic manner, for example, by systematically requesting the data source to send the training image sample set to the training image sample set obtaining device 110 through a timer mechanism implemented by software, firmware, hardware, or a combination thereof. The acquisition of the training image sample set can also be performed automatically with manual intervention, for example, requesting the acquisition of the training image sample set when a specific user input is received.
  • the training image sample set obtaining device 110 may store the obtained sample set in a non-volatile memory (for example, a data warehouse).
  • the model training device 120 may train a deep neural network-based text position detection model based on the training image sample set.
  • the deep neural network may be a convolutional neural network, but is not limited to this.
  • Fig. 2 shows a schematic diagram of a text position detection model according to an exemplary embodiment of the present disclosure.
  • As shown in FIG. 2, the text position detection model may include a feature extraction layer 210, a candidate region recommendation layer 220, cascaded multi-level text box branches 230 (for ease of illustration, the multi-level text box branches are illustrated as including three levels of text box branches, but this is only an example; the cascaded multi-level text box branches are not limited to three levels), and a mask branch 240.
  • The feature extraction layer can be used to extract features of the image to generate a feature map; the candidate region recommendation layer can be used to determine a predetermined number of candidate text regions in the image based on the generated feature map; the cascaded multi-level text box branches can be used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map; and the mask branch can be used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map, with the final text box used to locate the text position in the image determined according to the predicted mask information.
  • the final text box may include a horizontal text box and/or a rotating text box.
  • the text detection model of the present disclosure can detect both horizontal text and rotated text.
  • In other words, the statement that the final text box in the present disclosure may include a horizontal text box and/or a rotating text box can also be expressed as: the final text box may include at least one of a horizontal text box and a rotating text box.
  • The text position detection model in FIG. 2 may be based on the Mask-RCNN framework. For example, the feature extraction layer may correspond to the deep residual network (for example, ResNet-101) in the Mask-RCNN framework, and the candidate region recommendation layer may correspond to the region proposal network (RPN) layer in the Mask-RCNN framework. Each level of text box branch in the cascaded multi-level text box branches can include the RoIAlign layer and the fully connected layer in the Mask-RCNN framework, and the mask branch includes a series of convolutional layers. Those skilled in the art are well aware of the functions and operations of the deep residual network, the RPN layer, the RoIAlign layer, and the fully connected layer in the Mask-RCNN framework, so they will not be described in detail here.
  • In contrast, the traditional Mask-RCNN framework includes only one text box branch, and after the RPN layer determines a predetermined number of candidate regions (for example, 2000), some candidate regions (for example, 512) are randomly sampled from them and sent to the text box branch and the mask branch respectively. This structure, together with the operation of randomly sampling candidate regions for the text box branch and the mask branch, leads to the poor text position detection effect of the traditional Mask-RCNN framework. This is because a single text box branch can only detect candidate regions whose overlap with the real text box label lies within a certain range, and random sampling is not conducive to the model's learning of difficult samples.
  • In addition to the training image sample set acquisition device 110 and the model training device 120, the model training system 100 may also include a preprocessing device (not shown).
  • Before the text position detection model is trained based on the training image sample set, the preprocessing device may perform size transformation and/or perspective transformation on the training image samples in the training image sample set to obtain a transformed training image sample set, so that the training image samples are closer to the real scene.
  • the preprocessing device may perform random size transformation on the training image sample without maintaining the original aspect ratio of the training image sample so that the width and height of the training image sample are within a predetermined range.
  • the reason why the original aspect ratio of the training image samples is not maintained is to simulate compression and stretching in a real scene.
  • the width and height of the training image sample may be randomly transformed to between 640 and 2560 pixels, but the predetermined range is not limited to this.
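  • A minimal sketch of such a random size transformation, assuming OpenCV and NumPy and a hypothetical (x1, y1, x2, y2) convention for the marked text boxes:

```python
import cv2
import numpy as np

rng = np.random.default_rng()

def random_resize(image, boxes, lo=640, hi=2560):
    """Randomly resize an image so that width and height each fall in
    [lo, hi], deliberately ignoring the original aspect ratio in order
    to simulate compression and stretching in real scenes."""
    h, w = image.shape[:2]
    new_w = int(rng.integers(lo, hi + 1))
    new_h = int(rng.integers(lo, hi + 1))
    resized = cv2.resize(image, (new_w, new_h))
    # Rescale the marked text boxes, assumed here to be (x1, y1, x2, y2) rows.
    scale = np.array([new_w / w, new_h / h, new_w / w, new_h / h], dtype=float)
    return resized, np.asarray(boxes, dtype=float) * scale
```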
  • In addition, performing perspective transformation on a training image sample may include randomly rotating the coordinates of the pixels in the training image sample around the x-axis, the y-axis, and the z-axis, respectively.
  • For example, each pixel in the training image sample can be randomly rotated around the x-axis by an angle in (-45°, 45°), around the y-axis by an angle in (-45°, 45°), and around the z-axis by an angle in (-30°, 30°); after such enhancement, the training image samples will be more in line with the real scene.
  • Correspondingly, the coordinates of the text box can be transformed by the composition of these three rotations, where θx is the random rotation angle around the x-axis in (-45°, 45°), θy is the random rotation angle around the y-axis in (-45°, 45°), and θz is the random rotation angle around the z-axis in (-30°, 30°). For the coordinate (x, y, z) before transformation, usually the value of z is 1. A sketch of one plausible form of this transformation follows.
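  • A minimal NumPy sketch of this coordinate transformation under one plausible reading: rotate the homogeneous coordinates (x, y, 1) around the three axes and project back onto the z = 1 plane. The rotation order, the perspective division, and the centering convention are illustrative assumptions, not taken verbatim from the patent:

```python
import numpy as np

def transform_box_coords(points, theta_x, theta_y, theta_z):
    """Rotate 2D points, taken as homogeneous coordinates (x, y, z) with
    z = 1, around the x, y and z axes, then project back onto the image
    plane by perspective division. Angles are in radians."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    pts = np.column_stack([points, np.ones(len(points))])  # (N, 3), z = 1
    rotated = pts @ (rot_z @ rot_y @ rot_x).T
    # Perspective division back to the z = 1 plane.
    return rotated[:, :2] / rotated[:, 2:3]

# Random angles as described: x and y in (-45, 45) degrees, z in (-30, 30).
rng = np.random.default_rng(0)
theta_x, theta_y = np.radians(rng.uniform(-45, 45, size=2))
theta_z = float(np.radians(rng.uniform(-30, 30)))
# Pixel coordinates would typically be centered and scaled (for example by
# an assumed focal length) beforehand so that the rotated z stays positive.
corners = np.array([[-0.2, -0.1], [0.2, -0.1], [0.2, 0.1], [-0.2, 0.1]])
print(transform_box_coords(corners, theta_x, theta_y, theta_z))
```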
  • Here, performing size transformation and/or perspective transformation on the training image samples in the training image sample set to obtain the transformed training image sample set can also be expressed as: performing at least one of size transformation and perspective transformation on the training image samples in the training image sample set to obtain a transformed training image sample set.
  • After the transformed training image sample set is obtained, the model training device 120 may train the text position detection model based on it. Specifically, the model training device 120 may perform the following operations to train the model: input the transformed training image samples into the text position detection model; use the feature extraction layer to extract features of the input training image sample to generate a feature map; use the candidate region recommendation layer to determine a predetermined number of candidate text regions in the input training image sample based on the generated feature map; use the cascaded multi-level text box branches to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and calculate the text box prediction loss corresponding to each candidate text region according to the prediction results of the text box branches and the text box labels; sort the predetermined number of candidate text regions according to their corresponding text box prediction losses, and filter out the first specific number of candidate text regions with the largest text box prediction loss according to the sorting result; and use the mask branch to predict the mask information of the text in the selected candidate text regions based on the corresponding features in the feature map, and calculate the mask prediction loss.
  • the characteristics of the image may include the correlation of pixels in the image, but it is not limited thereto.
  • For example, the model training device 120 can use the feature extraction layer to extract the correlation of pixels in the training image sample to generate a feature map. Subsequently, the model training device 120 may use the candidate region recommendation layer to predict the difference between candidate text regions and preset anchor boxes based on the generated feature map, determine initial candidate text regions based on the difference and the anchor boxes, and use a non-maximum value suppression operation to filter out the predetermined number of candidate text regions from the initial candidate text regions.
  • the present disclosure uses a non-maximum value suppression operation to filter the initial candidate text regions.
  • Here, the non-maximum value suppression operation will be briefly described. Specifically, starting from the initial candidate text region with the smallest difference from the anchor box, it is judged whether the overlap between each other initial candidate text region and this region is greater than a set threshold; any initial candidate text region whose overlap is greater than the threshold is removed, that is, only the initial candidate text regions whose overlap is less than the threshold are retained. Then, among all the retained initial candidate text regions, the one with the smallest difference from the anchor box is selected, and the degree of overlap between it and the other retained regions is again computed; regions whose overlap is greater than the threshold are deleted and the rest are kept. This continues until the predetermined number of candidate text regions has been filtered out.
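  • For reference, a minimal NumPy sketch of such a greedy non-maximum value suppression operation. It ranks by a generic score and assumes an (x1, y1, x2, y2) box format; in the RPN case the score would reflect how small the difference from the anchor box is:

```python
import numpy as np

def iou(box, boxes):
    """Overlap (intersection over union) between one box and many,
    all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, overlap_threshold, max_keep):
    """Greedy NMS: repeatedly keep the best-scoring box and drop all
    remaining boxes whose overlap with it exceeds the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size and len(keep) < max_keep:
        best, order = order[0], order[1:]
        keep.append(best)
        order = order[iou(boxes[best], boxes[order]) <= overlap_threshold]
    return np.array(keep, dtype=int)
```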
  • Here, a preset anchor box is one of the possible text boxes preset over the image for matching against the real text boxes.
  • The anchor aspect ratio set of the traditional Mask-RCNN-based model is fixed to [0.5, 1, 2]; that is to say, there are only three anchor aspect ratios: 0.5, 1, and 2. Anchors with these three aspect ratios can basically cover the targets in some general object detection data sets (for example, the COCO data set), but they are far from enough to cover the text in text scenes. This is because the range of aspect ratios in text scenes is very large; 1:5 or 5:1 text is very common, and if the traditional Mask-RCNN has only three anchor boxes with fixed aspect ratios, the anchor boxes will not match the real text boxes, resulting in missed text detections.
  • In view of this, the model training device 120 may also collect statistics on the aspect ratios of all text boxes marked in the transformed training image sample set before training the text position detection model, and set the aspect ratio set of the anchor boxes according to these statistics. That is to say, the present disclosure can redesign the aspect ratios of the anchor boxes.
  • Specifically, the aspect ratios of all the text boxes marked in the transformed training image sample set can be sorted, the upper limit and lower limit of the anchor box aspect ratio can be determined according to the sorted aspect ratios, values can be interpolated in equal proportions between the upper limit and the lower limit, and the set consisting of the upper limit, the lower limit, and the interpolated values can be taken as the aspect ratio set of the anchor boxes. For example, the aspect ratios of all text boxes can be sorted from smallest to largest, and the aspect ratio at the 5% position and the aspect ratio at the 95% position can be determined as the lower limit and upper limit, respectively. However, the above method of determining the aspect ratio set of the anchor boxes is only an example; the method of selecting the upper and lower limit values, and the method and frequency of interpolation, are not limited to this example.
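  • A minimal NumPy sketch of this redesign; the 5%/95% percentile bounds and the number of interpolated values are the illustrative choices mentioned above:

```python
import numpy as np

def anchor_aspect_ratios(widths, heights, lo_pct=5, hi_pct=95, n_interp=3):
    """Build the anchor aspect ratio set from the labeled text boxes:
    sort the observed ratios, take percentile bounds as the lower and
    upper limits, and interpolate in equal proportions (geometrically)
    between them."""
    ratios = np.sort(np.asarray(widths, float) / np.asarray(heights, float))
    lower = np.percentile(ratios, lo_pct)
    upper = np.percentile(ratios, hi_pct)
    # geomspace gives equal-ratio spacing and includes both endpoints.
    return np.geomspace(lower, upper, num=n_interp + 2)

# Example: text-heavy data with ratios from tall 1:5 to wide 6:1 boxes.
print(anchor_aspect_ratios([10, 50, 100, 300], [50, 50, 50, 50]))
```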
  • After the predetermined number of candidate text regions are determined, the model training device 120 can use the cascaded multi-level text box branches to predict, based on the features corresponding to each candidate text region in the feature map, the position deviation between each candidate text region and the text box label, as well as the confidence that each candidate text region includes text and the confidence that it does not include text, and calculate the text box prediction loss corresponding to each candidate text region according to the predicted position deviation and confidence.
  • the cascaded multi-level text box branch may be a three-level text box branch, but it is not limited thereto.
  • In order to learn difficult samples better, the present disclosure proposes a difficult sample learning mechanism: the predetermined number of candidate text regions are sorted according to their corresponding text box prediction losses, the first specific number of candidate text regions with the largest text box prediction loss are selected according to the sorting result, and the selected candidate text regions are input into the mask branch for mask information prediction.
  • For example, 512 candidate text regions with the largest text box prediction loss can be selected from 2000 candidate regions according to the text box prediction loss.
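  • A minimal NumPy sketch of this selection; 2000 and 512 are the example numbers given above:

```python
import numpy as np

def select_hard_examples(box_losses, keep=512):
    """Difficult sample learning: sort candidate text regions by their
    text box prediction loss and keep only the ones with the largest
    loss for the mask branch."""
    order = np.argsort(box_losses)[::-1]  # largest loss first
    return order[:keep]

losses = np.random.default_rng(0).random(2000)  # one loss per candidate
hard_idx = select_hard_examples(losses)          # indices of 512 candidates
```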
  • As described above, the model training device 120 may calculate the text box prediction loss corresponding to each candidate text region according to the position deviation and confidence predicted by the text box branches.
  • Specifically, the model training device 120 can calculate the text box prediction loss of each level of text box branch according to the prediction result of that level and the real text box label, and determine the text box prediction loss corresponding to each candidate text region by summing the text box prediction losses of all levels of text box branches.
  • the text box prediction loss includes the confidence prediction loss and the position deviation prediction loss corresponding to each candidate text area.
  • Here, the overlap thresholds set for the respective levels of text box branches for calculating their text box prediction losses are different from each other, and the overlap threshold set for a preceding level of text box branch is smaller than that set for the next level. The overlap threshold here is the threshold on the overlap between the horizontal text box predicted by each level of text box branch and the text box label.
  • For example, the degree of overlap (intersection over union, IoU) can be the value obtained by dividing the area of the intersection of two text boxes by the area of their union.
  • the overlap thresholds set for the first-level text box branch to the third-level text box branch may be 0.5, 0.6, and 0.7, respectively.
  • With a threshold of 0.5, a candidate text region whose overlap with the real text box label is greater than 0.5 is determined to be a positive sample for the first-level text box branch, and one with overlap less than 0.5 is determined to be a negative sample. Under this threshold there will be more false detections, because a 0.5 threshold admits more background into the positive samples, which is the cause of more text position false detections. If an overlap threshold of 0.7 is used instead, false detections can be reduced, but the detection effect is not necessarily the best: the higher the overlap threshold, the fewer the positive samples, and the greater the risk of overfitting.
  • Therefore, the present disclosure adopts cascaded multi-level text box branches, in which the overlap thresholds set for the respective levels for calculating their text box prediction losses differ from each other and the threshold set for the previous level is smaller than that set for the next level, so that each level of text box branch can focus on detecting candidate text regions whose overlap with the real text box label is within a certain range, and the text detection effect becomes better and better.
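  • A minimal sketch of this per-level labeling rule, reusing the iou helper from the NMS sketch above; the 0.5/0.6/0.7 thresholds are the example values given earlier:

```python
import numpy as np

STAGE_THRESHOLDS = (0.5, 0.6, 0.7)  # first- to third-level text box branch

def label_for_stage(candidate_boxes, gt_boxes, stage):
    """Mark each candidate text region positive or negative for one level
    of the cascade: positive if its best overlap with any real text box
    label reaches that level's threshold."""
    thr = STAGE_THRESHOLDS[stage]
    best = np.array([iou(box, gt_boxes).max() for box in candidate_boxes])
    return best >= thr  # True = positive sample for this level
```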
  • After the first specific number of candidate text regions with the largest text box prediction loss are selected, the model training device 120 can use the mask branch to predict the mask information of the text in the selected candidate text regions based on the features corresponding to those regions in the feature map. Specifically, the mask of a pixel predicted to be text can be set to 1, and the mask of a pixel predicted not to be text can be set to 0. The mask prediction loss is then calculated by comparing the predicted mask information with the real mask information of the text.
  • the model training device 120 may use the correlation between pixels in the selected candidate text region to predict the mask information.
  • the mask values of the pixels in the text box mark are all 1, and this is taken as the real mask information.
  • Finally, the model training device 120 can train the text position detection model by continuously feeding it training image samples until the sum of all text box prediction losses and the mask prediction loss is minimized, thereby completing the training of the text position detection model.
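  • A hypothetical PyTorch-style training step illustrating this objective; the model interface returning the per-level box losses and the mask loss is an assumption for illustration, not the patent's API:

```python
def train_step(model, optimizer, images, box_labels, mask_labels):
    """One optimization step: minimize the sum of the text box prediction
    losses of all cascade levels plus the mask prediction loss."""
    outputs = model(images, box_labels, mask_labels)
    loss = sum(outputs["box_losses"]) + outputs["mask_loss"]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```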
  • The model training system and the text position detection model according to the exemplary embodiment of the present disclosure have been described above with reference to FIGS. 1 and 2. Since the text position detection model of the present disclosure includes cascaded multi-level text box branches, the training sample set undergoes size and/or rotation changes before training, the anchor boxes are redesigned, and a difficult sample learning mechanism is added in the training process, the trained text position detection model can provide a better text position detection effect.
  • In addition, performing size and/or rotation changes on the training sample set before training in the present disclosure can also be expressed as: performing at least one of a size change and a rotation change on the training sample set before training.
  • Although the model training system 100 is described above as being divided into devices that perform the corresponding processing (for example, the training image sample set acquisition device 110 and the model training device 120), it is clear to those skilled in the art that this processing can also be performed without any specific device division or clear demarcation between the devices. In addition, the model training system 100 described above with reference to FIG. 1 is not limited to the devices described; other devices (for example, storage devices, data processing devices, etc.) may be added as needed, or the above devices may be combined.
  • FIG. 3 is a flowchart illustrating a method of training a text position detection model according to an exemplary embodiment of the present disclosure (hereinafter, for convenience of description, it is simply referred to as "model training method").
  • The model training method shown in FIG. 3 can be executed by the model training system 100 shown in FIG. 1, can be implemented entirely in software through computer programs or instructions, or can be executed by a specifically configured computing system or computing device.
  • For example, it may be executed by a system including at least one computing device and at least one storage device storing instructions, wherein when the instructions are executed by the at least one computing device, the at least one computing device is prompted to execute the model training method.
  • For convenience of description, it is assumed below that the model training method shown in FIG. 3 is executed by the model training system 100 shown in FIG. 1, and that the model training system 100 has the configuration shown in FIG. 1.
  • In step S310, the training image sample set obtaining device 110 may obtain a training image sample set, wherein the text positions in the training image samples are marked with text boxes.
  • In step S320, the model training device 120 may train a deep neural network-based text position detection model based on the training image sample set.
  • the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and mask branches.
  • the feature extraction layer is used to extract features of the image to generate a feature map
  • the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map
  • The cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the image.
  • As described above, the text position detection model can be based on the Mask-RCNN framework: the feature extraction layer corresponds to the deep residual network in the Mask-RCNN framework, the candidate region recommendation layer corresponds to the region proposal network (RPN) layer in the Mask-RCNN framework, each level of text box branch in the cascaded multi-level text box branches includes the RoIAlign layer and the fully connected layer in the Mask-RCNN framework, and the mask branch includes a series of convolutional layers.
  • the characteristics of the image may include the correlation of pixels in the image, but are not limited thereto.
  • the final text box may include a horizontal text box and/or a rotating text box.
  • In addition, the model training method may further include a step (not shown) of transforming the acquired training image sample set between step S310 and step S320. In this step, the training image samples in the training image sample set may be subjected to size transformation and/or perspective transformation to obtain a transformed training image sample set.
  • Specifically, in step S320, the model training device 120 may perform the following operations to train the text position detection model: input the transformed training image samples into the text position detection model; use the feature extraction layer to extract features of the input training image sample to generate a feature map; use the candidate region recommendation layer to determine a predetermined number of candidate text regions in the input training image sample based on the generated feature map; use the cascaded multi-level text box branches to predict, based on the features corresponding to each candidate text region in the feature map, the position deviation between each candidate text region and the text box label as well as the confidence that each candidate text region includes and does not include text, and calculate the text box prediction loss corresponding to each candidate text region according to the predicted position deviation and confidence; sort the predetermined number of candidate text regions according to their corresponding text box prediction losses, and filter out the first specific number of candidate text regions with the largest text box prediction loss according to the sorting result; and use the mask branch to predict the mask information of the text in the selected candidate text regions based on the corresponding features in the feature map, and calculate the mask prediction loss.
  • For example, the model training device 120 may use the candidate region recommendation layer to predict the difference between candidate text regions and the preset anchor boxes based on the generated feature map, determine initial candidate text regions according to the difference and the anchor boxes, and filter out the predetermined number of candidate text regions from the initial candidate text regions by using a non-maximum value suppression operation.
  • the model training method shown in FIG. 3 may further include a step (not shown) of setting an anchor point frame.
  • Specifically, this step may include: before training the text position detection model, collecting statistics on the aspect ratios of all the text boxes marked in the transformed training image sample set, and setting the aspect ratio set of the anchor boxes according to these statistics.
  • In addition, this step may also include: setting the size of the anchor boxes according to the statistics of the text box sizes, or setting the size of the anchor boxes to some fixed sizes, for example, 16×16, 32×32, 64×64, 128×128, and 256×256.
  • The present disclosure does not limit the size of the anchor boxes or the method of setting their size, because for text position detection, the setting of the anchor box aspect ratio generally has a greater effect on the detection result.
  • According to an exemplary embodiment, the aspect ratio set of the anchor boxes can be set by the following operations: sorting the aspect ratios of all the text boxes in the statistics; determining the upper limit and lower limit of the anchor box aspect ratio according to the sorted aspect ratios; interpolating in equal proportions between the upper limit and the lower limit; and taking the set consisting of the upper limit, the lower limit, and the interpolated values as the aspect ratio set of the anchor boxes.
  • the cascaded multi-level text box branch may be a three-level text box branch, but is not limited thereto.
  • For the calculation of the text box prediction loss corresponding to each candidate text region according to the predicted position deviation and confidence, and for the overlap thresholds set for each level of text box branch for calculating the text box prediction loss of that level, reference may be made to the corresponding description of FIG. 1, which will not be repeated here.
  • In fact, since the model training method shown in FIG. 3 is executed by the model training system 100 described with reference to FIG. 1, the content mentioned above when describing the devices included in the model training system applies here as well. Therefore, for the relevant details involved in the above steps, reference may be made to the corresponding description of FIG. 1, which will not be repeated here.
  • According to the above model training method, the text position detection model includes cascaded multi-level text box branches, the training sample set undergoes size and/or rotation changes before training, the anchor boxes are redesigned, and a difficult sample learning mechanism is added in the training process. Therefore, the text position detection model trained by this method can provide a better text position detection effect.
  • FIG. 4 is a block diagram showing a system 400 for locating a text position in an image according to an exemplary embodiment of the present disclosure (hereinafter, for convenience of description, simply referred to as the "text positioning system 400").
  • the text positioning system 400 may include a predictive image sample acquisition device 410 and a text position positioning device 420.
  • The predicted image sample obtaining device 410 may be configured to obtain predicted image samples, and the text position locating device 420 may be configured to use a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample.
  • Here, the text position detection model may include a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch. The feature extraction layer is used to extract features of the predicted image sample to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the predicted image sample.
  • Here, the features of the predicted image sample may include the correlation of pixels in the predicted image sample, but are not limited to this.
  • In addition, the text position detection model can be based on the Mask-RCNN framework, where the feature extraction layer corresponds to the deep residual network in the Mask-RCNN framework, the candidate region recommendation layer corresponds to the region proposal network (RPN) layer in the Mask-RCNN framework, each level of text box branch in the cascaded multi-level text box branches includes the RoIAlign layer and the fully connected layer in the Mask-RCNN framework, and the mask branch may include a series of convolutional layers.
  • the predicted image sample obtaining device 410 may first obtain an image, and then perform multi-scale scaling on the obtained image to obtain a plurality of predicted image samples of different sizes corresponding to the image.
  • In this case, the text position locating device 420 may use the pre-trained text position detection model on each of the multiple predicted image samples of different sizes to determine the text boxes for locating the text position in each predicted image sample, and then merge the text boxes determined for the predicted image samples of different sizes to obtain the final result.
  • the image can be derived from any data source, and the present disclosure does not limit the source of the image, the specific method of obtaining the image, and so on.
  • Specifically, the text position locating device 420 can determine the final text box for locating the text position in a predicted image sample by performing the following operations: use the feature extraction layer to extract features of the predicted image sample to generate a feature map; use the candidate region recommendation layer to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map; use the cascaded multi-level text box branches to predict initial candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and filter the initial candidate horizontal text boxes through a first non-maximum value suppression operation to select the horizontal text boxes whose coincidence degree is less than a first coincidence degree threshold as candidate horizontal text boxes; use the mask branch to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map; determine primary selected text boxes according to the predicted mask information of the text; and filter out, from the determined primary selected text boxes through a second non-maximum value suppression operation, the text boxes whose coincidence degree is less than a second coincidence degree threshold as the final text box.
  • As described above, the text position locating device 420 may merge the text boxes determined for the predicted image samples of different sizes. Specifically, for a predicted image sample of a first size, after using the text position detection model to determine the text boxes for locating the text position in it, the text position locating device 420 may select from those text boxes the first text boxes whose size is greater than a first threshold; and for a predicted image sample of a second size, after using the text position detection model to determine the text boxes for locating the text position in it, the text position locating device 420 may select from those text boxes the second text boxes whose size is smaller than a second threshold, where the first size is smaller than the second size. That is to say, for the smaller predicted image sample, the text position locating device 420 retains the relatively large text boxes and filters out the relatively small ones (which can be set through the above first threshold), while for the larger predicted image sample, it retains the relatively small text boxes and filters out the relatively large ones (which can be set through the above second threshold). Next, the text position locating device 420 may merge the filtered results. Specifically, it can use a third non-maximum value suppression operation to filter the selected first text boxes and second text boxes to obtain the final text box for locating the text position in the image.
  • For example, the text position locating device 420 may rank all the selected first text boxes and second text boxes according to their confidence levels, select the text box with the highest confidence, and then calculate the degree of overlap between each remaining text box and that text box, deleting it if the degree of overlap is greater than the threshold and keeping it otherwise; the finally retained text boxes are the final text boxes that locate the text positions in the image.
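  • A minimal sketch of this merging step, reusing the nms helper from the sketch above. Using box area as the "size" measure and the specific thresholds are illustrative assumptions, and boxes from both scales are assumed to have already been mapped back to the original image coordinates:

```python
import numpy as np

def merge_two_scales(boxes_s, scores_s, boxes_l, scores_l,
                     first_threshold, second_threshold, overlap_threshold=0.5):
    """Keep the relatively large boxes found on the smaller predicted image
    sample and the relatively small boxes found on the larger one, then
    fuse the two sets with a third non-maximum value suppression."""
    def box_area(b):
        return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    keep_s = box_area(boxes_s) > first_threshold    # large boxes, small scale
    keep_l = box_area(boxes_l) < second_threshold   # small boxes, large scale
    boxes = np.vstack([boxes_s[keep_s], boxes_l[keep_l]])
    scores = np.concatenate([scores_s[keep_s], scores_l[keep_l]])
    keep = nms(boxes, scores, overlap_threshold, max_keep=len(boxes))
    return boxes[keep], scores[keep]
```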
  • First, the text positioning device 420 can use the feature extraction layer to extract features of the predicted image sample to generate a feature map. For example, the deep residual network (for example, ResNet-101) in the Mask-RCNN framework can be used to extract the correlation between pixels of the predicted image sample as features to generate a feature map.
  • the present disclosure does not have any restrictions on the features of the predicted image samples used and the specific feature extraction methods.
  • Next, the text position locating device 420 can use the candidate region recommendation layer to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map. For example, the text position locating device 420 can use the candidate region recommendation layer to predict the difference between candidate text regions and the preset anchor boxes based on the generated feature map, determine initial candidate text regions based on the difference and the anchor boxes, and use a fourth non-maximum value suppression operation to filter out the predetermined number of candidate text regions from the initial candidate text regions.
  • Here, the aspect ratios of the anchor boxes may be determined, as described above, by collecting statistics on the aspect ratios of the marked text boxes in the training image sample set during the training phase of the text position detection model. The specific details of using the non-maximum value suppression operation to filter out the predetermined number of candidate text regions from the initial candidate text regions have been mentioned in the description with reference to FIG. 1 and therefore will not be repeated here.
  • Subsequently, the text position locating device 420 can use the cascaded multi-level text box branches to predict initial candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and perform the first non-maximum value suppression operation to select, from the initial candidate horizontal text boxes, the horizontal text boxes whose text box coincidence degree is less than the first coincidence degree threshold as candidate horizontal text boxes.
  • As described above, the cascaded multi-level text box branches may be three-level text box branches. Hereinafter, three-level text box branches are taken as an example to describe how the cascaded multi-level text box branches predict the initial candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map.
  • Specifically, the text position locating device 420 may first use the first-level text box branch to extract the features corresponding to each candidate text region from the feature map, predict the position deviation of each candidate text region from the real text region as well as the confidence that each candidate text region includes text and the confidence that it does not include text, and determine the first-level horizontal text boxes according to the prediction result of the first-level text box branch.
  • For example, the text position locating device 420 may use the RoIAlign layer in the first-level text box branch to extract the features corresponding to each candidate text region from the feature map, and use the fully connected layer in the first-level text box branch to predict the position deviation between each candidate text region and the real text region and the confidence that each candidate text region includes or does not include text. Then, the text position locating device 420 can remove the candidate text regions with lower confidence according to the predicted confidence, and determine the first-level horizontal text boxes based on the retained candidate text regions and their position deviations from the real text regions.
  • Next, the text position locating device 420 can use the second-level text box branch to extract the features corresponding to the first-level horizontal text boxes from the feature map, predict the position deviation of each first-level horizontal text box from the real text region as well as the confidence that it includes and does not include text, and determine the second-level horizontal text boxes according to the prediction result of the second-level text box branch.
  • For example, the text position locating device 420 can use the RoIAlign layer in the second-level text box branch to extract the features corresponding to the first-level horizontal text boxes from the feature map (that is, the features corresponding to the pixel areas of the first-level horizontal text boxes), and use the fully connected layer in the second-level text box branch to predict the position deviation of each first-level horizontal text box from the real text region and the confidence that it includes and does not include text. Then, the text position locating device 420 may remove the first-level horizontal text boxes with lower confidence according to the predicted confidence, and determine the second-level horizontal text boxes based on the retained first-level horizontal text boxes and their position deviations from the real text regions.
  • Finally, the text position locating device 420 can use the third-level text box branch to extract the features corresponding to the second-level horizontal text boxes from the feature map, predict the position deviation of each second-level horizontal text box from the real text region as well as the confidence that it includes and does not include text, and determine the initial candidate horizontal text boxes according to the prediction result of the third-level text box branch.
  • For example, the text position locating device 420 can use the RoIAlign layer in the third-level text box branch to extract the features corresponding to the second-level horizontal text boxes from the feature map (that is, the features corresponding to the pixel areas of the second-level horizontal text boxes), and use the fully connected layer in the third-level text box branch to predict the position deviation of each second-level horizontal text box from the real text region and the confidence that it includes and does not include text. Then, the text position locating device 420 can remove the second-level horizontal text boxes with lower confidence according to the predicted confidence, and determine the initial candidate horizontal text boxes based on the retained second-level horizontal text boxes and their position deviations from the real text regions.
  • After the initial candidate horizontal text boxes are determined, the text position locating device 420 can filter them through the first non-maximum value suppression operation to select the horizontal text boxes whose text box coincidence degree is less than the first coincidence degree threshold as candidate horizontal text boxes.
  • Specifically, the text position locating device 420 may first select the initial candidate horizontal text box with the highest confidence according to the confidences of the initial candidate horizontal text boxes, and then calculate the text box coincidence degree between each remaining initial candidate horizontal text box and the selected one; a box is kept if its coincidence degree is less than the first coincidence degree threshold and deleted otherwise. All retained horizontal text boxes are passed to the mask branch as candidate horizontal text boxes.
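  • The non-maximum suppression step just described could be sketched as the plain IoU-based procedure below, where the IoU computed by iou() plays the role of the "text box coincidence degree". This is a generic sketch rather than code from the disclosure.

    import numpy as np

    def iou(box, boxes):
        # Intersection-over-union ("coincidence degree") between one box
        # and an array of boxes, all in (x1, y1, x2, y2) form.
        x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        return inter / (area(box) + area(boxes) - inter)

    def nms(boxes, scores, overlap_thresh):
        # Keep the highest-confidence box, delete boxes whose coincidence
        # degree with it is not below the threshold, and repeat on the rest.
        order, keep = np.argsort(scores)[::-1], []
        while order.size:
            best, rest = order[0], order[1:]
            keep.append(best)
            order = rest[iou(boxes[best], boxes[rest]) < overlap_thresh]
        return keep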
  • Next, the text position locating device 420 may use the mask branch to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map. Specifically, for example, the text position locating device 420 may predict the mask information of the text in a candidate horizontal text box based on the pixel correlation features corresponding to the pixels in the candidate horizontal text box in the feature map. Subsequently, the text position locating device 420 may determine the primary selected text boxes according to the predicted mask information of the text; specifically, for example, it may use a minimum-circumscribed-rectangle function to determine the minimum circumscribed rectangle containing the text according to the predicted mask information, and use the determined minimum circumscribed rectangle as the primary selected text box.
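  • OpenCV's cv2.minAreaRect is one such minimum-circumscribed-rectangle function. As a sketch under that assumption (the 0.5 mask threshold is likewise illustrative), the predicted mask could be turned into a possibly rotated primary selected text box as follows:

    import cv2
    import numpy as np

    def mask_to_box(mask, thresh=0.5):
        # Turn a predicted text mask (H x W, values in [0, 1]) into the
        # minimum circumscribed rectangle enclosing the text pixels.
        ys, xs = np.nonzero(mask > thresh)        # pixels predicted as text
        if len(xs) == 0:
            return None                           # no text in this candidate box
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(points)            # ((cx, cy), (w, h), angle)
        return cv2.boxPoints(rect)                # 4 corners; may be rotated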
  • After the primary selected text boxes are determined, the text position locating device 420 may filter them through the second non-maximum suppression operation, selecting the text boxes whose coincidence degree is less than the second coincidence degree threshold as the final text box. Specifically, for example, the text position locating device 420 may first select the primary selected text box with the highest confidence, and then calculate the text box coincidence degree between each remaining primary selected text box and the selected one; a box is kept if its coincidence degree is less than the second coincidence degree threshold and deleted otherwise.
  • It should be noted that the first coincidence degree threshold mentioned above is greater than the second coincidence degree threshold. The traditional Mask-RCNN framework has only one level of non-maximum suppression, with the coincidence threshold fixed at 0.5; that is, horizontal text boxes with a coincidence degree higher than 0.5 are deleted during screening. For densely arranged text with large rotation angles, however, a coincidence threshold of 0.5 causes some text boxes to be missed, while raising the threshold (for example, setting it to 0.8, so that only text boxes with a coincidence degree higher than 0.8 are deleted) makes the finally predicted horizontal text boxes overlap heavily. To address this, the present disclosure proposes the concept of two-level non-maximum suppression: after the cascaded multi-level text box branches predict the initial candidate horizontal text boxes, the first non-maximum suppression operation selects from them the horizontal text boxes whose coincidence degree is less than the first coincidence degree threshold as the candidate horizontal text boxes; then, after the mask branch predicts the mask information of the text and the primary selected text boxes are determined from it, the second non-maximum suppression operation selects from the primary selected text boxes the text boxes whose coincidence degree is less than the second coincidence degree threshold as the final text box. By making the first coincidence degree threshold greater than the second (for example, 0.8 and 0.2 respectively), the first non-maximum suppression operation coarsely screens the text boxes determined by the cascaded multi-level text box branches, and the second non-maximum suppression operation finely screens the text boxes determined by the mask branch. Through this two-level non-maximum suppression and the adjustment of the coincidence degree thresholds it uses, not only horizontal text but also rotated text can be located.
  • the text positioning system 400 shown in FIG. 4 may further include a display device (not shown).
  • The display device can display, on the image, the final text box used to locate the text position in the image, so that the user can intuitively determine the text position.
  • Here, the final text box may include a horizontal text box and/or a rotated text box.
  • The text positioning system according to the exemplary embodiment can improve text detection performance by using a text position detection model that includes cascaded multi-level text box branches, and, owing to the introduction of the two-level non-maximum suppression operation, can effectively prevent missed detections and overlapping text boxes, so that not only horizontal text but also rotated text can be located.
  • In addition, by performing multi-scale transformation on the acquired image, making predictions on predicted image samples of different sizes of the same image, and merging the text boxes determined for the predicted image samples of different sizes, the text position detection effect can be further improved, so that a good detection effect is provided even when text of different sizes exists in the image at the same time.
  • It should be noted that, although the text positioning system 400 is described above as being divided into devices for performing the corresponding processing (for example, the predicted image sample acquisition device 410 and the text position locating device 420), it is clear to those skilled in the art that the processing performed by the above devices can also be performed without any specific device division in the text positioning system 400, or without clear demarcation between the devices.
  • In addition, the text positioning system 400 described above with reference to FIG. 4 is not limited to including the predicted image sample acquisition device 410, the text position locating device 420, and the display device described above; other devices (for example, a storage device, a data processing device, etc.) may be added as needed, or the above devices may be combined.
  • the model training system 100 and the text positioning system 400 described above with reference to FIG. 1 may also be combined into one system, or they may be independent systems, and the present disclosure is not limited thereto.
  • FIG. 5 is a flowchart showing a method for locating a text position in an image according to an exemplary embodiment of the present disclosure (hereinafter, for convenience of description, it is simply referred to as a "text positioning method").
  • Here, as an example, the text locating method shown in FIG. 5 may be executed by the text positioning system 400 shown in FIG. 4, may be implemented entirely in software through computer programs or instructions, or may be executed by a specifically configured computing system or computing device, for example, a system including at least one computing device and at least one storage device storing instructions, wherein, when the instructions are executed by the at least one computing device, the at least one computing device is caused to execute the above text positioning method.
  • For convenience of description, it is assumed that the text positioning method shown in FIG. 5 is executed by the text positioning system 400 shown in FIG. 4, and that the text positioning system 400 has the configuration shown in FIG. 4.
  • Referring to FIG. 5, in step S510, the predicted image sample acquisition device 410 may acquire a predicted image sample.
  • the predicted image sample obtaining device 410 may first obtain an image, and then perform multi-scale scaling on the obtained image to obtain multiple predicted image samples of different sizes corresponding to the image.
  • Next, in step S520, the text position locating device 420 may use a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample.
  • Here, the text position detection model may include a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch. Specifically, the feature extraction layer may be used to extract the features of the predicted image sample to generate a feature map; the candidate region recommendation layer may be used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map; the cascaded multi-level text box branches may be used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map; and the mask branch may be used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map, and to determine, according to the predicted mask information, the final text box for locating the text position in the predicted image sample.
  • As an example, the text position detection model may be based on the Mask-RCNN framework: the feature extraction layer may correspond to the deep residual network in the Mask-RCNN framework, the candidate region recommendation layer may correspond to the region proposal network (RPN) layer in the Mask-RCNN framework, each level of the cascaded multi-level text box branches may include a RoIAlign layer and a fully connected layer of the Mask-RCNN framework, and the mask branch may include a series of convolutional layers.
  • In addition, the aforementioned features of the predicted image sample may include the correlation of pixels in the predicted image sample, but are not limited thereto.
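  • As a rough correspondence only, a Mask-RCNN-style model with a ResNet-101 backbone and text-oriented anchors could be assembled with torchvision as sketched below. Note the caveats: torchvision's stock MaskRCNN has a single box head rather than the cascaded branches described here, the anchor sizes and aspect ratios shown are placeholders, and the exact backbone-builder signature varies across torchvision versions.

    from torchvision.models.detection import MaskRCNN
    from torchvision.models.detection.anchor_utils import AnchorGenerator
    from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

    # Deep residual network (ResNet-101) with FPN as the feature extraction layer.
    backbone = resnet_fpn_backbone("resnet101", pretrained=False)

    # One anchor size per FPN level; the same text-oriented aspect-ratio set
    # (placeholder values) is reused at every level.
    anchor_gen = AnchorGenerator(
        sizes=((16,), (32,), (64,), (128,), (256,)),
        aspect_ratios=((0.2, 0.6, 1.0, 2.2, 5.0),) * 5)

    # Two classes: background and text. The RPN plays the role of the
    # candidate region recommendation layer; the box and mask heads play the
    # roles of the text box branch and the mask branch.
    model = MaskRCNN(backbone, num_classes=2, rpn_anchor_generator=anchor_gen)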
  • Specifically, in step S520, the text position locating device 420 may first use the feature extraction layer to extract the features of the predicted image sample to generate a feature map, and use the candidate region recommendation layer to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map. Then, the text position locating device 420 may use the cascaded multi-level text box branches to predict the initial candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and select, through the first non-maximum suppression operation, the horizontal text boxes whose text box coincidence degree is less than the first coincidence degree threshold from the initial candidate horizontal text boxes as the candidate horizontal text boxes.
  • Next, the text position locating device 420 may use the mask branch to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map, determine the primary selected text boxes according to the predicted mask information of the text, and select, through the second non-maximum suppression operation, the text boxes whose text box coincidence degree is less than the second coincidence degree threshold from the determined primary selected text boxes as the final text box. Here, the first coincidence degree threshold is greater than the second coincidence degree threshold.
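  • Assembling the earlier sketches (the nms and mask_to_box helpers), one pass of step S520 could be summarized as follows. The model attribute names and the 0.8/0.2 thresholds (the example values discussed above) are illustrative assumptions, and for simplicity the fine NMS is applied to the axis-aligned bounds of each possibly rotated rectangle; a production system could use a rotated-IoU variant instead.

    import numpy as np

    def locate_text(image, model, t1=0.8, t2=0.2):
        # Features -> candidate regions -> cascaded box branches -> coarse
        # NMS -> mask branch -> minimum circumscribed rectangles -> fine NMS.
        fmap = model.extract_features(image)          # feature extraction layer
        regions = model.propose_regions(fmap)         # candidate region recommendation
        boxes, scores = model.cascade_branches(fmap, regions)

        keep = nms(boxes, scores, overlap_thresh=t1)  # first NMS: coarse screen
        boxes, scores = boxes[keep], scores[keep]

        rects, rect_scores = [], []
        for mask, score in zip(model.mask_branch(fmap, boxes), scores):
            rect = mask_to_box(mask)                  # primary selected text box
            if rect is not None:
                rects.append(rect); rect_scores.append(score)

        aabbs = np.array([[r[:, 0].min(), r[:, 1].min(),
                           r[:, 0].max(), r[:, 1].max()] for r in rects])
        keep = nms(aabbs, np.array(rect_scores), overlap_thresh=t2)  # second NMS
        return [rects[i] for i in keep]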
  • After multiple predicted image samples of different sizes of the same image are acquired and the above operations are performed for each size, the text positioning method according to the exemplary embodiment of the present disclosure may further include a step of merging the prediction results for the predicted image samples of each size (not shown).
  • For example, in this step, for the predicted image sample of a first size, the text position locating device 420 may, after using the text position detection model to determine the text boxes for locating the text position in the predicted image sample of the first size, select from those text boxes the first text boxes whose size is greater than a first threshold; and, for the predicted image sample of a second size, after using the text position detection model to determine the text boxes for locating the text position in the predicted image sample of the second size, select from those text boxes the second text boxes whose size is less than a second threshold, wherein the first size is smaller than the second size. Subsequently, in this step, the text position locating device 420 may use the third non-maximum suppression operation to filter the selected first text boxes and second text boxes, so as to obtain the final text box for locating the text position in the image.
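  • The size-dependent selection and merging could be sketched as follows; the choice of the longer side as the box size, the 64/256-pixel thresholds, and the 0.5 third-NMS threshold are all illustrative assumptions (boxes are assumed to have been mapped back to the original image's coordinates), and nms refers to the sketch given earlier.

    import numpy as np

    def merge_scales(boxes_s, scores_s, boxes_l, scores_l,
                     t_first=64.0, t_second=256.0, nms_thresh=0.5):
        # boxes_s/scores_s: predictions on the first-size (smaller) sample;
        # boxes_l/scores_l: predictions on the second-size (larger) sample.
        longer = lambda b: np.maximum(b[:, 2] - b[:, 0], b[:, 3] - b[:, 1])

        k1 = longer(boxes_s) > t_first    # keep large boxes from the small sample
        k2 = longer(boxes_l) < t_second   # keep small boxes from the large sample

        boxes = np.concatenate([boxes_s[k1], boxes_l[k2]])
        scores = np.concatenate([scores_s[k1], scores_l[k2]])
        keep = nms(boxes, scores, overlap_thresh=nms_thresh)   # third NMS
        return boxes[keep], scores[keep]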
  • As mentioned in the description of step S520 above, the text position locating device 420 may use the candidate region recommendation layer to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map.
  • Specifically, for example, the text position locating device 420 may use the candidate region recommendation layer to predict, based on the generated feature map, the difference between the candidate text regions and the preset anchor boxes, determine the initial candidate text regions based on the difference and the anchor boxes, and use a fourth non-maximum suppression operation to select the predetermined number of candidate text regions from the initial candidate text regions.
  • Here, the aspect ratios of the anchor boxes may be determined by collecting statistics on the aspect ratios of the text boxes marked in the training image sample set during the training phase of the text position detection model (the training of the text position detection model is described above with reference to FIGS. 1 and 3).
  • As an example, the above-mentioned cascaded multi-level text box branches may be three levels of text box branches. Taking three-level text box branches as an example, the operation, mentioned in the description of step S520, of using the cascaded multi-level text box branches to predict the initial candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map is briefly described below.
  • Specifically, the text position locating device 420 may use the first-level text box branch to extract the features corresponding to each candidate text region from the feature map, predict the position deviation between each candidate text region and the real text region as well as the confidence that each candidate text region includes text and the confidence that it does not, and determine the first-level horizontal text boxes according to the prediction results of the first-level text box branch. Subsequently, the text position locating device 420 may use the second-level text box branch to extract the features corresponding to the first-level horizontal text boxes from the feature map, predict the position deviation between the first-level horizontal text boxes and the real text regions as well as the confidence that the first-level horizontal text boxes include and do not include text, and determine the second-level horizontal text boxes according to the prediction results of the second-level text box branch. Finally, the text position locating device 420 may use the third-level text box branch to extract the features corresponding to the second-level horizontal text boxes from the feature map, predict the position deviation between the second-level horizontal text boxes and the real text regions as well as the confidence that the second-level horizontal text boxes include and do not include text, and determine the initial candidate horizontal text boxes according to the prediction results of the third-level text box branch.
  • In addition, as mentioned in the description of step S520 above, the primary selected text boxes are determined according to the predicted mask information of the text.
  • the text position locating device 420 may determine the minimum circumscribed rectangle containing the text according to the predicted mask information of the text, and use the determined minimum circumscribed rectangle as the primary selected text box.
  • As described above with reference to FIG. 4, the text positioning system 400 may further include a display device. Accordingly, after step S520, the text positioning method shown in FIG. 5 may further include displaying, on the image, the final text box for locating the text position in the image, wherein the final text box may include a horizontal text box and/or a rotated text box.
  • The text positioning method according to the exemplary embodiment can improve the text position detection performance by using a text position detection model that includes cascaded multi-level text box branches, and, owing to the introduction of the two-level non-maximum suppression operation, can effectively prevent missed detections and overlapping text boxes, so that not only horizontal text but also rotated text can be located.
  • In addition, by performing multi-scale transformation on the acquired image, making predictions on predicted image samples of different sizes of the same image, and merging the text boxes determined for the predicted image samples of different sizes, the text position detection effect can be further improved.
  • The model training system and model training method, as well as the text positioning system and text positioning method, according to the exemplary embodiments of the present disclosure have been described above with reference to FIGS. 1 to 5. However, it should be understood that the systems and devices shown in FIG. 1 and FIG. 4 may each be configured as software, hardware, firmware, or any combination thereof to perform specific functions. For example, these systems or devices may correspond to dedicated integrated circuits, to pure software code, or to modules combining software and hardware. In addition, one or more of the functions implemented by these systems or devices may also be uniformly performed by components in a physical entity device (for example, a processor, a client, or a server).
  • In addition, the above methods may be implemented by instructions recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein, when the instructions are executed by at least one computing device, the at least one computing device is caused to perform the following steps: acquiring a training image sample set, wherein the text positions in the training image samples are marked with text boxes; and training a deep neural network-based text position detection model based on the training image sample set, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract features of an image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the image.
  • In addition, according to another exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein, when the instructions are executed by at least one computing device, the at least one computing device is caused to perform the following steps: acquiring a predicted image sample; and using a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract the features of the predicted image sample to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the predicted image sample.
  • The instructions stored in the above computer-readable storage medium can be executed in an environment deployed in computer equipment such as a client, a host, an agent device, or a server. It should be noted that the instructions may also perform more specific processing when the above steps are executed; the content of this further processing has already been mentioned in the processes described with reference to FIG. 3 and FIG. 5, and is therefore not repeated here.
  • It should be noted that the model training system and the text positioning system according to the exemplary embodiments of the present disclosure may rely entirely on the running of computer programs or instructions to implement the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the whole system is invoked through a dedicated software package (for example, a lib library) to implement the corresponding functions.
  • On the other hand, when the systems and devices shown in FIG. 1 and FIG. 4 are implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that at least one processor or at least one computing device can perform the corresponding operations by reading and running the corresponding program code or code segments.
  • For example, according to an exemplary embodiment of the present disclosure, a system including at least one computing device and at least one storage device storing instructions may be provided, wherein, when the instructions are executed by the at least one computing device, the at least one computing device is caused to perform the following steps: acquiring a training image sample set, wherein the text positions in the training image samples are marked with text boxes; and training a deep neural network-based text position detection model based on the training image sample set, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract features of the image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the image.
  • For example, according to another exemplary embodiment of the present disclosure, a system including at least one computing device and at least one storage device storing instructions may be provided, wherein, when the instructions are executed by the at least one computing device, the at least one computing device is caused to perform the following steps: acquiring a predicted image sample; and using a pre-trained deep neural network-based text position detection model to determine the final text box for locating the text position in the predicted image sample, wherein the text position detection model includes a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract the features of the predicted image sample to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features corresponding to each candidate text region in the feature map, and the mask branch is used to predict the mask information of the text in the candidate horizontal text boxes based on the features corresponding to the candidate horizontal text boxes in the feature map and to determine, according to the predicted mask information, the final text box for locating the text position in the predicted image sample.
  • the above-mentioned system can be deployed in a server or a client, and can also be deployed on a node in a distributed network environment.
  • the system may be a PC computer, a tablet device, a personal digital assistant, a smart phone, a web application, or other devices capable of executing the above set of instructions.
  • the system may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, a touch input device, etc.).
  • all components of the system may be connected to each other via a bus and/or a network.
  • the system does not have to be a single system, and may also be any collection of devices or circuits that can execute the above-mentioned instructions (or instruction sets) individually or jointly.
  • the system may also be a part of an integrated control system or a system manager, or may be configured as a portable electronic device interconnected with a local or remote (e.g., via wireless transmission) interface.
  • the at least one computing device may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor.
  • the at least one computing device may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
  • The computing device can run instructions or code stored in one of the storage devices, which can also store data. Instructions and data can also be sent and received over a network via a network interface device, which may employ any known transmission protocol.
  • the storage device can be integrated with the computing device, for example, RAM or flash memory is arranged in an integrated circuit microprocessor or the like.
  • the storage device may include an independent device, such as an external disk drive, a storage array, or any other storage device that can be used by a database system.
  • the storage device and the computing device may be operatively coupled, or may communicate with each other through, for example, an I/O port, a network connection, etc., so that the computing device can read the instructions stored in the storage device.
  • As described above, the text position detection model according to the exemplary embodiments of the present disclosure includes cascaded multi-level text box branches; moreover, in the method and system for training a text detection model according to the exemplary embodiments of the present disclosure, size and/or rotation transformations are applied to the training sample set before training, the anchor boxes are redesigned, and a hard-example learning mechanism is added during training, so the trained text position detection model can provide a better text position detection effect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

一种在图像中定位文本位置的方法,包括:获取预测图像样本;利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,其中,文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,特征提取层提取预测图像样本的特征以生成特征图,候选区域推荐层基于特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支基于特征图中与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。

Description

文本位置定位方法和系统以及模型训练方法和系统
本申请要求申请号为201910682132.X,申请日为2019年7月26日,名称为“文本位置定位方法和系统以及模型训练方法和系统”的中国专利申请的优先权,其中,上述申请公开的内容通过引用结合在本申请中。
技术领域
本公开总体说来涉及人工智能领域,更具体地,涉及一种在图像中定位文本位置的方法和系统、以及训练文本位置检测模型的方法和系统。
背景技术
图像中的文本蕴含着丰富的信息,提取这些信息(即,文本识别)对图像所处场景的理解等具有重要意义。文本识别分为两个步骤:文本的检测(即,定位文本位置)和文本的识别(即,识别文本的内容),两者缺一不可,而文本检测作为文本识别的前提条件,尤为关键。然而,复杂场景或自然场景下的文本检测效果常因为以下一些难点而使得文本检测效果较差:(1)拍摄角度不一,使文本存在变形的可能;(2)文本存在多个方向,可能存在水平文本和旋转文本;(3)文本尺寸大小不一,紧密程度不一,同一张图像同时存在长文本和短文本,排布紧密或松散。
近些年来,虽然人工智能技术的发展为图像中的文本识别技术提供了有利的技术支持,并且也出现了一些较为优秀的文本检测方法(例如,faster-rcnn、mask-rcnn、east、ctpn、fots、pixel-link等),然而,这些文本检测方法的文本检测效果仍然较差。例如,faster-rcnn、mask-rcnn只支持水平文本的检测,而无法检测旋转文本;east、fots受限于网络的感受野,因此对长文本的检测效果不佳,会出现长文本头尾框不住的现象;ctpn虽然支持旋转文本检测但是旋转文本的检测效果较差;pixel-link遇到文本密集排布现象时,会把多行文本当成一个整体,文本检测效果仍然欠佳。
发明内容
本公开在于至少解决现有文本检测方式中存在的以上难点,以便提高文本位置检测效果。
根据本公开示例性实施例,提供了一种在图像中定位文本位置的方法,所述方法可包括:获取预测图像样本;利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。
根据本公开另一示例性实施例,提供了一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行如上所述的在图像中定位文本位置的方法。
根据本公开另一示例性实施,提供了一种包括至少一个计算装置和存储指令的至少一个存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行机器学习建模过程的实现方法的以下步骤:获取预测图像样本;利用预 先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。
根据本公开另一示例性实施例,提供了一种在图像中定位文本位置的系统,所述系统可包括:预测图像样本获取装置,被配置为获取预测图像样本;文本位置定位装置,被配置为利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。
根据本公开另一示例性实施例,提供了一种训练文本位置检测模型的方法,所述方法可包括:获取训练图像样本集,其中,训练图像样本中对文本位置进行了文本框标记;基于训练图像样本集训练基于深度神经网络的文本位置检测模型,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取图像的特征以生成特征图,候选区域推荐层用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。
根据本公开另一示例性实施例,提供了一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行如上所述的训练文本位置检测模型的方法。
根据本公开另一示例性实施例,提供了一种包括至少一个计算装置和存储指令的至少一个存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行机器学习建模过程的实现方法的以下步骤:获取训练图像样本集,其中,训练图像样本中对文本位置进行了文本框标记;基于训练图像样本集训练基于深度神经网络的文本位置检测模型,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取图像的特征以生成特征图,候选区域推荐层用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。
根据本公开另一示例性实施例,提供了一种训练文本位置检测模型的系统,所述系统可包括:训练图像样本集获取装置,被配置为获取训练图像样本集,其中,训练图像样本中对文本位置进行了文本框标记;模型训练装置,被配置为基于训练图像样本集训练基于深度神经网络的文本位置检测模型,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取图像的 特征以生成特征图,候选区域推荐层用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。
根据本公开示例性实施例的文本位置检测模型包括级联的多级文本框分支,并且根据本公开示例性实施例的训练文本检测模型的方法和系统由于在训练前对训练样本集进行了尺寸和/或旋转变化,重新设计了锚点框,并且在训练过程中加入了难样本学习机制,因此,训练出的文本位置检测模型可提供更佳的文本位置检测效果。
此外,根据本公开示例性实施例的在图像中定位文本位置的方法和系统通过利用包括级联的多级文本框分支的文本位置检测模型,可提高文本检测性能,而且由于引入了两级非极大值抑制操作可有效防止漏检和文本框重叠,使得不仅可以定位水平文本而且可以定位旋转文本,此外,通过对获取的图像进行多尺度变换而针对同一图像的不同尺寸的预测图像样本进行预测并将针对不同尺寸的预测图像样本确定的文本框进行合并,可进一步提高图像中文本位置检测效果。
附图说明
从下面结合附图对本公开实施例的详细描述中,本公开的这些和/或其他方面和优点将变得更加清楚并更容易理解,其中:
图1是示出根据本公开示例性实施例的训练文本位置检测模型的系统的框图;
图2是根据本公开示例性实施例的文本位置检测模型的示意图;
图3是示出根据本公开示例性实施例的训练文本检测模型的方法的流程图;
图4是示出根据本公开示例性实施例的在图像中定位文本位置的系统的框图;
图5是示出根据本公开示例性实施例的在图像中定位文本位置的方法的流程图。
具体实施方式
为了使本领域技术人员更好地理解本公开,下面结合附图和具体实施方式对本公开的示例性实施例作进一步详细说明。在此需要说明的是,在本公开中出现的“若干项之中的至少一项”均表示包含“该若干项中的任意一项”、“该若干项中的任意多项的组合”、“该若干项的全体”这三类并列的情况。在本公开中出现的“和/或”均表示被其连接的前后两项或多项中的至少一项。例如,“包括A和B之中的至少一个”、“包括A和/或B”即包括如下三种并列的情况:(1)包括A;(2)包括B;(3)包括A和B。又例如,“执行步骤一和步骤二之中的至少一个”、“执行步骤一和/或步骤二”即表示如下三种并列的情况:(1)执行步骤一;(2)执行步骤二;(3)执行步骤一和步骤二。也就是说,“A和/或B”也可被表示为“A和B之中的至少一个”,“执行步骤一和/或步骤二”也可被表示为“执行步骤一和步骤二之中的至少一个”。
图1是示出根据本公开示例性实施例的训练文本位置检测模型的系统(在下文中,为描述方便,将其简称为“模型训练系统”)100的框图。
如图1所示,模型训练系统100可包括训练图像样本集获取装置110和模型训练装置120。
具体地,训练图像样本集获取装置110可获取训练图像样本集。这里,在训练图像样本集的训练图像样本中对文本位置进行了文本框标记,即,在图像中用文本框标记出了文本位置。作为示例,训练图像样本集获取装置110可直接从外部获取由其他装置产生的训练图像样本集,或者,训练图像样本集获取装置110可本身执行操作来构建训练图像样本集。例如,训练图像样本集获取装置110可通过手动、半自动或全自动的方式来获取训练图像样本集,并将获取的训练图像样本处理为适当的格式或形式。这里,训练图像样本 集获取装置110可通过输入装置(例如,工作站)接收用户手动导入的训练图像样本集,或者训练图像样本集获取装置110可通过全自动的方式从数据源获取训练图像样本集,例如,通过以软件、固件、硬件或其组合实现的定时器机制来系统地请求数据源将训练图像样本集发送给训练图像样本集获取装置110,或者,也可在有人工干预的情况下自动进行训练图像样本集的获取,例如,在接收到特定的用户输入的情况下请求获取训练图像样本集。当获取到训练图像样本集时,优选地,训练图像样本集获取装置110可将获取的样本集存储在非易失性存储器(例如,数据仓库)中。
模型训练装置120可基于训练图像样本集训练基于深度神经网络的文本位置检测模型。这里,深度神经网络可以是卷积神经网络,但不限于此。
图2示出根据本公开示例性实施例的文本位置检测模型的示意图。如图2所示,文本位置检测模型可包括特征提取层210、候选区域推荐层220、级联的多级文本框分支230(为方便示意,图2中将多级文本框分支示意为包括三级文本框分支,但这仅是示例,级联的多级文本框分支不限于仅包括三级文本框分支)以及掩膜分支240。具体地,特征提取层可用于提取图像的特征以生成特征图,候选区域推荐层可用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支可用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支可用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。这里,所述最终的文本框可包括水平文本框和/或旋转文本框。也就是说,本公开的文本检测模型既可以检测水平文本,也可检测旋转文本。
在此需要说明的是,本公开中的所述最终的文本框可包括水平文本框和/或旋转文本框,还可以表述为:所述最终的文本框可包括水平文本框和旋转文本框之中的至少一个。
作为示例,图2的文本位置检测模型可基于Mask-RCNN框架,此时,特征提取层可对应于Mask-RCNN框架中的深度残差网络(例如,resnet 101),候选区域推荐层可对应于Mask-RCNN框架中的区域推荐网络RPN层,级联的多级文本框分支中的每一级文本框分支可包括Mask-RCNN框架中的RolAlign层和全连接层,掩膜分支包括一系列卷积层。本领域技术人员均清楚Mask-RCNN框架中的深度残差网络、RPN层、RolAlign层和全连接层的功能和操作,因此,这里不对其进行详细介绍。
本领域技术人员均了解,传统的Mask-RCNN框架不仅只包括一个文本框分支,而且在RPN层确定了预定数量个候选区域(例如,2000个)之后,从这些候选区域中随机抽样一些候选区域(例如,512个),并将抽样的候选区域分别送给文本框分支和掩膜分支。然而,这样的结构以及随机抽样候选区域分别送给文本框分支和掩膜分支的操作导致传统Mask-RCNN框架的文本位置检测效果较差。这是因为,一级文本框分支仅能检测与真实文本框标记的重叠度在一定范围内的候选区域,而随机抽样不利于模型对难样本的学习,比如,如果2000个候选区域存在大量简单样本,较少难样本,则随机抽样会较大概率把一些简单样本送给文本框分支和掩膜分支,从而导致模型学习效果较差。针对此,本公开提出的上述包括多级文本框分支并且将多级文本框分支点的输出作为掩膜分支的输入的构思可有效地提高文本位置检测效果。
下面,将对本公开的文本位置检测模型的训练进行详细描述。
如本公开背景技术中所描述的,自然场景中由于图像拍摄角度不一,会存在文本变形的可能,并且可能存在平面旋转和三维立体旋转,因此,根据本公开示例实施例,模型训练系统100除了包括训练图像样本集获取装置110和模型训练装置120之外,还可包括预处理装置(未示出)。这里,预处理装置可在基于训练图像样本集训练所述文本位置检测模型之前,对训练图像样本集中的训练图像样本进行尺寸变换和/或透射变换以获得变换后的训练图像样本集,从而使得训练图像样本更切近真实场景。具体而言,预处理装置可在不保持训练图像样本的原始宽高比的情况下,对训练图像样本进行随机的尺寸变换使 得训练图像样本的宽和高在预定范围内。这里,之所以不保持训练图像样本的原始宽高比就是为了模拟真实场景中的压缩和拉伸。例如,可将训练图像样本的宽和高随机变换到640至2560个像素之间,但是预定范围不限于此。此外,对训练图像样本进行透射变换可以包括使训练图像样本中像素的坐标分别绕x轴、y轴和z轴进行随机旋转。例如,可以将训练图像样本中的每个像素绕x轴随机旋转(-45,45),绕y轴随机旋转(-45,45),绕z轴随机旋转(-30,30),增强后的训练图像样本将更加符合真实场景。例如,可通过下面的等式对文本框坐标进行变换:
$$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = R \begin{pmatrix} x \\ y \\ z \end{pmatrix}$$

其中,$R$ 为由绕x轴、y轴和z轴的旋转矩阵 $R_x(\theta_x)$、$R_y(\theta_y)$、$R_z(\theta_z)$ 相乘得到的旋转矩阵,$\theta_x$ 为绕x轴随机旋转(-45,45),$\theta_y$ 为绕y轴随机旋转(-45,45),$\theta_z$ 为绕z轴随机旋转(-30,30)得到,$(x, y, z)^{T}$ 为变换前的坐标,通常z的取值为1,$(x', y', z')^{T}$ 为变换后的坐标,变换后的文本框坐标可表示为 $x = x'/z'$,$y = y'/z'$。
在此需要说明的是,本公开中的对训练图像样本集中的训练图像样本进行尺寸变换和/或透射变换以获得变换后的训练图像样本集,还可以表述为:对训练图像样本集中的训练图像样本进行尺寸变换和透射变换之中的至少一个,以获得变换后的训练图像样本集。
在预处理装置对训练图像样本集进行变换之后,模型训练装置120可基于变换后的训练图像样本集训练上述文本检测模型。具体地,模型训练装置120可以进行以下操作来训练上述文本检测模型:将经过变换的训练图像样本输入上述文本位置检测模型;利用特征提取层提取输入的训练图像样本的特征以生成特征图;利用候选区域推荐层基于生成的特征图在输入的训练图像样本中确定预定数量个的候选文本区域;利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测候选水平文本框,并根据文本框分支的预测结果和文本框标记来计算与每个候选文本区域对应的文本框预测损失;将所述预定数量个候选文本区域按照其对应的文本框预测损失进行排序,并根据排序结果筛选出文本框预测损失最大的前特定数量个的候选文本区域;利用掩膜分支基于特征图中与筛选出的候选文本区域对应的特征来预测筛选出的候选文本区域中的掩膜信息,并通过比较预测出的掩膜信息与文本的真实掩膜信息来计算掩膜预测损失;通过使文本框预测损失和掩膜预测损失的总和最小来训练文本位置检测模型。
作为示例,图像的特征可以包括图像中像素的相关度,但不限于此。模型训练装置120可利用特征提取层提取训练图像样本中像素的相关度来生成特征图。随后,模型训练装置120可利用候选区域推荐层基于生成的特征图预测候选文本区域与预先设置的锚点框之间的差异,根据该差异和锚点框确定初始候选文本区域,并利用非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域。这里,由于预测出的初始候选文本区域可能会存在彼此重叠的现象,因此,本公开利用非极大值抑制操作来对初始候选文本区域进行筛选。下面,简要地对非极大值抑制操作进行描述。具体地,可从与锚点框的差异最小的初始候选文本区域开始,分别判断其他初始候选文本框与该初始候选文本区 域的重叠度是否大于某个设定的阈值,如果存在大于该阈值的初始候选文本区域则将其去除,也就是说,保留重叠度小于该阈值的初始候选文本区域。然后,再在所有保留下来的初始候选文本区域之中再选择一个与锚点框的差异最小的初始候选文本区域,并继续判断该初始候选文本区域与其他初始候选文本区域的重叠度,如果重叠度大于阈值则删除,否则保留,直至筛选出预定数量个候选文本区域。
这里,预先设置的锚点框是预先设置的图像中每个可能的文本框,以用于与真实文本框进行匹配。传统的基于Mask-RCNN框架的模型的锚点的宽高比集合是固定的,该集合为[0.5,1,2],也就是说,锚点的宽高比仅有0.5、1和2这三种。利用这三种宽高比的锚点在一些通用的目标检测数据集(例如,coco数据集)上基本能够覆盖目标,但是,在文本场景中确远远不足以覆盖文本。这是因为,文本场景中宽高比范围很大,1:5,5:1的文本很常见,如果用传统Mask-RCNN的仅具有三种固定宽高比的锚点框会导致锚点框和真实的文本框匹配不上,从而导致文本漏检。因此,根据本公开示例性实施例,模型训练装置120还可在训练所述文本位置检测模型之前,统计变换后的训练图像样本集中标记的所有文本框的宽高比,并且根据统计的所有文本框的宽高比设置所述锚点框的宽高比集合。也就是说,本公开可对锚点框的宽高比进行重新设计。具体地,例如,在统计了变换后的训练图像样本集中标记的所有文本框的宽高比之后,可将统计的所有文本框的宽高比进行排序,根据排序后的宽高比确定锚点框的宽高比的上限值和下限值,在上限值和下限值之间等比例地进行插值,并将由上限值和下限值以及通过插值得到的值构成的集合作为所述锚点框的宽高比集合。例如,可以将所有文本框的宽高比由小到大排序后处于第5%的宽高比和处于第95%的宽高比分别确定为锚点框的宽高比的下限值和上限值,然后在上限值和下限值之间等比例地进行三次插值来得到另外三个宽高比,并将由上限值和下限值以及通过插值得到的三个值构成的集合作为锚点框的宽高比集合。然而,以上确定锚点框的宽高比集合的方式仅是示例,上限值和下限值的选取方式以及插值的方式和次数均不限于以上示例。通过根据以上方式设计锚点框的宽高比集合,可以有效地减少文本框的漏检。
如上所述,在确定了预定数量个候选文本区域之后,模型训练装置120可利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测每个候选文本区域与文本框标记之间的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度,并根据预测的位置偏差和置信度计算与每个候选文本区域对应的文本框预测损失。作为示例,如图2所示,所述级联的多级文本框分支可以是三级文本框分支,但不限于此。
另外,如上所述,本公开提出了难样本学习机制,也就是说,将所述预定数量个候选文本区域按照其对应的文本框预测损失进行排序,根据排序结果筛选出文本框预测损失最大的前特定数量个的候选文本区域,并将筛选出的候选文本区域输入掩膜分支进行掩膜信息预测。例如,可根据文本框预测损失从2000个候选区域中选出文本框预测损失较大的512个候选文本区域。为此,模型训练装置120可根据利用文本框分支预测的位置偏差和置信度来计算与每个候选文本区域对应的文本框预测损失。具体而言,例如,针对每个候选文本区域,模型训练装置120可分别根据每一级文本框分支的预测结果和真实文本框标记来计算每一级文本框分支的文本框预测损失,并通过将各级文本框分支的文本框预测损失求和来确定与每个候选文本区域对应的文本框预测损失。这里,文本框预测损失包括与每个候选文本区域对应的置信度预测损失和位置偏差预测损失。此外,针对每一级文本框分支设置的用于计算每一级文本框分支的文本框预测损失的重叠度阈值彼此不同,并且针对前一级文本框分支设置的重叠度阈值小于针对后一级文本框分支设置的重叠度阈值。这里,重叠度阈值是每一级文本框分支预测出的水平文本框与文本框标记之间的重叠度阈值。重叠度(IOU)可以是两个文本框之间的交集除以两个文本框的并集所获得的值。例如,在所述多级文本框分支是三级文本框分支的情况下,针对第一级文本框分支至第三级文本框分支设置的重叠度阈值可以分别是0.5、0.6和0.7。具体地,例如,在计算第一级 文本框预测损失时,如果针对候选文本区域预测出的水平文本框与训练图像样本中的文本框标记之间的重叠度阈值大于0.5,则该候选文本区域被确定为是针对第一级文本框分支的正样本,小于0.5则被确定为是负样本。但是当阈值取0.5时会有较多的误检,因为0.5的阈值会使得正样本中有较多的背景,这是较多文本位置误检的原因。如果用0.7的重叠度阈值,则可以减少误检,但检测效果不一定最好,主要原因在于重叠度阈值越高,正样本的数量就越少,因此过拟合的风险就越大。然而,本公开由于采取级联的多级文本框分支,并且针对每一级文本框分支设置的用于计算每一级文本框分支的文本框预测损失的重叠度阈值彼此不同,而且针对前一级文本框分支设置的重叠度阈值小于针对后一级文本框分支设置的重叠度阈值,因此能够让每一级文本框分支都专注于检测与真实文本框标记重叠度在某一范围内的候选文本区域,因此文本检测效果会越来越好。
在筛选出文本框预测损失较大的候选文本区域之后,模型训练装置120可利用掩膜分支基于特征图中与筛选出的候选文本区域对应的特征来预测筛选出的候选文本区域中的掩膜信息(具体地,可将预测为文本的像素的掩膜设置为1,不是文本的像素的掩膜设置为0),并通过比较预测出的掩膜信息与文本的真实掩膜信息来计算掩膜预测损失。具体地,例如,模型训练装置120可利用筛选出的候选文本区域内的像素之间的相关度来预测掩膜信息。这里,可以默认认为文本框标记中的像素的掩膜值均为1,并且将其作为真实掩膜信息。模型训练装置120可通过不断利用训练图像样本对文本位置检测模型进行训练,直至使所有的文本框预测损失和掩膜预测损失的总和最小,从而完成文本位置检测模型的训练。
以上,已经参照图1和图2对根据本公开示例性实施例的模型训练系统和文本位置检测模型进行了描述。由于本公开的文本位置检测模型包括级联的多级文本框分支,并且在训练前对训练样本集进行了尺寸和/或旋转变化,重新设计了锚点框,并且在训练过程中加入了难样本学习机制,因此,训练出的文本位置检测模型可提供更佳的文本位置检测效果。
在此需要说明的是,本公开中的在训练前对训练样本集进行了尺寸和/或旋转变化,还可以表述为:在训练前对训练样本集进行了尺寸变化和旋转变化之中的至少一个。
需要说明的是,尽管以上在描述模型训练系统100时将其划分为用于分别执行相应处理的装置(例如,训练图像样本集获取装置110和模型训练装置120),然而,本领域技术人员清楚的是,上述各装置执行的处理也可以在模型训练系统100不进行任何具体装置划分或者各装置之间并无明确划界的情况下执行。此外,以上参照图1所描述的模型训练系统100并不限于包括以上描述的装置,而是还可以根据需要增加一些其他装置(例如,存储装置、数据处理装置等),或者以上装置也可被组合。
图3是示出根据本公开示例性实施例的训练文本位置检测模型的方法(以下,为描述方便,将其简称为“模型训练方法”)的流程图。
这里,作为示例,图3所示的模型训练方法可由图1所示的模型训练系统100来执行,也可完全通过计算机程序或指令以软件方式实现,还可通过特定配置的计算系统或计算装置来执行,例如,可通过包括至少一个计算装置和至少一个存储指令的存储装置的系统来执行,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行上述模型训练方法。为了描述方便,假设图3所示的模型训练方法由图1所示的模型训练系统100来执行,并假设模型训练系统100可具有图1所示的配置。
参照图3,在步骤S310,训练图像样本集获取装置110可获取训练图像样本集,其中,训练图像样本中对文本位置进行了文本框标记。接下来,在步骤S320,模型训练装置120可基于训练图像样本集训练基于深度神经网络的文本位置检测模型。如参照图2所述,文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取图像的特征以生成特征图,候选区域推荐层用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特 征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。作为示例,文本位置检测模型可基于Mask-RCNN框架,特征提取层对应于Mask-RCNN框架中的深度残差网络,候选区域推荐层对应于Mask-RCNN框架中的区域推荐网络RPN层,级联的多级文本框分支中的每一级文本框分支包括Mask-RCNN框架中的RolAlign层和全连接层,掩膜分支包括一系列卷积层。此外,图像的特征可包括图像中像素的相关度,但不限于此。这里,最终的文本框可包括水平文本框和/或旋转文本框。
根据示例性实施例的模型训练方法还可在步骤S310和步骤S320之间包括对获取的训练图像样本集进行变换的步骤(未示出)。具体地,可在基于训练图像样本集训练所述文本位置检测模型之前(即,在步骤S320之前),对训练图像样本集中的训练图像样本进行尺寸变换和/或透射变换以获得变换后的训练图像样本集。以上,已经参照图1对如何对训练图像样本进行尺寸变换和透射变换进行了描述,详细细节可参照图1的描述,这里不再赘述。
在对训练图像样本集进行变换之后,在步骤S320,模型训练装置120可执行以下操作来训练文本位置检测模型:将经过变换的训练图像样本输入所述文本位置检测模型;利用特征提取层提取输入的训练图像样本的特征以生成特征图;利用候选区域推荐层基于生成的特征图在输入的训练图像样本中确定预定数量个的候选文本区域;利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测每个候选文本区域与文本框标记之间的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度,并根据预测的位置偏差和置信度计算与每个候选文本区域对应的文本框预测损失;将所述预定数量个候选文本区域按照其对应的文本框预测损失进行排序,并根据排序结果筛选出文本框预测损失最大的前特定数量个的候选文本区域;利用掩膜分支基于特征图中与筛选出的候选文本区域对应的特征来预测筛选出的候选文本区域中的掩膜信息,并通过比较预测出的掩膜信息与文本的真实掩膜信息来计算掩膜预测损失;通过使文本框预测损失和掩膜预测损失的总和最小来训练文本位置检测模型。
在利用候选区域推荐层基于生成的特征图在输入的训练图像样本中确定预定数量个的候选文本区域时,模型训练装置120可利用候选区域推荐层基于生成的特征图预测候选文本区域与预先设置的锚点框之间的差异,根据该差异和锚点框确定初始候选文本区域,并利用非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域。相应地,图3所示的模型训练方法还可包括设置锚点框的步骤(未示出),例如,该步骤可包括:在训练所述文本位置检测模型之前,统计变换后的训练图像样本集中标记的所有文本框的宽高比,并且根据统计的所有文本框的宽高比设置所述锚点框的宽高比集合。此外,该步骤还可包括:根据统计的文本框的大小设置锚点框的大小,或者将锚点框的大小设置为固定的一些大小,例如,16×16、32×32、64×64、128×128和256×256,本公开对锚点框的大小或设置锚点框大小的方式并未限制,这是因为,一般对于文本位置检测而言,锚点框宽高比的设置对于文本检测效果的影响更大。
作为示例,可通过以下操作来设置所述锚点框的宽高比集合:将统计的所有文本框的宽高比进行排序;根据排序后的宽高比确定所述锚点框的宽高比的上限值和下限值,在上限值和下限值之间等比例地进行插值,并将由上限值和下限值以及通过插值得到的值构成的集合作为所述锚点框的宽高比集合。
根据示例性实施例,所述级联的多级文本框分支可以是三级文本框分支,但不限于此。另外,关于如何根据预测的位置偏差和置信度来计算与每个候选文本区域对应的文本框预测损失的操作以及针对每一级文本框分支设置用于计算每一级文本框分支的文本框预测损失的重叠度阈值的相关描述也可参照图1的相应描述,这里不再赘述。事实上,由于图3所示的模型训练方法由图1所述的模型训练系统100执行,因此,以上参照图1 在描述模型训练系统中包括的各个装置时所提及的内容均适用于这里,故关于以上步骤中所涉及的相关细节,可参见图1的相应描述,这里均不再赘述。
以上描述的根据示例性实施例的模型训练方法由于文本位置检测模型包括级联的多级文本框分支,并且在训练前对训练样本集进行了尺寸和/或旋转变化,重新设计了锚点框,并且在训练过程中加入了难样本学习机制,因此,利用上述模型训练方法训练出的文本位置检测模型可提供更佳的文本位置检测效果。
在下文中,将参照图4和图5对利用上述训练出的文本位置检测模型在图像中定位文本位置的过程进行描述。
图4是示出根据本公开示例性实施例的在图像中定位文本位置的系统(以下,为描述方便,将其简称为“文本定位系统”)400的框图。
参照图4,文本定位系统400可包括预测图像样本获取装置410和文本位置定位装置420。具体地,预测图像样本获取装置410可被配置为获取预测图像样本,文本位置定位装置420可被配置为利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框。这里,文本位置检测模型可包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。作为示例,预测图像样本的特征可预测图像样本中像素的相关度,但不限于此。此外,作为示例,文本位置检测模型可以基于Mask-RCNN框架,并且特征提取层对应于Mask-RCNN框架中的深度残差网络,候选区域推荐层对应于Mask-RCNN框架中的区域推荐网络RPN层,级联的多级文本框分支中的每一级文本框分支包括Mask-RCNN框架中的RolAlign层和全连接层,掩膜分支可以包括一系列卷积层。以上参照图2关于文本位置检测模型的描述均适应于这里,这里不再赘述。
由于同一张图像中可能同时存在长文本和短文本,而如果始终将图像放大或缩小到一定尺寸后输入文本位置检测模型,则可能不能够同时较好地检测到长文本和短文本。这是因为,如果将图像放大到较大尺寸,则短文本的检测性能较好,而如果将图像缩小到较小尺寸,则长文本的检测性能较好。因此,在本公开中,对图像进行多尺度预测。具体地,预测图像样本获取装置410可首先获取图像,然后对获取的图像进行多尺度缩放来获取与所述图像对应的不同尺寸的多个预测图像样本。随后,文本位置定位装置420可针对不同尺寸的多个预测图像样本分别利用预先训练的文本位置检测模型来确定用于在预测图像样本中定位文本位置的最终的文本框,最后,将针对每种尺寸的预测图像样本确定的文本框进行合并来得到最终的结果。这里,图像可来源于任何数据源,本公开对图像的来源、图像的具体获取方式等均无限制。
针对每种尺寸的预测图像样本,文本位置定位装置420可通过执行以下操作来确定用于在预测图像样本中定位文本位置的最终的文本框:利用特征提取层提取预测图像样本的特征以生成特征图;利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域;利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框,并且通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框;利用掩膜分支,基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,根据预测出的文本的掩膜信息确定初选文本框,并且通过第二非极大值抑制操作从确定的初选文本框中筛选出文本框重合度小于第二重合度阈值的文本框作为所述最终的文本框,其中,第一重合度阈值大于第二重合度阈值。
接下来,文本位置定位装置420可将针对不同尺寸的预测图像样本确定的文本框进行合并。具体地,针对第一尺寸的预测图像样本,文本位置定位装置420可在利用所述文本位置检测模型确定了用于在第一尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸大于第一阈值的第一文本框,并且针对第二尺寸的预测图像样本,在利用所述文本位置检测模型确定了用于在第二尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸小于第二阈值的第二文本框,其中,第一尺寸小于第二尺寸。也就是说,在合并的时候,对于较大尺寸的图像预测样本,保留小尺寸的文本框,而对于较小尺寸的图像预测样本,保留大尺寸的文本框。例如,如果先前获取的预测图像样本的尺寸分别是800像素大小和1600像素大小,则在将800像素大小和1600像素大小的预测图像样本分别输入文本位置检测模型而分别得到在预测图像样本中定位文本位置的文本框之后,对于800像素大小的预测图像样本,文本位置定位装置420可保留相对大的文本框而过滤掉相对小的文本框(具体地可通过以上提及的第一阈值的设置来进行保留),然而,对于1600像素大小的预测图像样本,文本位置定位装置420可保留相对小的文本框而过滤掉相对大的文本框(具体地,可通过以上提及的第二阈值的设置来进行保留)。接下来,文本位置定位装置420可将过滤后的结果进行合并。具体地,文本位置定位装置420可利用第三非极大值抑制操作对选择的第一文本框和第二文本框进行筛选,以得到用于在所述图像中定位文本位置的最终的文本框。例如,文本位置定位装置420可将所有选择的第一文本框和第二文本框按照其置信度进行排名并选择置信度最大的一个文本框,然后计算其余文本框与该文本框的重叠度,如果重叠度大于阈值则删除,否则保留,而最终保留的文本框即为在图像中定位文本位置的最终的文本框。
下面,具体地对文本位置定位装置420针对每个预测图像样本执行的操作所涉及的一些细节进行描述。需要说明的是,在接下来的描述中,为了避免对公知的功能和结构的描述会用不必要的细节模糊本公开的构思,因此将省略对公知的功能、结构和术语的描述。
首先,如上所述,为了确定在预测图像样本中定位文本位置的文本框,文本定位装置420可利用特征提取层提取预测图像样本的特征以生成特征图,具体地,例如可以利用Mask-RCNN框架中的深度残差网络(例如,resnet101)提取预测图像样本的像素之间的相关度作为特征来生成特征图。然而,本公开对所使用的预测图像样本的特征以及具体的特征提取方式并无任何限制。
接下来,文本位置定位装置420可利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域,例如,文本位置定位装置420可利用候选区域推荐层基于生成的特征图预测候选文本区域与预先设置的锚点框之间的差异,根据该差异和锚点框确定初始候选文本区域,并利用第四非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域。这里,所述锚点框的宽高比可以是以上描述的通过在所述文本位置检测模型的训练阶段对训练图像样本集中所标记的文本框的宽高比进行统计而确定的。利用非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域的具体细节已经在参照图1的描述中提及,因此,这里不再赘述。
随后,文本位置定位装置420可利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框,并且通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框。作为示例,所述级联的多级文本框分支可以是三级文本框分支,下面,以三级文本框为例对利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框进行描述。
具体地,文本位置定位装置420可首先利用第一级文本框分支,从特征图中提取与每个候选文本区域对应的特征并预测每个候选文本区域与真实文本区域的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度,并且根据第一级文本框分支的预测结果确定第一级水平文本框。例如,文本位置定位装置420可利用第一级文本框分 支中的RolAlign层从特征图中提取与每个候选文本区域对应的特征,并利用第一级文本框分支中的全连接层预测每个候选文本区域与真实文本区域的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度。然后,文本位置定位装置420可根据预测的置信度去除部分置信度较低的候选文本区域,并根据保留的候选文本区域及其与真实文本区域的位置偏差确定第一级水平文本框。
在确定了第一级水平文本框之后,文本位置定位装置420可利用第二级文本框分支,从特征图中提取与第一级水平文本框对应的特征并预测第一级水平文本框与真实文本区域的位置偏差以及第一级水平文本框包括文本的置信度和不包括文本的置信度,并根据第二级文本框分支的预测结果确定第二级水平文本框。同样地,例如,文本位置定位装置420可利用第二级文本框分支中的RolAlign层从特征图中提取与第一级水平文本框对应的特征(即,提取与第一级水平文本框中的像素区域对应的特征),并利用第二级文本框分支中的全连接层预测第一级水平文本框与真实文本区域的位置偏差以及第一级水平文本框包括文本的置信度和不包括文本的置信度。然后,文本位置定位装置420可根据预测的置信度去除部分置信度较低的第一级水平文本框,并根据保留的第一级水平文本框及其与真实文本区域的位置偏差确定第二级水平文本框。
在确定了第二级水平文本框之后,文本位置定位装置420可利用第三级文本框分支,从特征图中提取与第二级水平文本框对应的特征并预测第二级水平文本框与真实文本区域的位置偏差以及第二级水平文本框包括文本的置信度和不包括文本的置信度,并根据第三级文本框分支的预测结果确定初始候选水平文本框。同样地,例如,文本位置定位装置420可利用第三级文本框分支中的RolAlign层从特征图中提取与第二级水平文本框对应的特征(即,提取与第二级水平文本框中的像素区域对应的特征),并利用第三级文本框分支中的全连接层预测第二级水平文本框与真实文本区域的位置偏差以及第二级水平文本框包括文本的置信度和不包括文本的置信度。然后,文本位置定位装置420可根据预测的置信度去除部分置信度较低的第二级水平文本框,并根据保留的第二级水平文本框及其与真实文本区域的位置偏差确定初始候选水平文本框。
如上所述,在预测出初始候选水平文本框之后,文本位置定位装置420可通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框。具体地,文本位置定位装置420可首先根据初始候选水平文本框的置信度选择置信度最大的初始候选水平文本框,然后计算其余初始候选水平文本框与置信度最大的初始候选水平文本框的文本框重合度,如果文本框重合度小于第一重合度阈值则保留,否则删除。所有保留的水平文本框被作为候选水平文本框输入掩膜分支。
接下来,文本位置定位装置420可利用掩膜分支,基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息。具体地,例如,文本位置定位装置420可基于特征图中与候选水平文本框中的像素对应的像素相关度特征来预测候选水平文本框中的文本的掩膜信息。随后,文本位置定位装置420可根据预测出的文本的掩膜信息确定初选文本框。具体而言,例如,文本位置定位装置420可根据预测出的文本的掩膜信息确定包含文本的最小外接矩形,并将确定的最小外接矩形作为初选文本框。例如,文本位置定位装置420可根据预测出的文本的掩膜信息使用最小外接矩形函数确定包含文本的最小外部矩形。
在确定了初选文本框之后,文本位置定位装置420可通过第二非极大值抑制操作从确定的初选文本框中筛选出文本框重合度小于第二重合度阈值的文本框作为所述最终的文本框。具体地,例如,文本位置定位装置420可首先根据初始候选水平文本框的置信度选择置信度最大的初始候选水平文本框,然后计算其余初始候选水平文本框与置信度最大的初始候选水平文本框的文本框重合度,如果文本框重合度小于第一重合度阈值则保留,否则删除。
需要说明的是,以上提及的第一重合度阈值大于第二重合度阈值。传统的 Mask-RCNN框架中只有一级非极大值抑制,并且重合度阈值被固定设置为0.5,也就是说,在筛选时会删除重合度高于0.5的水平文本框。然而,对于旋转角度较大的密集文字,如果重合度阈值设置为0.5,则会导致部分文本框的漏检。而如果提高重合度阈值(例如,将重合度阈值设置为0.8,即,删除重合度高于0.8的文本框),则会导致最后预侧的水平文本框重叠较多。针对此,本公开提出了两级非极大值抑制的构思。即,如上所述,在利用级联的多级文本框分支预测出初始候选水平文本框,先通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框。随后,在利用掩膜分支预测出候选水平文本框中的文本的掩膜信息并根据预测出的文本的掩膜信息确定了初选文本框之后,通过第二非极大值抑制操作从确定的初选文本框中筛选出文本框重合度小于第二重合度阈值的文本框作为所述最终的文本框。而通过将第一重合度阈值大于第二重合度阈值(例如,第一重合度阈值可设置为0.8,第二重合度阈值可设置为0.2),可实现先利用第一非极大值抑制操作对通过级联的多级文本框分支确定的文本框进行粗筛,然后,利用第二非极大值抑制操作对通过掩膜分支确定的文本框进行细筛。最终,经过两级非极大值抑制操作和调整两级非极大值抑制操作所使用的重合度阈值,不仅可以定位水平文本而且可以定位旋转文本。
此外,图4所示的文本定位系统400还可以包括显示装置(未示出)。显示装置可在所述图像上显示用于在所述图像中定位文本位置的最终的文本框,从而可方便用户直观地确定文本的位置。这里,所述最终的文本框包括水平文本框和/或旋转文本框。
根据示例性实施例的文本定位系统通过利用包括级联的多级文本框分支的文本位置检测模型,可提高文本检测性能,而且由于引入了两级非极大值抑制操作可有效防止漏检和文本框重叠,使得不仅可以定位水平文本而且可以定位旋转文本。此外,通过对获取的图像进行多尺度变换之后针对同一图像的不同尺寸的预测图像样本进行预测并将针对不同尺寸的预测图像样本确定的文本框进行合并,可进一步提高文本位置检测效果,使得即使在图像中同时存在不同尺寸的文本时,也可提供较好的文本位置检测效果。
另外,需要说明的是,尽管以上在描述文本定位系统400时将其划分为用于分别执行相应处理的装置(例如,预测图像样本获取装置410和文本位置定位装置420),然而,本领域技术人员清楚的是,上述各装置执行的处理也可以在文本定位系统400不进行任何具体装置划分或者各装置之间并无明确划界的情况下执行。此外,以上参照图4所描述的文本定位系统400并不限于包括以上描述的预测图像样本获取装置410、文本位置定位装置420和显示装置,而是还可以根据需要增加一些其他装置(例如,存储装置、数据处理装置等),或者以上装置也可被组合。而且,作为示例,以上参照图1描述的模型训练系统100和文本定位系统400也可被组合为一个系统,或者它们可以是彼此独立的系统,本公开对此并无限制。
图5是示出根据本公开示例性实施例的在图像中定位文本位置的方法(以下,为描述方便,将其简称为“文本定位方法”)的流程图。
这里,作为示例,图5所示的文本定位方法可由图4所示的文本定位系统400来执行,也可完全通过计算机程序或指令以软件方式实现,还可通过特定配置的计算系统或计算装置来执行,例如,可通过包括至少一个计算装置和至少一个存储指令的存储装置的系统来执行,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行上述文本定位方法。为了描述方便,假设图5所示的文本定位方法由图4所示的文本定位系统400来执行,并假设文本定位系统400可具有图4所示的配置。
参照图5,在步骤S510,预测图像样本获取装置410可获取预测图像样本。例如,在步骤S510,预测图像样本获取装置410可首先获取图像,然后对获取的图像进行多尺度缩放来获取与所述图像对应的不同尺寸的多个预测图像样本。
接下来,在步骤S520,文本位置定位装置420可利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框。这里,所 述文本位置检测模型可包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支。具体地,特征提取层可用于提取预测图像样本的特征以生成特征图,候选区域推荐层可用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支可用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支可用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。作为示例,文本位置检测模型可基于Mask-RCNN框架,特征提取层可对应于Mask-RCNN框架中的深度残差网络,候选区域推荐层可对应于Mask-RCNN框架中的区域推荐网络RPN层,级联的多级文本框分支中的每一级文本框分支可包括Mask-RCNN框架中的RolAlign层和全连接层,并且掩膜分支可包括一系列卷积层。此外,以上提及的预测图像样本的特征可包括预测图像样本中像素的相关度,但不限于此。
具体地,在步骤S520,文本位置定位装置420可首先利用特征提取层提取预测图像样本的特征以生成特征图,并利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域。然后,文本位置定位装置420可利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框,并且通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框。接下来,文本位置定位装置420可利用掩膜分支,基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,根据预测出的文本的掩膜信息确定初选文本框,并且通过第二非极大值抑制操作从确定的初选文本框中筛选出文本框重合度小于第二重合度阈值的文本框作为所述最终的文本框。这里,第一重合度阈值大于第二重合度阈值。
在获取了同一图像的不同尺寸的多个预测图像样本,并对每个尺寸的预测图像样本分别执行以上操作之后,根据本公开示例性实施例的文本定位方法还可包括对针对每个尺寸的预测图像样本的预测结果进行合并的步骤(未示出)。例如,在该步骤中,针对第一尺寸的预测图像样本,文本位置定位装置420可在利用所述文本位置检测模型确定了用于在第一尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸大于第一阈值的第一文本框,并且针对第二尺寸的预测图像样本,文本位置定位装置420可在利用所述文本位置检测模型确定了用于在第二尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸小于第二阈值的第二文本框,其中,第一尺寸小于第二尺寸。随后,在该步骤中,文本位置定位装置420可利用第三非极大值抑制操作对选择的第一文本框和第二文本框进行筛选,以得到用于在所述图像中定位文本位置的最终的文本框。
在以上步骤S520的描述中提及文本位置定位装置420可利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域。具体地,例如,文本位置定位装置520可利用候选区域推荐层基于生成的特征图预测候选文本区域与预先设置的锚点框之间的差异,根据该差异和锚点框确定初始候选文本区域,并利用第四非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域。这里,所述锚点框的宽高比可以是通过在所述文本位置检测模型的训练阶段(以上参照图1和图3描述了文本位置检测模型的训练)对训练图像样本集中所标记的文本框的宽高比进行统计而确定的。
作为示例,以上提及的级联的多级文本框分支可以是三级文本框分支。为方便描述,以三级文本框分支为例,对在步骤S520的描述中提及的利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框的操作进行简要描述。具体地,文本位置定位装置420可利用第一级文本框分支,从特征图中提取与每个候选文本区域对应的特征并预测每个候选文本区域与真实文本区域的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度,并且根据第一级文本框分支的预测结果确定第一级水平文本框;随后,文本位置定位装置420可利用第二级文本框分支,从 特征图中提取与第一级水平文本框对应的特征并预测第一级水平文本框与真实文本区域的位置偏差以及第一级水平文本框包括文本的置信度和不包括文本的置信度,并根据第二级文本框分支的预测结果确定第二级水平文本框;最后,文本位置定位装置420可利用第三级文本框分支,从特征图中提取与第二级水平文本框对应的特征并预测第二级水平文本框与真实文本区域的位置偏差以及第二级水平文本框包括文本的置信度和不包括文本的置信度,并根据第三级文本框分支的预测结果确定初始候选水平文本框。
此外,在以上对步骤S520的描述中提及根据预测出的文本的掩膜信息确定初选文本框。具体地,文本位置定位装置420可根据预测出的文本的掩膜信息确定包含文本的最小外接矩形,并将确定的最小外接矩形作为初选文本框。
如以上参照图4所述,文本定位系统400还可包括显示装置,相应地,图5所示的文本定位方法在步骤S5290之后,可包括在所述图像上显示用于在所述图像中定位文本位置的最终的文本框。这里,所述最终的文本框可包括水平文本框和/或旋转文本框。
由于图5所示的文本定位方法可由图4所示的文本定位系统400来执行,因此,关于以上步骤中所涉及的相关细节,可参见关于图4的相应描述,这里不再赘述。
根据示例性实施例的文本定位方法通过利用包括级联的多级文本框分支的文本位置检测模型,可提高文本位置检测性能,而且由于引入了两级非极大值抑制操作可有效防止漏检和文本框重叠,使得不仅可以定位水平文本而且可以定位旋转文本。此外,通过对获取的图像进行多尺度变换而针对同一图像的不同尺寸的预测图像样本进行预测并将针对不同尺寸的预测图像样本确定的文本框进行合并,可进一步提高文本位置检测效果。
以上已参照图1至图5描述了根据本公开示例性实施例模型训练系统和模型训练方法以及文本定位系统和文本定位方法。
然而,应理解的是:图1和图4所示出的系统及其装置可被分别配置为执行特定功能的软件、硬件、固件或上述项的任意组合。例如,这些系统或装置可对应于专用的集成电路,也可对应于纯粹的软件代码,还可对应于软件与硬件相结合的模块。此外,这些系统或装置所实现的一个或多个功能也可由物理实体设备(例如,处理器、客户端或服务器等)中的组件来统一执行。
此外,上述方法可通过记录在计算机可读存储介质上的指令来实现,例如,根据本公开的示例性实施例,可提供一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行以下步骤:获取训练图像样本集,其中,训练图像样本中对文本位置进行了文本框标记;基于训练图像样本集训练基于深度神经网络的文本位置检测模型,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取图像的特征以生成特征图,候选区域推荐层用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。
此外,根据本公开的另一示例性实施例,可提供一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行以下步骤:获取预测图像样本;利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜 信息确定用于在预测图像样本中定位文本位置的最终的文本框。
上述计算机可读存储介质中存储的指令可在诸如客户端、主机、代理装置、服务器等计算机设备中部署的环境中运行,应注意,所述指令还可在执行上述步骤时执行更为具体的处理,这些进一步处理的内容已经在参照图3和图5描述的过程中提及,因此这里为了避免重复将不再进行赘述。
应注意,根据本公开示例性实施例的模型训练系统和文本定位系统可完全依赖计算机程序或指令的运行来实现相应的功能,即,各个装置在计算机程序的功能架构中与各步骤相应,使得整个系统通过专门的软件包(例如,lib库)而被调用,以实现相应的功能。
另一方面,当图1和图4所示的系统和装置以软件、固件、中间件或微代码实现时,用于执行相应操作的程序代码或者代码段可以存储在诸如存储介质的计算机可读介质中,使得至少一个处理器或至少一个计算装置可通过读取并运行相应的程序代码或者代码段来执行相应的操作。
例如,根据本公开示例性实施例,可提供一种包括至少一个计算装置和存储指令的至少一个存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行下述步骤:获取训练图像样本集,其中,训练图像样本中对文本位置进行了文本框标记;基于训练图像样本集训练基于深度神经网络的文本位置检测模型,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取图像的特征以生成特征图,候选区域推荐层用于基于生成的特征图在图像中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在图像中定位文本位置的最终的文本框。
例如,根据本公开另一示例性实施例,可提供一种包括至少一个计算装置和存储指令的至少一个存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行下述步骤:获取预测图像样本;利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。
具体说来,上述系统可以部署在服务器或客户端中,也可以部署在分布式网络环境中的节点上。此外,所述系统可以是PC计算机、平板装置、个人数字助理、智能手机、web应用或其他能够执行上述指令集合的装置。此外,所述系统还可包括视频显示器(诸如,液晶显示器)和用户交互接口(诸如,键盘、鼠标、触摸输入装置等)。另外,所述系统的所有组件可经由总线和/或网络而彼此连接。
这里,所述系统并非必须是单个系统,还可以是任何能够单独或联合执行上述指令(或指令集)的装置或电路的集合体。所述系统还可以是集成控制系统或系统管理器的一部分,或者可被配置为与本地或远程(例如,经由无线传输)以接口互联的便携式电子装置。
在所述系统中,所述至少一个计算装置可包括中央处理器(CPU)、图形处理器(GPU)、可编程逻辑装置、专用处理器系统、微控制器或微处理器。作为示例而非限制,所述至少一个计算装置还可包括模拟处理器、数字处理器、微处理器、多核处理器、处理器阵列、网络处理器等。计算装置可运行存储在存储装置之一中的指令或代码,其中,所述存储装 置还可以存储数据。指令和数据还可经由网络接口装置而通过网络被发送和接收,其中,所述网络接口装置可采用任何已知的传输协议。
存储装置可与计算装置集成为一体,例如,将RAM或闪存布置在集成电路微处理器等之内。此外,存储装置可包括独立的装置,诸如,外部盘驱动、存储阵列或任何数据库系统可使用的其他存储装置。存储装置和计算装置可在操作上进行耦合,或者可例如通过I/O端口、网络连接等互相通信,使得计算装置能够读取存储在存储装置中的指令。
以上描述了本公开的各示例性实施例,应理解,上述描述仅是示例性的,并非穷尽性的,本公开不限于所披露的各示例性实施例。在不偏离本公开的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。因此,本公开的保护范围应该以权利要求的范围为准。
工业实用性
在本公开提供的文本位置定位方法和系统以及模型训练方法和系统中,文本位置检测模型包括级联的多级文本框分支,并且根据本公开示例性实施例的训练文本检测模型的方法和系统由于在训练前对训练样本集进行了尺寸和/或旋转变化,重新设计了锚点框,并且在训练过程中加入了难样本学习机制,因此,训练出的文本位置检测模型可提供更佳的文本位置检测效果。

Claims (42)

  1. 一种包括至少一个计算装置和存储指令的至少一个存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行机器学习建模过程的实现方法的以下步骤:
    获取预测图像样本;
    利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,
    其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。
  2. 如权利要求1所述的系统,其中,利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框的步骤包括:
    利用特征提取层提取预测图像样本的特征以生成特征图;
    利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域;
    利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框,并且通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框;
    利用掩膜分支,基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,根据预测出的文本的掩膜信息确定初选文本框,并且通过第二非极大值抑制操作从确定的初选文本框中筛选出文本框重合度小于第二重合度阈值的文本框作为所述最终的文本框,其中,第一重合度阈值大于第二重合度阈值。
  3. 如权利要求2所述的系统,其中,获取预测图像样本的步骤包括:获取图像,并且对获取的图像进行多尺度缩放来获取与所述图像对应的不同尺寸的多个预测图像样本,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置还执行以下步骤:针对第一尺寸的预测图像样本,在利用所述文本位置检测模型确定了用于在第一尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸大于第一阈值的第一文本框,并且针对第二尺寸的预测图像样本,在利用所述文本位置检测模型确定了用于在第二尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸小于第二阈值的第二文本框,其中,第一尺寸小于第二尺寸;利用第三非极大值抑制操作对选择的第一文本框和第二文本框进行筛选,以得到用于在所述图像中定位文本位置的最终的文本框。
  4. 如权利要求2或3所述的系统,其中,所述级联的多级文本框分支是三级文本框分支,其中,利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框包括:
    利用第一级文本框分支,从特征图中提取与每个候选文本区域对应的特征并预测每个候选文本区域与真实文本区域的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度,并且根据第一级文本框分支的预测结果确定第一级水平文本框;
    利用第二级文本框分支,从特征图中提取与第一级水平文本框对应的特征并预测第一级水平文本框与真实文本区域的位置偏差以及第一级水平文本框包括文本的置信度和不包括文本的置信度,并根据第二级文本框分支的预测结果确定第二级水平文本框;
    利用第三级文本框分支,从特征图中提取与第二级水平文本框对应的特征并预测第二级水平文本框与真实文本区域的位置偏差以及第二级水平文本框包括文本的置信度和不包括文本的置信度,并根据第三级文本框分支的预测结果确定初始候选水平文本框。
  5. 如权利要求2所述的系统,其中,利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域的步骤包括:
    利用候选区域推荐层基于生成的特征图预测候选文本区域与预先设置的锚点框之间的差异,根据该差异和锚点框确定初始候选文本区域,并利用第四非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域,
    其中,所述锚点框的宽高比是通过在所述文本位置检测模型的训练阶段对训练图像样本集中所标记的文本框的宽高比进行统计而确定的。
  6. 如权利要求2所述的系统,其中,根据预测出的文本的掩膜信息确定初选文本框包括:根据预测出的文本的掩膜信息确定包含文本的最小外接矩形,并将确定的最小外接矩形作为初选文本框。
  7. 如权利要求3所述的系统,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置还执行以下步骤:在所述图像上显示用于在所述图像中定位文本位置的最终的文本框,其中,所述最终的文本框包括水平文本框和旋转文本框之中的至少一个。
  8. 如权利要求1所述的系统,其中,所述文本位置检测模型基于Mask-RCNN框架,特征提取层对应于Mask-RCNN框架中的深度残差网络,候选区域推荐层对应于Mask-RCNN框架中的区域推荐网络RPN层,级联的多级文本框分支中的每一级文本框分支包括Mask-RCNN框架中的RolAlign层和全连接层,掩膜分支包括一系列卷积层。
  9. 如权利要求1所述的系统,其中,预测图像样本的特征包括预测图像样本中像素的相关度。
  10. 一种在图像中定位文本位置的方法,包括:
    获取预测图像样本;
    利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框,
    其中,所述文本位置检测模型包括特征提取层、候选区域推荐层、级联的多级文本框分支以及掩膜分支,其中,特征提取层用于提取预测图像样本的特征以生成特征图,候选区域推荐层用于基于生成的特征图在预测图像样本中确定预定数量个候选文本区域,级联的多级文本框分支用于基于特征图中的与每个候选文本区域对应的特征来预测候选水平文本框,掩膜分支用于基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,并根据预测出的掩膜信息确定用于在预测图像样本中定位文本位置的最终的文本框。
  11. 如权利要求10所述的方法,其中,利用预先训练的基于深度神经网络的文本位置检测模型确定用于在预测图像样本中定位文本位置的最终的文本框的步骤包括:
    利用特征提取层提取预测图像样本的特征以生成特征图;
    利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域;
    利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框,并且通过第一非极大值抑制操作从初始候选水平文本框中筛选出文本框重合度小于第一重合度阈值的水平文本框作为候选水平文本框;
    利用掩膜分支,基于特征图中与候选水平文本框对应的特征来预测候选水平文本框中的文本的掩膜信息,根据预测出的文本的掩膜信息确定初选文本框,并且通过第二非极大值抑制操作从确定的初选文本框中筛选出文本框重合度小于第二重合度阈值的文本框作为所述最终的文本框,其中,第一重合度阈值大于第二重合度阈值。
  12. 如权利要求11所述的方法,其中,获取预测图像样本的步骤包括:获取图像, 并且对获取的图像进行多尺度缩放来获取与所述图像对应的不同尺寸的多个预测图像样本,所述方法还包括:针对第一尺寸的预测图像样本,在利用所述文本位置检测模型确定了用于在第一尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸大于第一阈值的第一文本框,并且针对第二尺寸的预测图像样本,在利用所述文本位置检测模型确定了用于在第二尺寸的预测图像样本中定位文本位置的的文本框之后从该文本框中选择尺寸小于第二阈值的第二文本框,其中,第一尺寸小于第二尺寸;利用第三非极大值抑制操作对选择的第一文本框和第二文本框进行筛选,以得到用于在所述图像中定位文本位置的最终的文本框。
  13. 如权利要求11或12所述的方法,其中,所述级联的多级文本框分支是三级文本框分支,其中,利用级联的多级文本框分支基于特征图中的与每个候选文本区域对应的特征预测初始候选水平文本框包括:
    利用第一级文本框分支,从特征图中提取与每个候选文本区域对应的特征并预测每个候选文本区域与真实文本区域的位置偏差以及每个候选文本区域包括文本的置信度和不包括文本的置信度,并且根据第一级文本框分支的预测结果确定第一级水平文本框;
    利用第二级文本框分支,从特征图中提取与第一级水平文本框对应的特征并预测第一级水平文本框与真实文本区域的位置偏差以及第一级水平文本框包括文本的置信度和不包括文本的置信度,并根据第二级文本框分支的预测结果确定第二级水平文本框;
    利用第三级文本框分支,从特征图中提取与第二级水平文本框对应的特征并预测第二级水平文本框与真实文本区域的位置偏差以及第二级水平文本框包括文本的置信度和不包括文本的置信度,并根据第三级文本框分支的预测结果确定初始候选水平文本框。
  14. 如权利要求11所述的方法,其中,利用候选区域推荐层基于生成的特征图在预测图像样本中确定预定数量个的候选文本区域的步骤包括:
    利用候选区域推荐层基于生成的特征图预测候选文本区域与预先设置的锚点框之间的差异,根据该差异和锚点框确定初始候选文本区域,并利用第四非极大值抑制操作从初始候选文本区域中筛选出所述预定数量个候选文本区域,
    其中,所述锚点框的宽高比是通过在所述文本位置检测模型的训练阶段对训练图像样本集中所标记的文本框的宽高比进行统计而确定的。
  15. 如权利要求11所述的方法,其中,根据预测出的文本的掩膜信息确定初选文本框包括:根据预测出的文本的掩膜信息确定包含文本的最小外接矩形,并将确定的最小外接矩形作为初选文本框。
  16. 如权利要求12所述的方法,所述方法还包括:在所述图像上显示用于在所述图像中定位文本位置的最终的文本框,其中,所述最终的文本框包括水平文本框和旋转文本框之中的至少一个。
  17. 如权利要求10所述的方法,其中,所述文本位置检测模型基于Mask-RCNN框架,特征提取层对应于Mask-RCNN框架中的深度残差网络,候选区域推荐层对应于Mask-RCNN框架中的区域推荐网络RPN层,级联的多级文本框分支中的每一级文本框分支包括Mask-RCNN框架中的RolAlign层和全连接层,掩膜分支包括一系列卷积层。
  18. 如权利要求10所述的方法,其中,预测图像样本的特征包括预测图像样本中像素的相关度。
  19. 一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行如权利要求10至18中的任一权利要求所述的方法。
  20. A system for locating text positions in an image, comprising:
    a predicted image sample acquisition device configured to obtain a predicted image sample; and
    a text position locating device configured to determine, using a pre-trained deep neural network-based text position detection model, the final text boxes for locating text positions in the predicted image sample,
    wherein the text position detection model comprises a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract features of the predicted image sample to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the predicted image sample based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features in the feature map corresponding to each candidate text region, and the mask branch is used to predict mask information of the text in the candidate horizontal text boxes based on the features in the feature map corresponding to the candidate horizontal text boxes and to determine, from the predicted mask information, the final text boxes for locating text positions in the predicted image sample.
  21. A system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps of a method for implementing a machine learning modeling process:
    obtaining a training image sample set, wherein text positions in the training image samples are marked with text boxes;
    training a deep neural network-based text position detection model based on the training image sample set,
    wherein the text position detection model comprises a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract features of an image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features in the feature map corresponding to each candidate text region, and the mask branch is used to predict mask information of the text in the candidate horizontal text boxes based on the features in the feature map corresponding to the candidate horizontal text boxes and to determine, from the predicted mask information, the final text boxes for locating text positions in the image.
  22. The system of claim 21, wherein the instructions, when executed by the at least one computing device, further cause the at least one computing device to perform the following step: before training the text position detection model based on the training image sample set, applying at least one of a size transformation and a perspective transformation to the training image samples in the training image sample set to obtain a transformed training image sample set,
    wherein applying a size transformation to a training image sample comprises: randomly resizing the training image sample, without preserving its original aspect ratio, such that the width and the height of the training image sample fall within a predetermined range; and
    applying a perspective transformation to a training image sample comprises: randomly rotating the coordinates of the pixels of the training image sample about the x-axis, the y-axis, and the z-axis, respectively.
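The two augmentations of claim 22 might be realized as below; this is a sketch under stated assumptions (the size range, the 10-degree rotation budget, and the pinhole-projection approximation of the three-axis rotation are illustrative choices, not values from the patent).

    import cv2
    import numpy as np

    def random_resize(img, lo=640, hi=1280, rng=np.random):
        # Width and height are drawn independently, deliberately not preserving
        # the original aspect ratio, as the claim requires.
        w = int(rng.randint(lo, hi + 1))
        h = int(rng.randint(lo, hi + 1))
        return cv2.resize(img, (w, h))

    def random_perspective(img, max_deg=10.0, rng=np.random):
        # Rotating pixel coordinates about the x-, y- and z-axes and re-projecting
        # onto the image plane amounts to warping by the homography K R K^-1.
        h, w = img.shape[:2]
        f = float(max(h, w))                      # nominal focal length
        ax, ay, az = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
        Rx = np.array([[1, 0, 0],
                       [0, np.cos(ax), -np.sin(ax)],
                       [0, np.sin(ax), np.cos(ax)]])
        Ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                       [0, 1, 0],
                       [-np.sin(ay), 0, np.cos(ay)]])
        Rz = np.array([[np.cos(az), -np.sin(az), 0],
                       [np.sin(az), np.cos(az), 0],
                       [0, 0, 1]])
        K = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1]])
        H = K @ (Rz @ Ry @ Rx) @ np.linalg.inv(K)
        return cv2.warpPerspective(img, H, (w, h))

Rotation about z alone is an in-plane rotation; the small x and y components add the perspective foreshortening that makes the trained model robust to oblique shooting angles.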
  23. The system of claim 22, wherein the step of training the text position detection model based on the training image sample set comprises:
    inputting the transformed training image samples into the text position detection model;
    extracting features of the input training image samples using the feature extraction layer to generate a feature map;
    determining a predetermined number of candidate text regions in the input training image samples based on the generated feature map using the candidate region recommendation layer;
    predicting, using the cascaded multi-level text box branches and based on the features in the feature map corresponding to each candidate text region, the position deviation between each candidate text region and the text box marks, together with the confidence that each candidate text region contains text and the confidence that it does not, and computing a text box prediction loss corresponding to each candidate text region from the predicted position deviations and confidences;
    sorting the predetermined number of candidate text regions by their corresponding text box prediction losses, and selecting, from the sorting result, the specific number of candidate text regions with the largest text box prediction losses;
    predicting, using the mask branch and based on the features in the feature map corresponding to the selected candidate text regions, mask information in the selected candidate text regions, and computing a mask prediction loss by comparing the predicted mask information with the real mask information of the text; and
    training the text position detection model by minimizing the sum of the text box prediction losses and the mask prediction loss.
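The sorting-and-selection step of claim 23 is, in effect, online hard example mining: only the candidate regions the box branches currently handle worst are passed on to the mask branch. A minimal sketch, with the per-region loss composition and the top-k value as assumptions:

    import torch

    def select_hard_examples(box_losses, top_k=128):
        # Keep the top-k candidate regions with the largest text-box prediction loss.
        _, idx = torch.sort(box_losses, descending=True)
        return idx[:top_k]

    # Hypothetical use inside one training step:
    #   box_losses = cls_loss_per_region + reg_loss_per_region
    #   hard = select_hard_examples(box_losses)
    #   mask_loss = mask_branch_loss(fmap, regions[hard], gt_masks[hard])
    #   total_loss = box_losses.sum() + mask_loss     # the jointly minimized sum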
  24. The system of claim 23, wherein determining a predetermined number of candidate text regions in the input training image samples based on the generated feature map using the candidate region recommendation layer comprises:
    using the candidate region recommendation layer to predict, based on the generated feature map, the differences between candidate text regions and preset anchor boxes, determining initial candidate text regions from those differences and the anchor boxes, and filtering the predetermined number of candidate text regions out of the initial candidate text regions using a non-maximum suppression operation.
  25. The system of claim 24, wherein the instructions, when executed by the at least one computing device, further cause the at least one computing device to perform the following step: before training the text position detection model, collecting statistics on the aspect ratios of all the text boxes marked in the transformed training image sample set, and setting the aspect ratio set of the anchor boxes from the collected aspect ratios of all the text boxes.
  26. The system of claim 25, wherein setting the aspect ratio set of the anchor boxes from the statistics of the aspect ratios of all the text boxes comprises:
    sorting the collected aspect ratios of all the text boxes;
    determining an upper limit value and a lower limit value of the anchor box aspect ratios from the sorted aspect ratios, interpolating in equal proportions between the upper limit value and the lower limit value, and taking the set consisting of the upper limit value, the lower limit value, and the interpolated values as the aspect ratio set of the anchor boxes.
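A minimal sketch of the statistics-driven aspect-ratio set of claims 25 and 26; the percentile bounds and the number of ratios are illustrative assumptions, and "interpolating in equal proportions" is read here as a geometric progression between the two limits.

    import numpy as np

    def anchor_aspect_ratios(labeled_ratios, num_ratios=5, lo_pct=5, hi_pct=95):
        ratios = np.sort(np.asarray(labeled_ratios, dtype=float))
        # Percentile-based lower and upper limits over the sorted statistics.
        lo, hi = np.percentile(ratios, [lo_pct, hi_pct])
        # Equal-proportion interpolation between the limits (geometric spacing).
        return np.geomspace(lo, hi, num=num_ratios)

    # e.g. anchor_aspect_ratios([0.2, 0.5, 1.0, 4.0, 8.0, 12.0], num_ratios=4)

Deriving the ratio set from the marked text boxes, instead of using a fixed default set, is what lets the anchors cover both long thin lines and short compact words in the same data set.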
  27. The system of claim 23, wherein computing the text box prediction loss corresponding to each candidate text region from the predicted position deviations and confidences comprises: for each candidate text region, computing the text box prediction loss of each level of text box branch from that level's prediction results and the text box marks, and determining the text box prediction loss corresponding to each candidate text region by summing the text box prediction losses of all levels of text box branches, wherein the text box prediction loss includes a confidence prediction loss and a position deviation prediction loss corresponding to each candidate text region,
    wherein the overlap thresholds set for the respective levels of text box branches for computing their text box prediction losses differ from one another, the overlap threshold set for a preceding level of text box branch being smaller than the overlap threshold set for the succeeding level, and wherein each overlap threshold is a threshold on the overlap between the horizontal text boxes predicted by that level of text box branch and the text box marks.
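The per-level overlap thresholds of claim 27 can be pictured as follows; the particular values are borrowed from the Cascade R-CNN convention and are not specified by the patent.

    import torch

    # Thresholds increase level by level, as the claim requires.
    LEVEL_IOU_THRESHOLDS = (0.5, 0.6, 0.7)

    def level_labels(iou_with_mark, level):
        # A predicted horizontal box counts as a positive (text) sample at a given
        # level only if its overlap with the text box mark exceeds that level's
        # threshold; 1 = text, 0 = background.
        return (iou_with_mark > LEVEL_IOU_THRESHOLDS[level]).long()

Each level therefore trains on progressively better-aligned boxes, which is what allows the later levels to specialize in fine refinement rather than coarse localization.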
  28. The system of claim 21, wherein the final text boxes include at least one of a horizontal text box and a rotated text box.
  29. The system of claim 21, wherein the text position detection model is based on the Mask-RCNN framework, the feature extraction layer corresponds to the deep residual network of the Mask-RCNN framework, the candidate region recommendation layer corresponds to the region proposal network (RPN) layer of the Mask-RCNN framework, each level of the cascaded multi-level text box branches comprises the RoIAlign layer and fully connected layers of the Mask-RCNN framework, and the mask branch comprises a series of convolutional layers.
  30. The system of claim 21, wherein the features of the image include the correlations between pixels in the image.
  31. A method of training a text position detection model, comprising:
    obtaining a training image sample set, wherein text positions in the training image samples are marked with text boxes;
    training a deep neural network-based text position detection model based on the training image sample set,
    wherein the text position detection model comprises a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract features of an image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features in the feature map corresponding to each candidate text region, and the mask branch is used to predict mask information of the text in the candidate horizontal text boxes based on the features in the feature map corresponding to the candidate horizontal text boxes and to determine, from the predicted mask information, the final text boxes for locating text positions in the image.
  32. The method of claim 31, further comprising: before training the text position detection model based on the training image sample set, applying at least one of a size transformation and a perspective transformation to the training image samples in the training image sample set to obtain a transformed training image sample set,
    wherein applying a size transformation to a training image sample comprises: randomly resizing the training image sample, without preserving its original aspect ratio, such that the width and the height of the training image sample fall within a predetermined range; and
    applying a perspective transformation to a training image sample comprises: randomly rotating the coordinates of the pixels of the training image sample about the x-axis, the y-axis, and the z-axis, respectively.
  33. The method of claim 32, wherein the step of training the text position detection model based on the training image sample set comprises:
    inputting the transformed training image samples into the text position detection model;
    extracting features of the input training image samples using the feature extraction layer to generate a feature map;
    determining a predetermined number of candidate text regions in the input training image samples based on the generated feature map using the candidate region recommendation layer;
    predicting, using the cascaded multi-level text box branches and based on the features in the feature map corresponding to each candidate text region, the position deviation between each candidate text region and the text box marks, together with the confidence that each candidate text region contains text and the confidence that it does not, and computing a text box prediction loss corresponding to each candidate text region from the predicted position deviations and confidences;
    sorting the predetermined number of candidate text regions by their corresponding text box prediction losses, and selecting, from the sorting result, the specific number of candidate text regions with the largest text box prediction losses;
    predicting, using the mask branch and based on the features in the feature map corresponding to the selected candidate text regions, mask information in the selected candidate text regions, and computing a mask prediction loss by comparing the predicted mask information with the real mask information of the text; and
    training the text position detection model by minimizing the sum of the text box prediction losses and the mask prediction loss.
  34. The method of claim 33, wherein determining a predetermined number of candidate text regions in the input training image samples based on the generated feature map using the candidate region recommendation layer comprises:
    using the candidate region recommendation layer to predict, based on the generated feature map, the differences between candidate text regions and preset anchor boxes, determining initial candidate text regions from those differences and the anchor boxes, and filtering the predetermined number of candidate text regions out of the initial candidate text regions using a non-maximum suppression operation.
  35. The method of claim 34, further comprising: before training the text position detection model, collecting statistics on the aspect ratios of all the text boxes marked in the transformed training image sample set, and setting the aspect ratio set of the anchor boxes from the collected aspect ratios of all the text boxes.
  36. The method of claim 35, wherein setting the aspect ratio set of the anchor boxes from the statistics of the aspect ratios of all the text boxes comprises:
    sorting the collected aspect ratios of all the text boxes;
    determining an upper limit value and a lower limit value of the anchor box aspect ratios from the sorted aspect ratios, interpolating in equal proportions between the upper limit value and the lower limit value, and taking the set consisting of the upper limit value, the lower limit value, and the interpolated values as the aspect ratio set of the anchor boxes.
  37. The method of claim 33, wherein computing the text box prediction loss corresponding to each candidate text region from the predicted position deviations and confidences comprises: for each candidate text region, computing the text box prediction loss of each level of text box branch from that level's prediction results and the text box marks, and determining the text box prediction loss corresponding to each candidate text region by summing the text box prediction losses of all levels of text box branches, wherein the text box prediction loss includes a confidence prediction loss and a position deviation prediction loss corresponding to each candidate text region,
    wherein the overlap thresholds set for the respective levels of text box branches for computing their text box prediction losses differ from one another, the overlap threshold set for a preceding level of text box branch being smaller than the overlap threshold set for the succeeding level, and wherein each overlap threshold is a threshold on the overlap between the horizontal text boxes predicted by that level of text box branch and the text box marks.
  38. The method of claim 31, wherein the final text boxes include at least one of a horizontal text box and a rotated text box.
  39. The method of claim 31, wherein the text position detection model is based on the Mask-RCNN framework, the feature extraction layer corresponds to the deep residual network of the Mask-RCNN framework, the candidate region recommendation layer corresponds to the region proposal network (RPN) layer of the Mask-RCNN framework, each level of the cascaded multi-level text box branches comprises the RoIAlign layer and fully connected layers of the Mask-RCNN framework, and the mask branch comprises a series of convolutional layers.
  40. The method of claim 31, wherein the features of the image include the correlations between pixels in the image.
  41. A computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the method of any one of claims 31 to 40.
  42. A system for training a text position detection model, comprising:
    a training image sample set acquisition device configured to obtain a training image sample set, wherein text positions in the training image samples are marked with text boxes; and
    a model training device configured to train a deep neural network-based text position detection model based on the training image sample set,
    wherein the text position detection model comprises a feature extraction layer, a candidate region recommendation layer, cascaded multi-level text box branches, and a mask branch, wherein the feature extraction layer is used to extract features of an image to generate a feature map, the candidate region recommendation layer is used to determine a predetermined number of candidate text regions in the image based on the generated feature map, the cascaded multi-level text box branches are used to predict candidate horizontal text boxes based on the features in the feature map corresponding to each candidate text region, and the mask branch is used to predict mask information of the text in the candidate horizontal text boxes based on the features in the feature map corresponding to the candidate horizontal text boxes and to determine, from the predicted mask information, the final text boxes for locating text positions in the image.
PCT/CN2020/103799 2019-07-26 2020-07-23 Text position locating method and system, and model training method and system WO2021017998A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910682132.XA CN110414499B (zh) 2019-07-26 2021-06-04 Text position locating method and system, and model training method and system
CN201910682132.X 2019-07-26

Publications (1)

Publication Number
WO2021017998A1 (zh)

Family

ID=68363166

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/103799 WO2021017998A1 (zh) 2019-07-26 2020-07-23 Text position locating method and system, and model training method and system

Country Status (2)

Country Link
CN (2) CN110414499B (zh)
WO (1) WO2021017998A1 (zh)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989995A (zh) * 2021-03-10 2021-06-18 北京百度网讯科技有限公司 Text detection method and apparatus, and electronic device
CN113012383A (zh) * 2021-03-26 2021-06-22 深圳市安软科技股份有限公司 Fire detection and alarm method, related system, related devices, and storage medium
CN113033346A (zh) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Text detection method and apparatus, and electronic device
CN113160144A (zh) * 2021-03-25 2021-07-23 平安科技(深圳)有限公司 Target object detection method and apparatus, electronic device, and storage medium
CN113205041A (zh) * 2021-04-29 2021-08-03 百度在线网络技术(北京)有限公司 Structured information extraction method, apparatus, device, and storage medium
CN113298079A (zh) * 2021-06-28 2021-08-24 北京奇艺世纪科技有限公司 Image processing method and apparatus, electronic device, and storage medium
CN113326766A (zh) * 2021-05-27 2021-08-31 北京百度网讯科技有限公司 Training method and apparatus for a text detection model, and text detection method and apparatus
CN113343970A (zh) * 2021-06-24 2021-09-03 中国平安人寿保险股份有限公司 Text image detection method, apparatus, device, and storage medium
CN113420174A (zh) * 2021-05-25 2021-09-21 北京百度网讯科技有限公司 Hard example mining method, apparatus, device, and storage medium
CN113963341A (zh) * 2021-09-03 2022-01-21 中国科学院信息工程研究所 Character detection system and method based on a multilayer-perceptron mask decoder
CN115358392A (zh) * 2022-10-21 2022-11-18 北京百度网讯科技有限公司 Training method for a deep learning network, and text detection method and apparatus
CN116092087A (zh) * 2023-04-10 2023-05-09 上海蜜度信息技术有限公司 OCR recognition method and system, storage medium, and electronic device
CN116503517A (zh) * 2023-06-27 2023-07-28 江西农业大学 Method and system for generating images from long text
CN116935393A (zh) * 2023-07-27 2023-10-24 中科微至科技股份有限公司 OCR-based parcel surface information extraction method and system
CN117934486A (zh) * 2024-03-25 2024-04-26 国网辽宁省电力有限公司电力科学研究院 Transformer component detection method and apparatus, electronic device, and storage medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414499B (zh) 2019-07-26 2021-06-04 第四范式(北京)技术有限公司 Text position locating method and system, and model training method and system
CN111091123A (zh) 2019-12-02 2020-05-01 上海眼控科技股份有限公司 Text region detection method and device
CN111259846B (zh) 2020-01-21 2024-04-02 第四范式(北京)技术有限公司 Text locating method and system, and text locating model training method and system
CN111582021B (zh) 2020-03-26 2024-07-05 平安科技(深圳)有限公司 Text detection method and apparatus for scene images, and computer device
CN113449722A (zh) 2020-03-27 2021-09-28 北京有限元科技有限公司 Method and apparatus for locating and recognizing text information in images
CN111950453B (zh) 2020-08-12 2024-02-13 北京易道博识科技有限公司 Arbitrary-shape text recognition method based on a selective attention mechanism
CN113033660B (zh) 2021-03-24 2022-08-02 支付宝(杭州)信息技术有限公司 Universal minority-language detection method, apparatus, and device
CN113762109B (зh) 2021-08-23 2023-11-07 北京百度网讯科技有限公司 Training method for a character locating model and character locating method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631426A (zh) * 2015-12-29 2016-06-01 中国科学院深圳先进技术研究院 Method and apparatus for text detection in pictures
CN106384112A (zh) * 2016-09-08 2017-02-08 西安电子科技大学 Fast image text detection method based on multi-channel multi-scale and cascaded filters
CN109117876A (zh) * 2018-07-26 2019-01-01 成都快眼科技有限公司 Construction method for a dense small-target detection model, the model, and detection method
CN109492638A (zh) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Text detection method and apparatus, and electronic device
CN109993040A (zh) * 2018-01-03 2019-07-09 北京世纪好未来教育科技有限公司 Text recognition method and apparatus
CN110414499A (zh) * 2019-07-26 2019-11-05 第四范式(北京)技术有限公司 Text position locating method and system, and model training method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070115510A1 (en) * 2005-11-18 2007-05-24 International Business Machines Corporation Marking images of text with speckle patterns for theft deterrence
KR20130124572A (ko) * 2005-12-30 2013-11-14 Steven Kays Genius adaptive design
US8705873B2 (en) * 2008-03-20 2014-04-22 Universite De Geneve Secure item identification and authentication system and method based on unclonable features
CN108108731B (zh) * 2016-11-25 2021-02-05 中移(杭州)信息技术有限公司 Text detection method and apparatus based on synthetic data
CN108549893B (zh) * 2018-04-04 2020-03-31 华中科技大学 End-to-end recognition method for scene text of arbitrary shape
CN108830192A (zh) * 2018-05-31 2018-11-16 珠海亿智电子科技有限公司 Vehicle and license plate detection method based on deep learning in vehicle-mounted environments
CN109325412B (zh) * 2018-08-17 2023-11-24 平安科技(深圳)有限公司 Pedestrian recognition method and apparatus, computer device, and storage medium
CN110310262A (zh) * 2019-06-19 2019-10-08 上海理工大学 Method, apparatus, and system for detecting tire defects


Also Published As

Publication number Publication date
CN110414499A (zh) 2019-11-05
CN113159016B (zh) 2024-06-18
CN113159016A (zh) 2021-07-23
CN110414499B (zh) 2021-06-04

Similar Documents

Publication Title
WO2021017998A1 (zh) Text position locating method and system, and model training method and system
WO2021147817A1 (zh) Text locating method and system, and text locating model training method and system
CN111062871B (zh) Image processing method and apparatus, computer device, and readable storage medium
US10242289B2 (en) Method for analysing media content
CN111259751B (zh) Video-based human behavior recognition method, apparatus, device, and storage medium
CN108875537B (zh) Object detection method, apparatus, system, and storage medium
CN107766349B (zh) Method, apparatus, device, and client for generating text
CN108875750B (zh) Object detection method, apparatus, system, and storage medium
US9690980B2 (en) Automatic curation of digital images
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
CN111160288A (zh) Gesture keypoint detection method, apparatus, computer device, and storage medium
CN106650743B (zh) Method and apparatus for detecting strong specular reflection in images
JP7242994B2 (ja) Video event identification method, apparatus, electronic device, and storage medium
CN111626163A (zh) Face liveness detection method and apparatus, and computer device
JP2023527615A (ja) Training method for a target object detection model, target object detection method, device, electronic device, storage medium, and computer program
US20230118361A1 (en) User input based distraction removal in media items
CN113205047A (zh) Drug name recognition method, apparatus, computer device, and storage medium
CN113160231A (zh) Sample generation method, sample generation apparatus, and electronic device
CN111199169A (zh) Image processing method and apparatus
CN115439543A (zh) Method for determining hole positions and method for generating three-dimensional models in the metaverse
US20230401809A1 (en) Image data augmentation device and method
CN111582012A (zh) Method and apparatus for detecting small-target ships
CN106469437B (zh) Image processing method and image processing apparatus
US11308150B2 (en) Mobile device event control with topographical analysis of digital images inventors
CN112949526A (zh) Face detection method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20848252; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20848252; Country of ref document: EP; Kind code of ref document: A1)