WO2019057169A1 - Text detection method, storage medium, and computer device - Google Patents

Text detection method, storage medium, and computer device

Info

Publication number
WO2019057169A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
region
sub
predicted
matrix
Prior art date
Application number
PCT/CN2018/107032
Other languages
English (en)
French (fr)
Inventor
刘铭 (Liu Ming)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2019057169A1
Priority to US16/572,171 (US11030471B2)

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • the present application relates to the field of computer technology, and in particular, to a text detection method, a storage medium, and a computer device.
  • A typical target object detection method can directly predict the candidate region where a target object is located and, based on part of the characteristics of the target object within the predicted candidate region, infer the object class and thereby detect the target object.
  • Text, however, differs from ordinary objects.
  • The boundary of text changes with the strokes, and there may be spaces between characters, so it is difficult to determine the class of a piece of text from only part of it. Traditional target object detection algorithms are therefore prone to false detections and missed detections caused by spaces in the text, and because the class of the entire text cannot be predicted from the partial text within a predicted candidate region, the accuracy of text localization is not high and the detection robustness is low.
  • a text detection method, a storage medium, and a computer device are provided.
  • a text detection method comprising:
  • the computer device acquires an image to be detected
  • the computer device inputs the image to be detected into a neural network model, and outputs a target feature matrix
  • the computer device inputs the target feature matrix to the fully connected layer, and the fully connected layer maps each element of the target feature matrix to a predicted image sub-region corresponding to the image to be detected according to a preset anchor point region;
  • the computer device acquires text feature information of the predicted image sub-region, connects the predicted image sub-region into a corresponding predicted text row according to the text feature information of the predicted image sub-region, and determines the text region corresponding to the image to be detected.
  • a computer device comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions being executed by the processor such that the processor performs the following steps:
  • the fully connected layer mapping each element of the target feature matrix to a predicted image sub-region corresponding to the image to be detected according to a preset anchor point region;
  • One or more non-volatile storage media storing computer readable instructions, when executed by one or more processors, cause one or more processors to perform the following steps:
  • the fully connected layer mapping each element of the target feature matrix to a predicted image sub-region corresponding to the image to be detected according to a preset anchor point region;
  • FIG. 1 is a flow chart of a text detecting method in an embodiment
  • FIG. 2 is a flow chart of a method for generating a target feature matrix in an embodiment
  • FIG. 3 is a flow chart of a method for generating a target feature matrix in another embodiment
  • FIG. 3A is a flowchart of acquiring text feature information of a predicted image sub-region in an embodiment
  • FIG. 4 is a flow chart of a method for generating a predicted text line in an embodiment
  • FIG. 5 is a flowchart of a text detection model training method in an embodiment
  • FIG. 6 is a schematic structural diagram of a text detecting method in an embodiment
  • FIG. 7 is a flow chart of a text detection method in a specific embodiment
  • Figure 8 is a block diagram showing the structure of a text detecting apparatus in an embodiment
  • FIG. 9 is a structural block diagram of a feature matrix generating module in an embodiment
  • FIG. 10 is a structural block diagram of a text area determining module in an embodiment
  • FIG. 11 is a structural block diagram of a text area determining module in another embodiment
  • Figure 12 is a block diagram showing the structure of a text detecting apparatus in another embodiment
  • Figure 13 is a block diagram showing the structure of a text detecting apparatus in still another embodiment
  • Figure 14 is a diagram showing the internal structure of a computer device in an embodiment
  • Figure 15 is a diagram showing the internal structure of a computer device in another embodiment.
  • the text detection method in the embodiment of the present application may be applied to a computer device.
  • the computer device may be an independent physical server, a terminal, or a server cluster composed of multiple physical servers, and may also be a cloud server providing basic cloud computing services such as cloud databases, cloud storage, and CDN.
  • the terminal can be a smart phone, a tablet, a laptop, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the display screen of the terminal may be a liquid crystal display or an electronic ink display screen
  • the input device of the computer device may be a touch layer covering the display screen, a button, a trackball, or a touchpad provided on the computer device casing, or an external keyboard, trackpad, or mouse.
  • the touch layer and display form a touch screen.
  • a text detection method including the following:
  • Step S110 acquiring an image to be detected.
  • the image to be detected refers to an image on which text detection is to be performed; it is detected whether the image to be detected includes a text region, and the position of that text region is determined.
  • the image to be detected may be various types of images such as an ID card, a business card, an advertisement picture, a video screenshot, and the like, and the scale of the text in the image to be detected may be arbitrary.
  • step S120 the image to be detected is input to the neural network model, and the target feature matrix is output.
  • the image to be detected may be input to a neural network model for feature extraction, and the extracted features are subjected to corresponding convolution processing to obtain a corresponding target feature matrix.
  • the neural network model may be used as a feature extractor to extract features from the image to be detected, and the extracted features are then input into different neural network models to output a target feature matrix. If a residual network is used for feature extraction, the number of layers of the residual network can be set as needed, and the extracted image features generally improve as the number of layers increases. Other network structures, such as VGG19, Res50, and ResNet101, may also be used to perform feature extraction on the image to be detected.
  • the extracted features are input to a memory network model for processing the output target feature matrix.
  • the dimension of the input image to be detected may be changed, and the feature dimension obtained by feature extraction of the image to be detected is also changed.
  • the target feature matrix can be thought of as a sequence that characterizes image feature values.
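As an illustrative sketch only (not part of the patent), the following shows how a truncated residual network might serve as the feature extractor that turns the image to be detected into a feature map; the choice of torchvision's ResNet-50, the truncation point, and the input size are assumptions.

```python
import torch
import torchvision

# Hypothetical feature extractor: a ResNet-50 backbone truncated after its
# third residual stage, giving a mid-level feature map at 1/16 resolution.
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,
)

image = torch.randn(1, 3, 600, 900)          # image to be detected (N, C, H, W)
with torch.no_grad():
    feature_map = feature_extractor(image)   # (1, 1024, ~H/16, ~W/16)
print(feature_map.shape)                     # torch.Size([1, 1024, 38, 57])
```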
  • Step S130 the target feature matrix is input to the fully connected layer, and the fully connected layer maps each element of the target feature matrix to the predicted image sub-region corresponding to the image to be detected according to the preset anchor point region.
  • the fully connected layer refers to a convolution layer, which can be realized by a convolution operation, and functions as a "classifier" in the convolutional neural network, and can map features to the sample space.
  • the anchor region determines the mapping range of the original image, and indicates the region of interest for the detection model. By performing multiple scale and aspect ratio transformations on the anchor region, it is possible to detect the multi-scale and aspect ratio text.
  • the width of the preset anchor point region is a fixed value. Setting the width of the anchor point area to a fixed value can detect the image to be detected within the range of the preset width, and the text in the horizontal direction is small in a small range, which can improve the accuracy of text detection.
  • the height value of the preset anchor point region can vary; for example, the height values are set to 7, 11, 18, 25, 35, 56, 67, 88, 100, 168, 278, and so on, so that through the varying height values the anchor point region can cover as many targets as possible in an actual scene.
  • the fully connected layer maps the features corresponding to the respective elements of the target feature matrix to the image to be detected according to the preset anchor point region, obtaining the image sub-region corresponding to each feature in the image to be detected.
  • since the width value of the anchor point region is fixed, the width of the image sub-region obtained by mapping the feature to the original image is also fixed, and only the height of the image sub-region needs to be predicted to determine the position information of each image sub-region.
  • because the anchor point region width is fixed, only the height value of the image sub-region needs to be predicted, which reduces the search space to be optimized by the model.
  • each element of the target feature matrix is mapped back to the original image to obtain the corresponding image sub-region, and text detection is performed on each image sub-region, realizing segmentation of the image to be detected: one original image is divided into several image sub-regions on which text detection is performed.
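For concreteness, a minimal sketch of generating fixed-width anchors at one mapped position; the 16-pixel width and the height list follow the example values in this description, while the function name and the centering convention are assumptions.

```python
# Minimal sketch: anchors with a fixed width and a set of varying heights,
# centered on a given pixel position of the image to be detected.
ANCHOR_WIDTH = 16                     # fixed width value from the description
ANCHOR_HEIGHTS = [7, 11, 18, 25, 35, 56, 67, 88, 100, 168, 278]

def anchors_at(cx, cy):
    """Return (x1, y1, x2, y2) boxes for every preset anchor height."""
    boxes = []
    for h in ANCHOR_HEIGHTS:
        boxes.append((cx - ANCHOR_WIDTH / 2.0, cy - h / 2.0,
                      cx + ANCHOR_WIDTH / 2.0, cy + h / 2.0))
    return boxes

# One feature-map element maps back to a 16-pixel-wide column of the image,
# so K = len(ANCHOR_HEIGHTS) anchors are attached to each mapped position.
print(len(anchors_at(8, 100)))        # -> 11
```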
  • Step S140 Acquire text feature information of the predicted image sub-region, and connect the predicted image sub-region to the corresponding predicted text row according to the text feature information of the predicted image sub-region, and determine the text region corresponding to the image to be detected.
  • the text feature information refers to information reflecting text attributes; the text attributes include the text position within the image and the text confidence, so the text feature information of a predicted image sub-region includes the text position information and the text confidence of that predicted image sub-region. The text position information can be determined by predicting the 2K vertical coordinate offsets and the 1K text-line horizontal boundary offsets, where K is the preset number of anchor points, which can be set in advance as needed.
  • Text confidence refers to the probability that the content contained in a predicted image sub-region is text.
  • a text clustering algorithm refers to an algorithm or a predefined rule that enables image sub-regions to be joined into corresponding text lines. If you enter the picture of the ID card, you can get the coordinates of the upper left and lower right corners of each character in the ID card and the confidence level.
  • the image sub-regions in the same text line are acquired and connected according to a certain rule, so that the plurality of image sub-regions are joined into the corresponding text line; the text region corresponding to the image to be detected is then determined as a whole in units of rows, which avoids false detections caused by spaces between the characters extracted from the individual image sub-region positions.
  • In the above text detection method, the image to be detected is input into the neural network model to obtain a target feature matrix, the target feature matrix is mapped through the fully connected layer to the predicted image sub-regions corresponding to the image to be detected according to the preset anchor point region, and the predicted image sub-regions are connected into predicted text lines to determine the text region of the image to be detected.
  • The text feature information reflects the text features of the predicted image sub-regions, so segmentation of the image to be detected is implemented and the text features of the image to be detected are detected through the individual predicted image sub-regions; further, according to the text feature information of the predicted image sub-regions and a text clustering algorithm, adjacent predicted image sub-regions are connected into corresponding text lines, realizing detection of text within a small range.
  • Since text usually changes little within a small range, the accuracy of detection is improved.
  • The text clustering algorithm connects adjacent predicted image sub-regions into corresponding text lines; because the predicted image sub-regions are merged, even if there are spaces in the text, adjacent predicted image sub-regions can be merged so that characters separated by spaces are combined into complete text, which improves the robustness of text detection.
  • step S120 includes:
  • Step S121 performing feature extraction on the image to be detected to obtain a first feature matrix, and the elements in the first feature matrix are two-dimensional elements.
  • the residual network is used as a multi-layer convolution feature extractor to perform feature extraction on the image to be detected, and a feature matrix obtained by multi-layer convolution is obtained.
  • the extracted elements in the feature matrix are two-dimensional elements, which can represent the position corresponding to the feature.
  • the number of layers of the residual network used may be set as needed, for example, it is set to 50 layers, and the feature image to be detected is detected by using Res50.
  • increasing the number of layers of the residual network generally improves the extracted image features, but beyond a certain number of layers, for example 152 layers, the improvement gradually becomes insignificant.
  • feature extraction of the image to be detected may also be performed using other network structures such as VGG19 and ResNet101.
  • Step S122 the first feature matrix is input into the bidirectional long-term and short-term memory network model to obtain a forward feature matrix and a backward feature matrix.
  • the long short-term memory network model refers to LSTM (Long Short-Term Memory), a recurrent neural network suited to sequence data.
  • the two-way long-term and short-term memory network model includes a forward long-term and short-term memory network model and a backward long-term and short-term memory network model.
  • the feature extraction of the image to be detected reflects local information of the image, while a word or sentence usually includes a plurality of strongly correlated characters; in order to reflect the global information of the image, the features extracted from the image are input into the LSTM to mine the sequence information contained in the text region and obtain the relationships between characters.
  • two long short-term memory network models are used to model the character sequence from left to right and from right to left, forming complete sequence information, and the feature matrices are used to reflect the corresponding sequence information.
  • the first feature matrix is input into the forward long short-term memory network model and the backward long short-term memory network model respectively. The forward model processes the first feature matrix to obtain a forward feature matrix, which reflects the forward sequence information; the backward model processes the first feature matrix to obtain a backward feature matrix, which reflects the backward sequence information. The sequence information represents the connection relationship of the image sub-regions corresponding to the feature elements.
  • step S123 the forward feature matrix and the backward feature matrix are spliced to obtain a target feature matrix.
  • the forward feature matrix and the backward feature matrix are spliced to obtain the target feature matrix. Since the forward feature matrix reflects the forward sequence information and the backward feature matrix reflects the backward sequence information, the target feature matrix can reflect the sequence information of the image sub-regions corresponding to the respective elements, that is, the connection relationship of the image sub-regions corresponding to the respective elements.
  • In this embodiment, feature extraction is first performed on the image to be detected, and the extracted features are then processed to obtain the target feature matrix; processing of the original image is converted into processing of the features corresponding to the original image, which greatly reduces the amount of information to be processed, and the image shares the feature extraction layer, avoiding repeated computation and improving the efficiency of information processing.
  • the two-way long-term and short-term memory network model is used to extract the forward and backward sequence information respectively, which can more completely reflect the relationship between feature elements and improve the accuracy of subsequent text area determination.
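A minimal sketch of the bidirectional processing described above, assuming PyTorch's LSTM with each row of the first feature matrix treated as a left-to-right sequence; the hidden size of 128 per direction and the tensor layout are assumptions.

```python
import torch
import torch.nn as nn

# Each row of the first feature matrix is treated as a sequence whose elements
# are the feature vectors of consecutive horizontal positions.
C, H, W = 1024, 38, 57
first_feature = torch.randn(H, W, C)       # first feature matrix (2-D grid of features)

bilstm = nn.LSTM(input_size=C, hidden_size=128, bidirectional=True, batch_first=True)
out, _ = bilstm(first_feature)             # (H, W, 2 * 128)

# Forward and backward feature matrices, then spliced into the target feature matrix.
forward_feat, backward_feat = out[..., :128], out[..., 128:]
target_feature = torch.cat([forward_feat, backward_feat], dim=-1)
print(target_feature.shape)                # torch.Size([38, 57, 256])
```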
  • step S122 includes:
  • Step S122A Acquire a current position of the current sliding window matrix, and calculate a current convolution result of the current sliding window matrix and the first characteristic matrix according to the current position, where the current sliding window matrix includes a forward sliding window matrix and a backward sliding window matrix.
  • the sliding window matrix refers to a matrix that can slide and convolve with the target matrix at each position of the sliding.
  • the sliding window matrix may be a convolution kernel set according to requirements.
  • the scale of the sliding window matrix may be determined by setting a corresponding sliding window scale; for example, if the sliding window size corresponding to the sliding window matrix is set to 3×3, the sliding window matrix is a 3×3 matrix.
  • the first feature matrix is input to the forward long-term and short-term memory network model and the backward long-term and short-term memory network model respectively, the features extracted by the forward and backward long-term and short-term memory network models are different, that is, the first feature The matrix convolution results are different. Therefore, in the forward long-term and short-term memory network model and the backward long-term and short-term memory network model, different sliding window matrices are respectively convoluted with the first feature matrix to obtain a target feature matrix. Further, the sliding window matrix is convolved with the first feature matrix at different positions to obtain different convolution results, and the current position of the sliding window matrix is obtained, and the first feature matrix overlaps with the sliding matrix when the sliding window matrix is at the current position. Partial convolution with the sliding window matrix yields the corresponding convolution result.
  • Step S122B The internal state value corresponding to the current position of the long-term and short-term memory network model is obtained by using the activation function according to the internal state value of the long-short-term memory network model corresponding to the previous position of the current sliding window matrix.
  • the activation function refers to a function for updating neural network parameters.
  • the internal state value corresponding to the current position of the long-short-term memory network model is calculated by using the corresponding convolution result of the sliding window matrix at the current position and the internal state value of the neural network model corresponding to the previous position.
  • Specifically, the activation function is used to recurrently update the internal state value H_t corresponding to the current position of the long short-term memory network model: H_t = φ(H_{t-1}, X_t), where X_t represents the convolution result generated by the sliding window at the position corresponding to time t with the first feature matrix, H_{t-1} represents the internal state value of the long short-term memory network model at time t-1, and H_t ∈ R^256, where R represents the set of real numbers.
  • Step S122C sliding the current sliding window matrix to obtain the next position, and proceeding to step S122A until the current sliding window matrix traverses the elements of the first feature matrix.
  • the sliding window matrix can slide on the first feature matrix, moving one pixel position at a time; each position the sliding window matrix slides to corresponds to one convolution result, and the internal state value of the neural network model corresponding to the sliding window matrix at the current position is obtained.
  • the current sliding window matrix is then slid to the next position, the process returns to step S122A, and the internal state value of the neural network model corresponding to the new current position of the sliding window matrix is calculated; the above process is repeated until the current sliding window matrix has traversed the elements of the first feature matrix, obtaining the internal state values of the neural network model corresponding to each position of the current sliding window.
  • when the width of the preset anchor point region is set to a fixed value such as 16, each one-pixel slide of the sliding window matrix on the first feature matrix corresponds to 16 pixels in the image to be detected.
  • Step S122D Process the internal state values corresponding to the respective current sliding window matrices at different positions to generate a current feature matrix.
  • the internal state value corresponding to the long-term and short-term memory network model is an intermediate result of processing the first feature matrix by the long-term and short-term memory network model, and further processing or convolution of the internal state value is required to generate a corresponding current feature matrix.
  • the current feature matrix includes a forward feature matrix and a backward feature matrix, and the forward feature matrix and the backward feature matrix are spliced into a target feature matrix output.
  • In this embodiment, different sliding window matrices are convolved with the first feature matrix in the forward and backward long short-term memory network models respectively; the convolution result corresponding to each position of the sliding window matrix is obtained in each model, the activation function is used to calculate the internal state value corresponding to the long short-term memory network model at each position, and the corresponding current feature matrix is obtained from the resulting internal state values.
  • the width value of the preset anchor point area is a fixed value
  • the step of acquiring the text feature information of the predicted image sub-area includes:
  • Step S141 Acquire a horizontal position of each predicted image sub-region according to a width value of the preset anchor point region and a first dimension coordinate corresponding to each element of the target feature matrix.
  • the width value of the preset anchor point area is a fixed value, and the width value can be set empirically, such as set to 16 pixels.
  • since the width value of the preset anchor point region is determined, the width value of each predicted image sub-region mapped to the image to be detected through the fully connected layer is fixed, and the preset anchor point region position is fixed; therefore, according to the position of each element of the target feature matrix mapped through the fully connected layer and the preset anchor point region, the horizontal position of each predicted image sub-region in the original image can be determined.
  • Step S142: Acquire a vertical direction prediction offset of each predicted image sub-region, and calculate, according to the vertical direction prediction offset, the corresponding preset anchor point region height value, and the center coordinate vertical component, the predicted height value and the actual center point vertical component of each predicted image sub-region.
  • In one embodiment, a text detection model is used to perform text detection on the image to be detected, and the text detection model is trained in advance, so that in the process of processing the image to be detected the model can predict the vertical regression targets corresponding to each predicted image sub-region; from these, the predicted height value and the actual center point vertical component of each predicted image sub-region are obtained. Regression targets of the following form are used:
  • v_c = (c_y − c_y_a) / h_a,  v_h = log(h / h_a)
  • where v_c represents the predicted value of the regression target of the vertical component of the center point of the text block, c_y represents the vertical component of the center point of the predicted text block, c_y_a and h_a represent the vertical center and the height of the corresponding preset anchor point region, v_h represents the predicted value of the height regression target of the text block, and h represents the height of the predicted text block.
  • Through the 2K vertical coordinate offset prediction task, the predicted vertical offsets v_c and v_h corresponding to each predicted image sub-region can be obtained, and the actual center point vertical component c_y and the height h of the predicted image sub-region can then be derived inversely from the above formulas.
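For illustration, a minimal numeric sketch of that inversion, assuming made-up anchor and offset values (c_y_a = 120, h_a = 35, v_c = 0.2, v_h = 0.1); only the two relations above are taken from the description.

```python
import math

# Assumed example values: one preset anchor and its predicted vertical offsets.
c_y_a, h_a = 120.0, 35.0        # vertical center and height of the preset anchor region
v_c, v_h = 0.2, 0.1             # predicted regression targets

# Inverting v_c = (c_y - c_y_a) / h_a and v_h = log(h / h_a):
c_y = v_c * h_a + c_y_a         # actual vertical component of the center point
h = math.exp(v_h) * h_a         # actual height of the predicted text block
print(round(c_y, 2), round(h, 2))   # 127.0 38.68
```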
  • Step S143 determining text position information of each predicted image sub-area according to the horizontal position, the predicted height value, and the actual offset of the center point vertical direction.
  • after the horizontal position, the predicted height value, and the center point vertical direction offset are obtained, the coordinates of each predicted image sub-region in the image to be detected are determined according to the preset anchor point region position, thereby determining the text position information of each predicted image sub-region.
  • In this embodiment, the horizontal position, the height value, and the vertical direction offset corresponding to each predicted image sub-region are obtained by model prediction, the coordinates of each predicted image sub-region in the image to be detected are determined, and thus the text position information of each predicted image sub-region is determined, which provides a basis for subsequently connecting the predicted image sub-regions into text lines. Since the width of the preset anchor point region is a fixed value, detecting text within the preset horizontal range is more feasible; further, only the height value of the predicted image sub-region needs to be predicted, which reduces the search space to be optimized by the model.
  • the text feature information includes text location information
  • step S140 includes:
  • Step S140A Taking each predicted image sub-region as a candidate text sub-region, acquiring first text position information corresponding to the current candidate text sub-region.
  • the candidate text sub-region refers to a sub-region predicted as text in the image to be detected, and the predicted image sub-region mapped to the original image according to the target feature matrix is used as the candidate text sub-region.
  • the current candidate text sub-region may be an arbitrarily selected one of the candidate text sub-regions, and the location information corresponding to the text sub-region is obtained.
  • Step S140B: According to the first text position information, acquire a target candidate text sub-region whose horizontal distance from the current candidate text sub-region is smaller than a preset distance threshold and whose vertical overlap with it is greater than a preset overlap degree; the target candidate text sub-region closest to the current candidate text sub-region is used as the adjacent candidate text sub-region.
  • each independent predicted image sub-region represents a feature of each position of the image
  • the text corresponding to one image sub-region may not be complete text, and adjacent texts need to be combined to accurately predict the complete text.
  • the text is generally in the text line unit, and the two adjacent image sub-regions in the same text line are close to each other. Therefore, the adjacent candidate text sub-region corresponding to the current candidate text sub-region is obtained by setting the conditions of the horizontal direction and the vertical direction. .
  • the distance threshold of two candidate text sub-regions in the horizontal direction is set in advance; the distance threshold may be set according to experience or according to the position information of each predicted image sub-region. The overlap degree of two candidate text sub-regions in the vertical direction is also preset: since the text sub-regions in the same text line lie essentially on the same straight line, they should have a high degree of overlap in the vertical direction, and the value of the overlap degree can be set according to experience. For example, the distance threshold in the horizontal direction is set to 50 pixels in advance, and the overlap degree in the vertical direction is set to 0.7.
  • the nearest target candidate text sub-region serves as the adjacent candidate text sub-region of the current candidate text sub-region.
  • Step S140C: Acquire the next candidate text sub-region corresponding to the current candidate text sub-region, use the next candidate text sub-region as the current candidate text sub-region, and return to the step of acquiring the first text position information corresponding to the current candidate text sub-region, until the candidate text sub-regions are traversed.
  • the process of determining the adjacent candidate text sub-regions is repeated until each candidate text sub-region is used as the current candidate text sub-region, until the adjacent candidate text sub-region corresponding to each candidate text sub-region is determined.
  • Step S140D the current candidate text sub-region and the corresponding adjacent candidate text sub-region are connected into a corresponding predicted text row.
  • each candidate text sub-region is connected to the corresponding adjacent candidate text sub-region, so that the candidate text sub-regions corresponding to the same text row can be connected to each other to obtain a corresponding text row region.
  • the text area in the image to be detected is determined in units of rows.
  • In this embodiment, the adjacent candidate text sub-region corresponding to each candidate text sub-region is obtained through preset conditions, and each candidate text sub-region is connected with its adjacent candidate text sub-region to predict the text lines corresponding to the image to be detected.
  • the text region of the image to be detected is thus reflected in units of rows, which avoids the problem that the text information obtained from a single candidate text sub-region is incomplete, and reflects the text region of the image to be detected more accurately.
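As an illustration of the connection step, a minimal sketch that groups candidate text sub-regions into predicted text lines using the example thresholds mentioned above (horizontal distance below 50 pixels, vertical overlap above 0.7); the left-to-right greedy grouping strategy and the box format are assumptions, not taken from the patent.

```python
# Minimal sketch of connecting candidate text sub-regions into predicted text
# lines. Boxes are (x1, y1, x2, y2); thresholds follow the example values.
DIST_THRESH, OVERLAP_THRESH = 50, 0.7

def vertical_overlap(a, b):
    inter = min(a[3], b[3]) - max(a[1], b[1])
    union = max(a[3], b[3]) - min(a[1], b[1])
    return max(inter, 0) / union if union > 0 else 0.0

def connect_text_lines(boxes):
    boxes = sorted(boxes, key=lambda b: b[0])   # left-to-right
    lines, current = [], [boxes[0]]
    for box in boxes[1:]:
        prev = current[-1]
        if box[0] - prev[2] < DIST_THRESH and vertical_overlap(prev, box) > OVERLAP_THRESH:
            current.append(box)                 # adjacent candidate: same text line
        else:
            lines.append(current)
            current = [box]
    lines.append(current)
    # A predicted text line is the bounding box of its connected sub-regions.
    return [(min(b[0] for b in l), min(b[1] for b in l),
             max(b[2] for b in l), max(b[3] for b in l)) for l in lines]

print(connect_text_lines([(0, 10, 16, 45), (18, 11, 34, 44), (300, 200, 316, 240)]))
# -> [(0, 10, 34, 45), (300, 200, 316, 240)]
```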
  • the text feature information includes a text confidence
  • the step of using each of the predicted image sub-regions as the candidate text sub-region includes: acquiring a text confidence corresponding to each of the predicted image sub-regions; and each predicted image according to the text confidence The region performs non-maximum suppression, and obtains a predicted image sub-region with a text confidence greater than a preset text confidence as a candidate text sub-region.
  • the detector detects the text confidence corresponding to each predicted image sub-region and determines the probability that each predicted image sub-region is a text sub-region. Since the image to be detected is mapped through the fully connected layer, there may be a plurality of predicted image sub-regions corresponding to each anchor point region; in order to better generate the corresponding text lines, the predicted image sub-regions are filtered to obtain predicted image sub-regions whose text confidence is greater than a preset text confidence.
  • For example, the preset text confidence is set to 0.7; the predicted image sub-regions whose text confidence is greater than 0.7 are obtained according to the text confidence corresponding to each predicted image sub-region, and the predicted image sub-regions satisfying the condition are used as candidate text sub-regions for the subsequent text line connection operation.
  • In this embodiment, before acquiring the adjacent candidate text sub-region corresponding to each predicted image sub-region according to the preset conditions, the predicted image sub-regions are filtered according to the text confidence, and the predicted image sub-regions whose text confidence exceeds the preset text confidence are used as candidate text sub-regions, which reduces the computation time for acquiring adjacent text sub-regions, improves the accuracy of the predicted text lines, and improves the accuracy of subsequent text recognition results.
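A minimal sketch of the confidence filtering combined with a standard greedy non-maximum suppression, assuming the 0.7 preset text confidence mentioned above; the IoU threshold of 0.5 and the particular NMS variant are assumptions.

```python
import numpy as np

# Keep predicted image sub-regions whose text confidence exceeds the preset
# confidence, then apply greedy NMS. Boxes are (x1, y1, x2, y2).
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def candidate_text_subregions(boxes, scores, conf_thresh=0.7, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]                       # highest confidence first
    order = [i for i in order if scores[i] > conf_thresh]  # confidence filtering
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)                                 # suppress heavy overlaps
    return [boxes[i] for i in keep]

boxes = [(0, 10, 16, 45), (1, 11, 17, 46), (30, 10, 46, 44)]
scores = [0.95, 0.80, 0.60]
print(candidate_text_subregions(boxes, scores))   # -> [(0, 10, 16, 45)]
```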
  • the method further includes: acquiring a predicted horizontal direction offset corresponding to each predicted text line, and correcting a horizontal boundary of the predicted text line according to the predicted horizontal boundary offset.
  • In one embodiment, the text detection model is used to perform text detection on the image to be detected, and the text detection model is trained in advance, so that in the process of processing the image to be detected the model can predict the horizontal boundary offset corresponding to each predicted text line; the actual horizontal boundary offset corresponding to each predicted text line is then derived inversely from the predicted offset. For example, the horizontal offset corresponding to each text line is obtained with a formula of the form:
  • o = (x_side − c_x_a) / w_a
  • where o represents the predicted horizontal-direction offset regression target, x_side represents the predicted left (or right) offset of the current subdivided text block relative to the original unsplit text block, c_x_a represents the horizontal center of the corresponding boundary anchor point, and w_a represents the fixed anchor point width.
  • Through the 1K text-line horizontal offset prediction task, the predicted horizontal boundary offset o corresponding to each predicted text line can be obtained, and the actual horizontal offset x_side corresponding to each predicted text line can be derived inversely from the above formula.
  • In this embodiment, the width of the text line determined from the candidate text sub-regions is a multiple of the preset anchor point width, but the width of the real text line is not necessarily such a multiple; by predicting the horizontal direction offset of the text line, the difference between the predicted boundary and the true value of the calibrated text line boundary is corrected, which improves the accuracy of the text region predicted for the image to be detected.
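A minimal sketch of applying that boundary correction, inverting o = (x_side − c_x_a) / w_a as written above; the concrete anchor center, offset, and box values are made-up examples.

```python
# Correcting a horizontal boundary of a predicted text line from the predicted
# offset o, using the fixed anchor width w_a.
w_a = 16.0                 # fixed anchor / text candidate region width
c_x_a = 8.0                # horizontal center of the boundary anchor (assumed example)
o_left = -0.3              # predicted horizontal offset regression value

x_side = o_left * w_a + c_x_a      # refined left boundary of the text line
text_line = [0.0, 10.0, 340.0, 45.0]
text_line[0] = x_side              # replace the coarse, anchor-aligned boundary
print(text_line)                   # [3.2, 10.0, 340.0, 45.0]
```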
  • In one embodiment, before step S110, the method further includes:
  • Step S210 Acquire model training data, where the model training data includes a sample image region set of a preset size ratio.
  • the sample image region set refers to a set of sample image regions participating in the model training, and the sample image region may be obtained by randomly collecting a partial region in the image in the image library.
  • the image library includes a plurality of images, and a part of the image in the image library is randomly acquired, and a large amount of training data can be acquired, and the model is trained multiple times.
  • the obtained sample image region is scaled to a preset size, such as 600, while keeping its aspect ratio, to ensure the consistency of the sample image region sizes and to facilitate feature extraction and analysis.
  • the number of samples per model training data may be set to 128, and the ratio of positive and negative samples is 1:1.
  • a positive sample refers to a sample image area containing text
  • a negative sample refers to a sample image area that does not contain text.
  • Step S220: Feature extraction is performed on the sample image region set and input to the initialized neural network model, where the initialized neural network model is obtained by initializing the neural network model with Gaussian-distributed random numbers of a preset mean and variance.
  • the neural network model is initialized with Gaussian-distributed random numbers of the preset mean and variance, yielding the initialized neural network model to be optimized.
  • For example, Gaussian-distributed random numbers with a mean value of 0 are used to initialize the neural network model.
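A minimal sketch of such an initialization in PyTorch, assuming the zero-mean Gaussian stated above; the standard deviation of 0.01 and the stand-in layers are assumptions.

```python
import torch.nn as nn

# Initialize learnable weights with zero-mean Gaussian random numbers.
# The standard deviation (0.01) is assumed; the text only fixes a mean of 0.
def init_gaussian(module, std=0.01):
    for p in module.parameters():
        if p.dim() > 1:                      # weight matrices / conv kernels
            nn.init.normal_(p, mean=0.0, std=std)
        else:                                # biases
            nn.init.zeros_(p)

model = nn.Sequential(nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(), nn.Conv2d(512, 256, 1))
init_gaussian(model)
```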
  • the neural network model is trained together with the feature extractor; the feature extractor is used to extract features from the sample image region set, the extracted features are input into the neural network model for processing, and the global feature information of the sample image regions can be obtained.
  • Feature extraction is performed on each sample image region in the sample image region set, and input to the initialization neural network model, so that the initialized neural network model processes the extracted features, obtains corresponding sequence information, and outputs a corresponding feature matrix.
  • Step S230 acquiring a feature matrix that initializes the output of the neural network model, and mapping the feature matrix to the corresponding sample image region through the fully connected layer to obtain a corresponding sample image sub-region.
  • the feature matrix output according to the initialization neural network model is mapped to the corresponding sample image region through the fully connected layer, each sample image region corresponds to one feature matrix, and the feature matrix is mapped to the corresponding sample image region to obtain a corresponding sample image region. region.
  • Step S240 Acquire text feature information corresponding to each sample image sub-region, and obtain a predicted text row according to the text feature information and the preset text clustering algorithm.
  • the text feature information corresponding to each sample image sub-region can be obtained by classification or regression; for example, the vertical-direction offset of the center point corresponding to each predicted sample image sub-region is trained with regression targets of the form:
  • v_c = (c_y − c_y_a) / h_a,  v_h = log(h / h_a)
  • v_c* = (c_y* − c_y_a) / h_a,  v_h* = log(h* / h_a)
  • where v_c represents the predicted value of the regression target of the vertical component of the center point of the text block, c_y represents the vertical component of the center point of the predicted text block, c_y_a and h_a represent the vertical center and the height of the corresponding preset anchor point region, v_h represents the predicted value of the height regression target of the text block, and h represents the height of the predicted text block;
  • v_c* represents the true value of the regression target of the vertical component of the center point of the text block, c_y* represents the true value of the vertical component of the center point of the text block, v_h* represents the true value of the height regression target, and h* represents the true height of the text block.
  • v_c is the vertical-direction offset of the center point of each predicted sample image sub-region predicted during model training, and v_c* is computed from the true vertical component of the center point of the sample image sub-region. Each parameter in v_c is supervised by the corresponding parameter in v_c* during training, so that the predicted values are made as close as possible to the true values; when an image to be detected is then processed, the predicted vertical offset of the center point is more accurate.
  • In one embodiment, the predicted text line horizontal boundary offset is trained with regression targets of the form:
  • o = (x_side − c_x_a) / w_a,  o* = (x_side* − c_x_a) / w_a
  • where o represents the predicted horizontal-direction offset regression target, x_side represents the predicted left offset of the current subdivided text block relative to the original unsplit text block, c_x_a represents the horizontal center of the current anchor point, w_a represents the width of the current anchor point / text candidate region and is a fixed value, and o* represents the true value of the regression target of the left offset of the current subdivided text block relative to the original uncut text block, with x_side* being the corresponding true offset.
  • text feature information corresponding to each sample image region is obtained, the text feature information including text position information and text confidence, and the predicted text line corresponding to each sample image region is obtained according to the text feature information of the sample image sub-regions belonging to that region; that is, the text line region corresponding to the sample image region is obtained by training on a set of sample image regions, and the text detection model parameters are adjusted according to the real data of the sample image regions.
  • Step S250: Return to step S210 for further training, optimize the text detection model according to a preset momentum term and a preset weight decay value, and obtain the target text detection model according to the target optimization function.
  • the momentum term is a parameter that maintains the stability of the model during optimization,
  • and the weight decay value is a parameter that prevents overfitting.
  • the model is optimized with SGD (stochastic gradient descent) according to the preset momentum term and weight decay.
  • For example, the momentum term is set to 0.9 and the weight decay to 0.0005.
  • Setting the preset momentum term prevents jitter during the training process, improves stability during model optimization, and avoids oscillation around extreme points.
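A minimal sketch of the corresponding optimizer configuration, assuming PyTorch's SGD; the momentum 0.9 and weight decay 0.0005 follow the example above, while the learning rate and stand-in model are assumptions.

```python
import torch

# SGD with the momentum and weight decay values given above.
model = torch.nn.Linear(256, 2)   # stand-in for the text detection model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,             # assumed; not fixed by the text at this point
    momentum=0.9,         # preset momentum term
    weight_decay=0.0005,  # preset weight decay
)
```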
  • For example, the objective function for model optimization is set with the following form:
  • L(s_i, v_j, o_k) = (1/N_s) Σ_i L_cl(s_i, s_i*) + (λ1/N_v) Σ_j L_re_v(v_j, v_j*) + (λ2/N_o) Σ_k L_re_o(o_k, o_k*)
  • where L(s_i, v_j, o_k) represents the global optimization objective function, L_cl is the text classification loss, and L_re_v and L_re_o are the regression losses of the text localization and boundary optimization tasks;
  • s_i represents the probability that the i-th anchor point is predicted as text, and s_i* indicates whether the i-th anchor point is truly text;
  • v_j represents the predicted vertical direction coordinate of the j-th anchor point, and v_j* represents the true value of the j-th anchor point vertical direction coordinate;
  • o_k represents the predicted horizontal offset of the k-th boundary anchor point relative to the boundary, and o_k* represents the true value of that horizontal offset;
  • λ1 and λ2 are the weighting factors of the text localization task and the boundary optimization task, respectively;
  • N_s, N_v, and N_o represent the numbers of anchors used for the text classification, text localization, and boundary optimization tasks in each training batch.
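A minimal sketch of a multi-task objective of the form defined above, assuming cross-entropy for the text classification term, smooth-L1 for the two regression terms, and illustrative weights λ1 and λ2; none of these concrete choices are specified by the description.

```python
import torch
import torch.nn.functional as F

# L = (1/Ns) * sum L_cl + (lam1/Nv) * sum L_re_v + (lam2/No) * sum L_re_o
# The mean reductions below perform the 1/N normalizations.
def detection_loss(scores, labels, v_pred, v_true, o_pred, o_true, lam1=1.0, lam2=2.0):
    cls_loss = F.cross_entropy(scores, labels)     # text / non-text classification
    loc_loss = F.smooth_l1_loss(v_pred, v_true)    # vertical coordinate regression
    side_loss = F.smooth_l1_loss(o_pred, o_true)   # boundary offset regression
    return cls_loss + lam1 * loc_loss + lam2 * side_loss

scores = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
v_pred, v_true = torch.randn(8, 2), torch.randn(8, 2)
o_pred, o_true = torch.randn(8), torch.randn(8)
print(detection_loss(scores, labels, v_pred, v_true, o_pred, o_true).item())
```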
  • the text detection model is optimized according to the target optimization function, and each parameter corresponding to the text detection model is determined, and the target text detection model after training is obtained, and the input image to be detected is subjected to text detection.
  • The text detection model is trained by acquiring sample image regions as model training data, using the text detection model to perform text detection on the sample image regions, and repeating the training process; the momentum term, the weight decay, and the learning rate are preset, and the target optimization function is established to optimize the text detection model until its parameters are finally determined, yielding an optimized text detection model for text prediction on actual images to be detected.
  • The text detection model is continuously trained and optimized, the neural network model and the feature extractor are trained together, and the extracted features are further processed to obtain global text information of the sample image regions, thereby improving the accuracy with which the text detection model predicts the text region in the image to be detected.
  • Referring to FIG. 6, a schematic architecture diagram of the text detection method in one embodiment is shown.
  • Feature extraction is performed on the image to be detected using the 50-layer residual network 600, the res4f feature 610 is obtained through the multi-layer convolution network, and the res4f feature is input into the bidirectional long short-term memory network LSTM 620 to establish a sequence of text candidate regions.
  • The text candidate region sequence is then subjected to feature mapping through the fully connected layer FC 630, and 2K vertical coordinate offsets, 2K text confidences, and 1K boundary optimization values are predicted according to the mapping result, where K is the number of anchor points at each pixel of res4f.
  • The text candidate region position information is determined from the predicted vertical coordinate offsets and horizontal boundary offsets, and whether a candidate region is a text region is determined according to the predicted text confidence, completing prediction of the text region in the image to be detected.
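A minimal sketch of such a prediction layer, assuming it is realized as 1×1 convolutions over the spliced BiLSTM feature map; the channel sizes and K = 11 (matching the example height list earlier) are assumptions.

```python
import torch
import torch.nn as nn

K = 11   # number of anchors per position (matches the example height list)

# Prediction head over the spliced BiLSTM output: 2K vertical coordinate values,
# 2K text confidences, and 1K boundary optimization values per position.
class TextPredictionHead(nn.Module):
    def __init__(self, in_channels=256, k=K):
        super().__init__()
        self.vertical = nn.Conv2d(in_channels, 2 * k, kernel_size=1)   # 2K vertical coords
        self.score = nn.Conv2d(in_channels, 2 * k, kernel_size=1)      # 2K text confidences
        self.side = nn.Conv2d(in_channels, 1 * k, kernel_size=1)       # 1K boundary offsets

    def forward(self, x):
        return self.vertical(x), self.score(x), self.side(x)

features = torch.randn(1, 256, 38, 57)           # spliced BiLSTM output, (N, C, H, W)
v, s, o = TextPredictionHead()(features)
print(v.shape, s.shape, o.shape)                 # (1, 22, 38, 57) (1, 22, 38, 57) (1, 11, 38, 57)
```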
  • a text detection algorithm including the following:
  • Step S301 acquiring an image to be detected.
  • Step S302 performing feature extraction on the image to be detected to obtain a first feature matrix, and inputting the first feature matrix into the bidirectional long-term and short-term memory network model.
  • Step S303 Acquire a current position of the current sliding window matrix, and calculate a current convolution result of the current sliding window matrix and the first characteristic matrix according to the current position, where the current sliding window matrix includes a forward sliding window matrix and a backward sliding window matrix.
  • Step S304 The internal state value corresponding to the current position of the neural network model is obtained by using an activation function according to an internal state value of the neural network model corresponding to the previous position of the current sliding window matrix by the current convolution result.
  • Step S305 sliding the current sliding window matrix to obtain the next position, and proceeding to step S303 until the current sliding window matrix traverses the elements of the first characteristic matrix.
  • Step S306 processing internal state values corresponding to different current sliding window matrices at different positions to generate a current feature matrix, where the current feature matrix includes a forward feature matrix and a backward feature matrix.
  • Step S307: The forward feature matrix and the backward feature matrix are spliced to obtain a target feature matrix, the target feature matrix is output to the fully connected layer, and the fully connected layer maps each element of the target feature matrix to the predicted image sub-region corresponding to the image to be detected according to the anchor point region of the preset width.
  • Step S308 acquiring text feature information of the predicted image sub-region, the text feature information including text confidence and text position information.
  • Step S309 performing non-maximum suppression on each predicted image sub-region according to the text confidence, and using the predicted image sub-region whose text confidence is greater than the preset text confidence as the candidate text sub-region.
  • Step S310: Acquire first text position information corresponding to the current candidate text sub-region, and acquire, according to the first text position information, a target candidate text sub-region whose horizontal distance from the current candidate text sub-region is less than the preset distance threshold and whose vertical direction overlap degree is greater than the preset overlap degree.
  • Step S311 the target candidate text sub-region closest to the current candidate text sub-region is used as the adjacent candidate text sub-region.
  • Step S312 Acquire the next candidate text sub-region corresponding to the current candidate text sub-region as the current candidate text sub-region, and proceed to step S310 until the candidate text sub-region is traversed.
  • Step S313 the candidate text sub-region and the corresponding adjacent candidate text sub-region are connected into a corresponding predicted text row, and the predicted text row is subjected to boundary correction to determine a text region corresponding to the image to be detected.
  • In the above text detection method, feature extraction is first performed on the image to be detected, the extracted features are then input into the bidirectional long short-term memory network model to obtain a target feature matrix, and the target feature matrix is mapped through the fully connected layer to the predicted image sub-regions corresponding to the image to be detected according to the preset anchor point region.
  • Performing feature extraction on the image to be detected and then processing the extracted features with the bidirectional long short-term memory network model reduces the dimensionality of image processing and improves computational efficiency; the obtained target feature matrix is mapped through the fully connected layer to the image to be detected to obtain the corresponding image sub-regions, whose text feature information is acquired, which implements segmentation of the image to be detected and detects the text features of the image to be detected through the individual predicted image sub-regions.
  • Because the preset anchor point region has a fixed width value, the width of each obtained predicted image sub-region is fixed and text is detected within a small range; since text usually changes little within a small range, the accuracy of detection is improved. The text clustering algorithm connects adjacent predicted image sub-regions into corresponding text lines; because the predicted image sub-regions are merged, even if there are spaces in the text, adjacent predicted image sub-regions can be merged so that characters separated by spaces are combined into complete characters, which improves the robustness of text detection.
  • the various steps in the various embodiments of the present application are not necessarily performed in the order indicated by the steps. Except as explicitly stated herein, the execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be executed at different times, and the execution of these sub-steps or stages The order is also not necessarily sequential, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of the other steps.
  • providing a text detecting apparatus includes:
  • the obtaining module 810 is configured to acquire an image to be detected.
  • the feature matrix generation module 820 is configured to input the image to be detected into the neural network model and output the target feature matrix.
  • the text sub-area obtaining module 830 is configured to input the target feature matrix to the fully connected layer, and the full connection layer maps each element of the target feature matrix to the predicted image sub-region corresponding to the image to be detected according to the preset anchor point region.
  • the text area determining module 840 is configured to acquire text feature information of the predicted image sub-regions, connect the predicted image sub-regions into corresponding predicted text lines according to the text feature information through a text clustering algorithm, and determine the text region corresponding to the image to be detected.
  • The text detecting apparatus inputs the image to be detected into the neural network model to obtain a target feature matrix, maps the target feature matrix through the fully connected layer to the image sub-regions of the image to be detected according to the preset anchor regions, and connects the image sub-regions into predicted text lines to determine the text region of the image to be detected.
  • The text feature information reflects the text features of the predicted image sub-regions, which implements a segmentation of the image to be detected: the text features of the image are detected per predicted image sub-region, so text is detected over a small range. Since text usually changes little within a small range, the accuracy of the detection is improved.
  • The text clustering algorithm connects adjacent predicted image sub-regions into corresponding text lines. Since the predicted image sub-regions are merged, even if there are spaces in the text, merging adjacent predicted image sub-regions combines characters containing spaces into complete characters, which improves the robustness of text detection.
  • the feature matrix generation module 820 is further configured to perform feature extraction on the image to be detected to obtain a first feature matrix whose elements are two-dimensional elements, input the first feature matrix into the bidirectional long short-term memory network model to obtain a forward feature matrix and a backward feature matrix, and splice the forward feature matrix and the backward feature matrix to obtain the target feature matrix.
  • the feature matrix generation module 820 includes:
  • the convolution module 821 is configured to acquire the current position of the current sliding window matrix and calculate the current convolution result of the current sliding window matrix and the first feature matrix according to the current position, where the current sliding window matrix includes a forward sliding window matrix and a backward sliding window matrix.
  • the updating module 822 is configured to obtain, using an activation function, the internal state value of the long short-term memory network model at the current position from the current convolution result and the internal state value of the long short-term memory network model at the previous position of the current sliding window matrix.
  • the first loop module 823 is configured to slide the current sliding window matrix to the next position and return to the step of acquiring the current position of the current sliding window matrix until the current sliding window matrix has traversed the elements of the first feature matrix.
  • the generating module 824 is configured to process the internal state values of each current sliding window matrix at the different positions to generate the current feature matrix. A simplified sketch of this sliding-window recurrence is given below.
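The sketch below illustrates the recurrence just described; it uses a plain sigmoid state update rather than a full gated LSTM cell, and the weight arrays w_x and w_h are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scan_feature_matrix(feat, win, w_x, w_h, reverse=False):
    """Slide a window over the columns of `feat` (C x W), convolve it with the
    window matrix `win` (C x K), and update a recurrent state with a sigmoid
    activation. w_x (D,) and w_h (D, D) are assumed weights, not from the patent."""
    c, w = feat.shape
    k = win.shape[1]
    hidden = np.zeros(w_h.shape[0])
    states = []
    cols = range(w - k, -1, -1) if reverse else range(0, w - k + 1)
    for t in cols:
        x_t = np.sum(feat[:, t:t + k] * win)          # convolution at the current position
        hidden = sigmoid(w_x * x_t + w_h @ hidden)     # state update via the activation function
        states.append(hidden.copy())
    if reverse:
        states.reverse()
    return np.stack(states)                            # one state vector per window position

# The forward and backward scans use different window matrices; their outputs
# are concatenated (spliced) to form the target feature matrix.
```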
  • In one embodiment, the width value of the preset anchor region is a fixed value, and the text region determining module 840 includes:
  • the horizontal position determining module 841 is configured to acquire the horizontal position of each predicted image sub-region according to the width value of the preset anchor point region and the first dimension coordinate corresponding to each element of the target feature matrix.
  • the vertical position determining module 842 is configured to acquire the vertical direction prediction offset of each predicted image sub-region and calculate, from the vertical direction prediction offset, the height value of the corresponding preset anchor region and the vertical component of its center coordinate, the predicted height value and the actual vertical offset of the center point for each predicted image sub-region.
  • the text position information determining module 843 is configured to determine text position information of each predicted image sub-area according to the horizontal position, the predicted height value, and the actual offset of the center point vertical direction.
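As an illustration, the text position of a sub-region can be decoded from the predicted offsets as sketched below, which inverts the regression targets v_c = (c_y − c_y^a)/h^a and v_h = log(h/h^a); the fixed anchor width of 16 pixels and the anchor heights are the example values mentioned elsewhere in the description:

```python
import math

ANCHOR_WIDTH = 16          # example fixed anchor width in pixels
ANCHOR_HEIGHTS = [7, 11, 18, 25, 35, 56, 67, 88, 100, 168, 278]

def decode_sub_region(col, anchor_idx, v_c, v_h, anchor_cy):
    """Recover the box of one predicted image sub-region from its anchor and offsets."""
    h_a = ANCHOR_HEIGHTS[anchor_idx]
    c_y = v_c * h_a + anchor_cy            # actual vertical centre of the text box
    h = math.exp(v_h) * h_a                # predicted height of the text box
    x_left = col * ANCHOR_WIDTH            # horizontal position follows from the fixed width
    x_right = x_left + ANCHOR_WIDTH
    y_top, y_bottom = c_y - h / 2.0, c_y + h / 2.0
    return x_left, y_top, x_right, y_bottom
```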
  • the text feature information includes text position information.
  • the text area determining module 840 includes:
  • the information obtaining module 840A is configured to obtain, as a candidate text sub-region, each of the predicted image sub-regions, and acquire first text position information corresponding to the current candidate text sub-region.
  • the adjacent area determining module 840B is configured to acquire, according to the first text position information, the target candidate text sub-regions whose distance from the current candidate text sub-region is less than a preset distance threshold and whose vertical overlap degree is greater than a preset overlap degree, and to use the target candidate text sub-region closest to the current candidate text sub-region as the adjacent candidate text sub-region.
  • the second loop module 840C is configured to acquire the next candidate text sub-region corresponding to the current candidate text sub-region as the current candidate text sub-region and return to the step of acquiring the first text position information corresponding to the current candidate text sub-region until all candidate text sub-regions have been traversed.
  • the text line generating module 840D is configured to connect the candidate text sub-region with the corresponding adjacent candidate text sub-region into a corresponding predicted text row.
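A sketch of this grouping step is shown below; the horizontal distance threshold of 50 pixels and the vertical overlap threshold of 0.7 are the example values given in the description, and boxes are assumed to be (x1, y1, x2, y2) tuples:

```python
def vertical_overlap(a, b):
    """Overlap of two boxes (x1, y1, x2, y2) along the vertical axis."""
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = max(a[3], b[3]) - min(a[1], b[1])
    return inter / union if union > 0 else 0.0

def group_into_lines(boxes, max_dx=50, min_overlap=0.7):
    """Link each candidate box to its nearest right-hand neighbour that is closer
    than max_dx horizontally and overlaps more than min_overlap vertically,
    then follow the links to form predicted text lines."""
    successor = {}
    for i, a in enumerate(boxes):
        best, best_dx = None, None
        for j, b in enumerate(boxes):
            if j == i:
                continue
            dx = b[0] - a[2]                       # horizontal gap to a candidate on the right
            if 0 <= dx < max_dx and vertical_overlap(a, b) > min_overlap:
                if best is None or dx < best_dx:
                    best, best_dx = j, dx
        if best is not None:
            successor[i] = best
    starts = set(range(len(boxes))) - set(successor.values())
    lines = []
    for s in starts:
        chain = [s]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        lines.append([boxes[k] for k in chain])
    return lines
```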
  • the text feature information includes text confidence
  • the information acquiring module 840A is further configured to obtain the text confidence corresponding to each predicted image sub-region, perform non-maximum suppression on the predicted image sub-regions according to the text confidence, and obtain the predicted image sub-regions whose text confidence is greater than a preset text confidence as the candidate text sub-regions.
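The confidence filtering and non-maximum suppression can be sketched as follows; the text confidence threshold of 0.7 is the example value from the description, while the IoU threshold used for suppression is an assumed value:

```python
def filter_candidates(boxes, scores, score_thresh=0.7, iou_thresh=0.5):
    """Keep sub-regions whose text confidence exceeds score_thresh and apply
    plain non-maximum suppression on the remaining boxes."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if scores[i] <= score_thresh:
            continue
        if all(iou(boxes[i], boxes[k]) < iou_thresh for k in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```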
  • the text detecting apparatus further includes:
  • the correction module 850 is configured to obtain the predicted horizontal boundary offset corresponding to each predicted text line and correct the horizontal boundary of the predicted text line according to the predicted horizontal boundary offset.
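For illustration, this boundary correction inverts the side-refinement regression target o = (x_side − c_x^a)/w^a, as in the sketch below; the anchor width of 16 is the assumed fixed value used in the earlier sketches:

```python
def refine_side(o_pred, anchor_cx, anchor_w=16):
    """Recover the corrected horizontal boundary x_side of a text line from the
    predicted offset o and the centre of the corresponding boundary anchor."""
    return o_pred * anchor_w + anchor_cx

# example: a boundary anchor centred at x = 168 with predicted offset -0.25
# gives refine_side(-0.25, 168) -> 164.0 as the corrected left edge
```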
  • the text detecting apparatus further includes:
  • the training data obtaining module 910 is configured to acquire model training data, where the model training data includes a sample image region set of a preset size ratio.
  • the training module 920 is configured to perform feature extraction on the sample image region set and input the extracted features to an initialized neural network model, the initialized neural network model being obtained by initializing the neural network model with Gaussian-distributed random numbers of a preset mean and variance; to acquire the feature matrix output by the initialized neural network model and map it through the fully connected layer to the corresponding sample image regions to obtain the corresponding sample image sub-regions; and to acquire the text feature information corresponding to each sample image sub-region and obtain the predicted text lines according to the text feature information and the preset text clustering algorithm.
  • the optimization module 930 is configured to repeatedly return to the step of acquiring model training data, optimize the text detection model according to a preset potential energy (momentum) term and a preset weight decay value, and obtain the target text detection model according to the target optimization function. A training-setup sketch under these settings is given below.
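The sketch below is PyTorch-style; the stand-in network, the default task weights theta1/theta2, and the loss normalisation are illustrative assumptions, while the Gaussian initialisation, momentum, weight decay, and learning-rate schedule follow the example values given in the description:

```python
import torch
import torch.nn as nn

def init_weights(module, var=0.001):
    """Gaussian initialisation with preset mean (0) and variance (0.001), as described."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=var ** 0.5)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 64, 3))  # stand-in network
model.apply(init_weights)

# SGD with a momentum (potential energy) term and weight decay; momentum 0.9,
# weight decay 0.0005, learning rate 0.001 dropped to 0.0001 after 90000 iterations.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90000], gamma=0.1)

def total_loss(cls_loss, vert_loss, side_loss, n_s, n_v, n_o, theta1=1.0, theta2=1.0):
    """Global objective: classification / N_s + theta1 * vertical / N_v + theta2 * side / N_o."""
    return cls_loss / n_s + theta1 * vert_loss / n_v + theta2 * side_loss / n_o
```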
  • Figure 14 is a diagram showing the internal structure of a computer device in one embodiment.
  • the computer device may specifically be a terminal.
  • the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and can also store computer readable instructions that, when executed by the processor, cause the processor to implement the text detection method.
  • the internal memory can also store computer readable instructions that, when executed by the processor, cause the processor to perform a text detection method.
  • the display screen of the computer device may be a liquid crystal display or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a button, a trackball, or a touchpad provided on the computer device casing, or an external keyboard, touchpad, or mouse.
  • Figure 15 is a diagram showing the internal structure of a computer device in one embodiment.
  • the computer device may specifically be a server.
  • the computer device includes a processor, a memory, and a network interface connected by a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and can also store computer readable instructions that, when executed by the processor, cause the processor to implement a text detection method.
  • the internal memory can also store computer readable instructions that, when executed by the processor, cause the processor to perform a text detection method.
  • FIG. 14 and FIG. 15 are only block diagrams of partial structures related to the solution of the present application, and do not constitute a limitation of the computer device to which the solution of the present application is applied.
  • the computer device may include more or fewer components than those shown in the figures, or some components may be combined, or have different component arrangements.
  • the text detecting apparatus may be implemented in the form of computer readable instructions, which may be run on a computer device as shown in FIG. 14 or FIG. 15.
  • the non-volatile storage medium can store the various program modules constituting the text detecting apparatus, such as the obtaining module 810, the feature matrix generating module 820, the text sub-region obtaining module 830, and the text region determining module 840 in FIG. 8.
  • Each program module includes computer readable instructions for causing a computer device to perform the steps in the text detection methods of the embodiments of the present application described in this specification; the processor in the computer device can invoke each program module of the text detecting apparatus stored in the non-volatile storage medium of the computer device and run the corresponding readable instructions to implement the functions corresponding to each module of the text detecting apparatus in this specification.
  • For example, the computer device can acquire the image to be detected through the acquiring module 810 of the text detecting apparatus shown in FIG. 8; input the image to be detected into the neural network model and output the target feature matrix through the feature matrix generating module 820; input the target feature matrix to the fully connected layer through the text sub-region obtaining module 830, so that the fully connected layer maps each element of the target feature matrix to the predicted image sub-regions of the image to be detected according to the preset anchor regions; and, through the text region determining module 840, acquire the text feature information of the predicted image sub-regions, connect the predicted image sub-regions into corresponding predicted text lines by the text clustering algorithm according to that text feature information, and determine the text region corresponding to the image to be detected.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).


Abstract

本申请提供一种文本检测方法,包括:计算机设备获取待检测图像;将待检测图像输入至神经网络模型,输出目标特征矩阵;将目标特征矩阵输入至全连接层,全连接层根据预设锚点区域将目标特征矩阵的各个元素映射到待检测图像对应的预测图像子区域;获取预测图像子区域的文本特征信息,根据预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定待检测图像对应的文本区域。

Description

文本检测方法、存储介质和计算机设备
本申请要求于2017年09月25日提交中国专利局,申请号为2017108749731,申请名称为“文本检测方法、装置、存储介质和计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别是涉及一种文本检测方法、存储介质和计算机设备。
背景技术
为了满足一定的应用需求需要对图像中的目标物体进行检测,由于通常的物体具有完整的封闭边界,因此通常的目标物体检测方法通过直接预测目标物体所在的候选区域,并能够根据预测候选区域中的目标物体的一部分特征推测物体类别,实现对目标物体的检测。
但是,文字与通常的物体不同,文字的边界是随着笔画变化的,且文字之间可能存在空格,通过一部分文字较难确定文字的类型,因此,利用传统的目标物体检测算法容易由于文字空格造成错检、漏检,且由于不能根据预测候选区域中的部分文字预测整个文字的类别,导致文字定位精准度不高,检测鲁棒性较低。
发明内容
根据本申请提供的各种实施例,提供一种文本检测方法、存储介质和计算机设备。
一种文本检测方法,包括:
计算机设备获取待检测图像;
所述计算机设备将所述待检测图像输入至神经网络模型,输出目标特征矩 阵;
所述计算机设备将所述目标特征矩阵输入至全连接层,所述全连接层根据预设锚点区域将所述目标特征矩阵的各个元素映射到所述待检测图像对应的预测图像子区域;及
所述计算机设备获取所述预测图像子区域的文本特征信息,根据所述预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定所述待检测图像对应的文本区域。
一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:
获取待检测图像;
将所述待检测图像输入至神经网络模型,输出目标特征矩阵;
将所述目标特征矩阵输入至全连接层,所述全连接层根据预设锚点区域将所述目标特征矩阵的各个元素映射到所述待检测图像对应的预测图像子区域;及
获取所述预测图像子区域的文本特征信息,根据所述预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定所述待检测图像对应的文本区域。
一个或多个存储有计算机可读指令的非易失性存储介质,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行如下步骤:
获取待检测图像;
将所述待检测图像输入至神经网络模型,输出目标特征矩阵;
将所述目标特征矩阵输入至全连接层,所述全连接层根据预设锚点区域将所述目标特征矩阵的各个元素映射到所述待检测图像对应的预测图像子区域;及
获取所述预测图像子区域的文本特征信息,根据所述预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确 定所述待检测图像对应的文本区域。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为一个实施例中文本检测方法的流程图;
图2为一个实施例中目标特征矩阵生成方法的流程图;
图3为另一个实施例中目标特征矩阵生成方法的流程图;
图3A为一个实施例中获取预测图像子区域的文本特征信息的流程图;
图4为一个实施例中预测文本行生成方法的流程图;
图5为一个实施例中文本检测模型训练方法的流程图;
图6为一个实施例中文本检测方法的原理架构图;
图7为一个具体实施例中文本检测方法的流程图;
图8为一个实施例中文本检测装置的结构框图;
图9为一个实施例中特征矩阵生成模块的结构框图;
图10为一个实施例中文本区域确定模块的结构框图;
图11为另一个实施例中文本区域确定模块的结构框图;
图12为另一个实施例中文本检测装置的结构框图;
图13为又一个实施例中文本检测装置的结构框图;
图14为一个实施例中计算机设备的内部结构图;
图15为另一个实施例中计算机设备的内部结构图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实 施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
可以理解,本申请所使用的术语“第一”、“第二”等可在本文中用于描述各种元件,但除非特别说明,这些元件不受这些术语限制。这些术语仅用于将第一个元件与另一个元件区分。
本申请实施例的文本检测方法可以应用于计算机设备中,计算机设备可以是独立的物理服务器或终端,也可以是多个物理服务器构成的服务器集群,可以是提供云服务器、云数据库、云存储和CDN等基础云计算服务的云服务器。终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱以及智能手表等,但并不局限于此。终端的显示屏可以是液晶显示屏或者电子墨水显示屏,计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。触摸层和显示屏构成触控屏。
如图1所示,在一个实施例中,提供一种文本检测方法,包括以下内容:
步骤S110,获取待检测图像。
具体地,待检测图像是指待进行文本检测的图像。检测待检测图像中是否包含文本区域以及确定文本区域的位置。待检测图像可以是身份证、名片、广告图片、视频截图等各种类型的图像,待检测图像中的文字尺度可以是任意的。
步骤S120,将待检测图像输入至神经网络模型,输出目标特征矩阵。
具体地,可以将待检测图像输入至神经网络模型进行特征提取,并对提取的特征进行相应卷积处理得到对应的目标特征矩阵。进一步地,也可以预先将神经网络模型作为特征提取器对待检测图像进行特征提取,然后将提取的特征输入到不同的神经网络模型中输出目标特征矩阵。如使用残差网络对待处理图像进行特征提取,残差网络的层数可以根据需要任意设置,且一般层数增多提取的图像特征提高。也可以使用VGG19、Res50、ResNet101等其他网络结构对待检测图像进行特征提取。将提取的特征输入至记忆网络模型进行处理输出目标特征矩阵。
其中,输入的待检测图像的尺度是可以变化的,对待检测图像进行特征提取得到的特征维度也是变化的。目标特征矩阵可以看作是表征图像特征值的序列。
步骤S130,将目标特征矩阵输入至全连接层,全连接层根据预设锚点区域将目标特征矩阵的各个元素映射到待检测图像对应的预测图像子区域。
其中,全连接层是指一种卷积层,可由卷积操作实现,在卷积神经网络中起到“分类器”的作用,能够将特征映射到样本空间。锚点区域确定原图的映射范围,表示检测模型关注的区域,通过对锚点区域做多个尺度和宽高比变换,能够实现对多尺度和宽高比的文字的检测。在一个实施例中,预设锚点区域的宽度为固定值。将锚点区域的宽度设置为固定值可以实现在预设宽度的区域范围内对待检测图像进行检测,在较小的范围内水平方向文本变化较小,能够提高文本检测的精确度。预设锚点区域的高度值可以发生变化,如将高度值设置为7,11,18,25,35,56,67,88,100,168,278等,通过变化的高度值实现使用锚点区域能够尽可能覆盖实际场景下形状多变的目标。
具体地,全连接层根据预设锚点区域将目标特征矩阵的各个元素对应的特征映射到待检测图像,得到各个特征在待检测图像中对应的图像子区域。进一步地,当锚点区域的宽度值固定时,映射到原图的特征对应的图像子区域宽度固定,只需要对图像子区域的宽度进行预测即可确定各个图像子区域的位置信息,当预设锚点区域宽度固定时,只需要对图像子区域的高度值进行预测减少了模型优化的搜索空间。
进一步地,将目标特征矩阵的各个元素映射回原图得到对应的图像子区域,对各个图像子区域进行文本检测,实现了对待检测图像的切分,将一个原始图像切分为若干个图像子区域进行文本检测。
步骤S140,获取预测图像子区域的文本特征信息,根据预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定待检测图像对应的文本区域。
其中,文本特征信息是指反映文本属性的信息,文本属性包括图像内的文本位置、文本置信度,预测图像子区域的文本特征信息包括预测图像子区域的 文本位置信息和文本置信度,文本位置信息可以通过预测2K垂直坐标偏移量,1K文本行水平边界偏移量确定,其中K为预设的锚点数目,可以根据需要预先设置。文本检测模型经过训练后对待检测图像进行文本检测时,能够给出各个预测图像子区域对应的预测竖直方向偏移量,根据回归方程能够根据文本检测模型预测得到的竖直方向偏移量获取到各个预测图像子区域对应的实际竖直方向偏移量和高度值,从而确定各个预测图像子区域的文本位置信息。文本置信度是指预设图像子区域包含的内容为文本的概率。文本聚类算法是指能够实现将图像子区域连接成对应的文本行的算法或者预定义规则。如输入身份证的图片,能够获取得到身份证中各个文字的左上角和右下角坐标以及置信度。
具体地,根据预测图像子区域对应的文本位置信息和文本置信度,按照一定规则获取处于同一个文本行的图像子区域进行连接,将多个图像子区域连接成对应的文本行,将单个图像子区域连接成对应的文本行能够以行为单位从整体上确定待检测图像对应的文本区域,避免由于图像子区域位置提取到的文字中存在空格造成误检。
本实施例中,将待检测图像输入至神经网络模型得到目标特征矩阵,通过全连接层将目标特征矩阵根据预设锚点区域映射到待检测图像对应的图像子区域,将图像子区域连接成预测文本行从而确定待检测图像的文本区域。通过神经网络模型得到待处理图像对应的目标特征矩阵,并通过全连接层将目标特征矩阵的各个元素映射到待检测图像对应的位置得到对应的图像子区域,获取图像子区域的文本特征信息,使用文本特征信息反映预测图像子区域的文本特征,实现对待检测图像的切分处理,通过各个预测图像子区域检测待检测图像的文本特征,进一步地,根据预测图像子区域的文本特征信息和文本聚类算法,将相邻的预测图像子区域连接成对应的文本行,实现在较小的范围对文本进行检测,由于在小范围内文本通常变化比较小,提高了检测的精确度,通过文本聚类算法将相邻的预测图像子区域生成对应的文本行,由于对预测图像子区域进行了合并,因此,即使文字中存在空格,对相邻预测图像子区域进行合并后,能够将包含空格的字符合并成完整的字符,提高了文本检测的鲁棒性。
如图2所示,在一个实施例中,步骤S120包括:
步骤S121,对待检测图像进行特征提取得到第一特征矩阵,第一特征矩阵中的元素为二维元素。
具体地,利用残差网络作为多层卷积特征提取器对待检测图像进行特征提取,得到经过多层卷积得到的特征矩阵。提取得到的特征矩阵中的元素为二维元素,能够表征特征对应的位置。进一步地,使用的残差网络的层数可以根据需要设置,如设置为50层,使用Res50对待检测图像进行特征提取,一般残差网络层数增多提取的图像特征提高,但增加到一定层数如152层后效果提高逐渐不明显。
在其他实施例中,也可以使用VGG19、ResNet101等其他网络结构对待检测图像进行特征提取。
步骤S122,将第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵。
其中,长短期记忆网络模型是指LSTM(Long Short-Term Memory),一种时间递归神经网络。双向长短期记忆网络模型包括前向长短期记忆网络模型和后向长短期记忆网络模型。
具体地,对待检测图像进行特征提取后反映的是图像的局部信息,而一个单词或句子通常包括多个字符,且字符之间具有很强的关联性,为了反映图像的全局信息,将提取到的特征输入至LSTM中挖掘文字区域包含的序列信息,获取字符之间的关联关系。利用两个长短期记忆网络模型分别对左右两侧字符序列进行建模,形成完成的序列信息,利用特征矩阵反映对应的序列信息。具体地,分别将第一特征矩阵输入到前向长短期记忆网络模型和后向长短期记忆网络模型,前向长短期记忆网络模型对第一特征矩阵进行处理得到前向特征矩阵,前向特征矩阵反映前向序列信息,后向长短期记忆网络模型对第一特征矩阵进行处理得到后向特征矩阵,后向特征矩阵反映后向序列信息,序列信息表征特征元素对应的图像子区域之间的连接关系。
步骤S123,将前向特征矩阵和后向特征矩阵拼接得到目标特征矩阵。
具体地,将前向特征矩阵和后向特征矩阵拼接得到目标特征矩阵,由于前向特征矩阵反映前向序列信息,后向特征矩阵反映后向序列信息,因此目标特 征矩阵能够反映各个元素对应的图像子区域的序列信息,表征各个元素对应的图像子区域的连接关系。
本实施例中,预先对待检测图像进行特征提取,利用提取得到的特征进行处理得到目标特征矩阵,将对原图像的处理转化为对原图像对应的特征的处理,极大的降低了信息处理的维度,进一步地,图片共用特征提取层避免了重复计算的问题,提高了信息处理的效率。且使用双向长短期记忆网络模型分别对前向和后向序列信息进行提取,能够更完整的反映特征元素之间的关联关系,提高后续文本区域确定的准确性。
如图3所示,在一个实施例中,步骤S122包括:
步骤S122A,获取当前滑窗矩阵的当前位置,根据当前位置计算当前滑窗矩阵与第一特征矩阵的当前卷积结果,当前滑窗矩阵包括前向滑窗矩阵和后向滑窗矩阵。
其中,滑窗矩阵是指能够滑动,并且在滑动的各个位置与目标矩阵进行卷积的矩阵。滑窗矩阵可以为根据需要设置的卷积核,滑窗矩阵的尺度可以通过设置对应的滑窗尺度确定,如设置滑窗矩阵对应的滑窗尺度为3*3,则滑窗矩阵为3*3的矩阵。
具体地,由于将第一特征矩阵分别输入到了前向长短期记忆网络模型和后向长短期记忆网络模型,而前向和后向长短期记忆网络模型提取的特征不同,也就是与第一特征矩阵卷积结果不同,因此,在前向长短期记忆网络模型和后向长短期记忆网络模型分别设置不同的滑窗矩阵与第一特征矩阵进行卷积得到对应的目标特征矩阵。进一步地,滑窗矩阵在不同的位置与第一特征矩阵进行卷积得到不同的卷积结果,获取滑窗矩阵当前位置,并将滑窗矩阵处于当前位置时第一特征矩阵与滑动矩阵重叠的部分与滑窗矩阵进行卷积得到对应的卷积结果。
步骤S122B,利用激活函数根据当前卷积结果与当前滑窗矩阵的前一个位置对应的长短期记忆网络模型的内部状态值得到长短期记忆网络模型当前位置对应的内部状态值。
其中,激活函数是指用于更新神经网络参数的函数。利用滑窗矩阵在当前 位置时对应的卷积结果和前一个位置对应的神经网络模型的内部状态值计算得到长短期记忆网络模型当前位置对应的内部状态值。
具体地，使用sigmoid函数σ(x)作为激活函数，表达式如下：
σ(x) = 1 / (1 + e^(-x))
利用激活函数周期性地更新长短期记忆网络模型当前位置对应的内部状态值H(t)：
H_t = σ(H_{t-1}, x_t)
其中，x_t表示滑动窗口在t时刻对应的位置处与第一特征矩阵产生的卷积结果，H_{t-1}表示t-1时刻长短期记忆网络模型的内部状态值。进一步地，若输入的长短期记忆网络模型为双向长短期记忆网络模型，且前向和后向长短期记忆网络模型内部状态维度为256维，则H(t)∈R^256，R表示实数集合。
步骤S122C,滑动当前滑窗矩阵得到下一个位置,进入步骤S122A,直至当前滑窗矩阵遍历第一特征矩阵的元素。
具体地,滑窗矩阵能够在第一特征矩阵上滑动,每次移动一个像素位置,滑窗矩阵滑动到每一个位置对应一个卷积结果,得到当前位置滑窗矩阵对应的神经网络模型的内部状态值后,滑动当前滑窗矩阵到下一个位置,进入步骤S122A,计算滑窗矩阵在滑动后当前所处的位置对应的神经网络模型的内部状态值,重复执行上述过程,直至当前滑窗矩阵遍历第一特征矩阵的元素,得到当前滑窗每一个位置对应的神经网络模型的内部状态值。特别地,若设置预设锚点区域的宽度为固定值如16,则滑窗矩阵在第一特征矩阵上滑动一个像素,对应于待检测图像中16个像素。
步骤S122D,将各个当前滑窗矩阵在不同位置对应的内部状态值进行处理生成当前特征矩阵。
具体地,长短期记忆网络模型对应的内部状态值是长短期记忆网络模型对第一特征矩阵进行处理的中间结果,需要进一步地对内部状态值进行映射或卷积等处理生成对应的当前特征矩阵,当前特征矩阵包括前向特征矩阵和后向特征矩阵,将前向特征矩阵和后向特征矩阵拼接成目标特征矩阵输出。
本实施例中,分别在前向和后向长短期记忆网络模型中利用不同的滑窗矩 阵与第一特征矩阵进行卷积,分别在前向和后向长短期记忆网络模型中得到滑窗矩阵在各个位置对应的卷积结果,并使用激活函数计算各个位置长短期记忆网络模型对应的内部状态值,根据得到的内部状态值进行处理得到对应的当前特征矩阵。通过将滑窗矩阵在第一特征矩阵上进行卷积,避免了在原图进行滑窗带来的重复计算的问题,减少了滑窗操作消耗的时间和计算资源。
如图3A所示,在一个实施例中,预设锚点区域的宽度值为固定值,获取预测图像子区域的文本特征信息的步骤包括:
步骤S141,根据预设锚点区域的宽度值和目标特征矩阵的各个元素对应的第一维度坐标获取各个预测图像子区域的水平位置。
具体地,预设锚点区域的宽度值为固定值,宽度值可以经验设置,如设置为16像素。当预设锚点区域的宽度值确定时,通过全连接层映射到待检测图像的各个预测图像子区域的宽度值固定,且预设锚点区域位置固定,根据目标特征矩阵在全连接层中的位置即可确定根据预设锚点区域映射到原图中的水平位置。
步骤S142,获取各个预测图像子区域的竖直方向预测偏移量,根据竖直方向预测偏移量、对应的预设锚点区域的高度值和中心坐标数值分量进行计算,分别得到各个预测图像子区域对应的预测高度值和中心点竖直方向实际偏移量。
具体地,使用文本检测模型对待检测图像进行文本检测,预先对文本检测模型进行训练,使得文本检测模型在对待检测图像处理的过程中能够预测得到各个预测图像子区域对应的预测中心点竖直分量,然后根据预测中心点竖直分量反推得到各个图像子区域对应的预测高度值和实际中心点竖直分量。如利用下述公式进行计算:
v_c = (c_y − c_y^a) / h^a，v_h = log(h / h^a)
其中，v_c表示文本方块中心点竖直分量的回归目标的预测值，c_y表示预测的文本方块中心点的竖直分量，c_y^a表示对应的预设锚点区域的中心坐标的竖直分量，h^a表示对应预设锚点区域的高度，v_h表示文本方块的高度回归目标的预测值，h表示预测的文本方块的高度。
获取得到待检测图像对应的预测图像子区域后，能够根据2K竖直坐标分量偏移量预测任务得到各个预测文本行对应的预测竖直方向偏移量v_c，利用上述竖直偏移量计算公式，能够反推得到各个预测文本行对应的实际中心点竖直分量c_y、预测图像子区域的高度h。
步骤S143,根据水平位置、预测高度值和中心点竖直方向实际偏移量确定各个预测图像子区域的文本位置信息。
具体地,根据得到的预测图像子区域的水平位置,预测高度值和中心点竖直方向偏移量,并根据预设锚点区域的位置能够确定各个预测图像子区域在待检测图像中对应的坐标,从而确定各个预测图像子区域的文本位置信息。
本实施例中,通过模型预测得到各个预测图像子区域对应的水平位置、高度值以及水平方向偏移量,确定各个预测图像子区域在待检测图像中的坐标,从而确定各个预测图像子区域的文本位置信息,为后续对预测图像子区域进行连接组成文本行提供依据,且预设锚点区域的宽度值为固定值,在预设的水平范围内检测出的文本可行度更高,进一步地,只需要对预测图像子区域的高度值进行预测,减少了模型优化的搜索空间。
如图4所示,在一个实施例中,文本特征信息包括文本位置信息,步骤S140包括:
步骤S140A,将各个预测图像子区域作为候选文本子区域,获取当前候选文本子区域对应的第一文本位置信息。
具体地,候选文本子区域是指待检测图像中预测为文本的子区域,将根据目标特征矩阵映射到原图的预测图像子区域作为候选文本子区域。根据获取的预测图像子区域对应的文本特征信息,获取当前候选文本子区域对应的第一文本位置信息。当前候选文本子区域可以是任意选择的一个候选文本子区域,获取该文本子区域对应的位置信息。
步骤S140B,根据第一文本位置信息获取与当前候选文本子区域的距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域,将距离当前候选文本子区域最近的目标候选文本子区域作为相邻候选文本子区域。
具体地,由于每一个独立的预测图像子区域代表了图像每一位置的特征,一个图像子区域对应的文本可能不是完整的文本,需要将相邻的文本组合在一起才能准确的预测完整的文本信息。文本一般以文本行为单位,处于同一文本行的两个相邻的图像子区域距离较近,因此,通过设置水平方向和竖直方向的条件获取当前候选文本子区域对应的相邻候选文本子区域。
预先设置两个候选文本子区域在水平方向的距离阈值,距离阈值可以根据经验设定也可以根据各个预测图像子区域的位置信息设定。预先设置两个候选文本子区域在竖直方向上的重叠度,由于处于同一文本行的文本子区域基本位于同一直线上,在竖直方向应该有较高的重叠度,可以根据经验设定重叠度的值。如预先设置水平方向的距离阈值为50像素,竖直方向上的重叠度为0.7。
获取与当前候选文本子区域的水平距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域,然后在目标候选文本子区域中选择水平方向距离当前候选文本子区域最近的目标候选文本子区域作为当前候选文本子区域的相邻候选文本子区域。
步骤S140C,获取当前候选文本子区域对应的下一个候选文本子区域,将下一个候选文本子区域作为当前候选文本子区域,进入获取当前候选文本子区域对应的第一文本位置信息的步骤,直至遍历候选文本子区域。
具体地,依次将每一个候选文本子区域作为当前候选文本子区域,重复确定相邻候选文本子区域的过程,直至确定每一个候选文本子区域对应的相邻候选文本子区域。
步骤S140D,将当前候选文本子区域与对应的相邻候选文本子区域连接成对应的预测文本行。
具体地,将每一个候选文本子区域与对应的相邻候选文本子区域连接,实现同一个文本行对应的候选文本子区域能够相互连接得到对应的文本行区域。从而,以行为单位确定待检测图像中的文本区域。
本实施例中,通过预设条件获取每一个候选文本子区域对应的相邻候选文本子区域,将每一个候选文本子区域与相邻候选文本子区域相连预测得到待检测图像对应的文本行,以行为单位反映待检测图像的文本区域,避免了由于单 个候选文本子区域得到的文本信息不完整的问题,能够更加精确的反映待检测图像的文本区域。
在一个实施例中,文本特征信息包括文本置信度,将各个预测图像子区域作为候选文本子区域的步骤包括:获取各个预测图像子区域对应的文本置信度;根据文本置信度对各个预测图像子区域进行非极大值抑制,得到文本置信度大于预设文本置信度的预测图像子区域作为候选文本子区域。
具体地,得到预测图像子区域后,检测器检测各个预测图像子区域对应的文本置信度,判定每一个预测图像子区域为文本子区域的概率,由于通过全连接层映射到待检测图像,每一个锚点区域对应的预测图像子区域可能有很多个,为了能够更好的生成对应的文本行,对预测图像子区域进行筛选,获取文本置信度大于预设文本置信度的预测图像子区域,如设置预设文本置信度为0.7,根据预测图像子区域对应的文本置信度获取文本置信度大于0.7的预测图像子区域,将满足条件的预测图像子区域作为候选文本子区域,进行后续连接成文本行的操作。
本实施例中,在根据预设条件获取各个预测图像子区域对应的相邻图像子区域之前,预先根据文本置信度对预测图像子区域进行筛选,将文本置信度超过预设文本置信度的预测图像子区域作为候选文本子区域,减少获取相邻文本子区域的计算时间,提高预测得到的文本行的准确度,提高后续文本识别结果的准确性。
在一个实施例中,步骤S140之后还包括:获取各个预测文本行对应的预测水平方向偏移量,根据所述预测水平边界偏移量修正所述预测文本行的水平边界。
具体地,使用文本检测模型对待检测图像进行文本检测,预先对文本检测模型进行训练,使得文本检测模型在对待检测图像处理的过程中能够预测得到各个预测文本行对应的预测水平边界偏移量,然后根据预测水平边界偏移量反推得到各个预测文本行对应的实际水平边界偏移量。例如,利用下述公式获取各个文本行对应的水平方向偏移量:
O = (x_side − c_x^a) / w^a
其中，O表示预测的水平方向偏移量回归目标，x_side表示当前细分的文本方块相对于原始未切分文本方块的左侧偏移量的预测值，c_x^a表示对应锚点中心点水平分量，w^a表示当前锚点/文本候选区的宽度，为固定值。具体地，文本检测模型经过训练后具备边界预测能力，能够预测得到文本行水平方向偏移量O，并根据O的回归公式得到预测文本行实际水平偏移量x_side。
获取得到待检测图像对应的预测图像子区域后，能够通过1K文本行水平方向偏移量预测任务获取得到各个预测文本行对应的预测水平边界偏移量O，利用上述水平偏移量计算公式，能够反推得到各个预测文本行对应的实际水平偏移量x_side。
本实施例中,由于预设锚点宽度,使得候选文本子区域确定的文本行为预设锚点宽度的倍数,但真实文本行的宽度不一定均是锚点宽度的倍数,通过预测文本水平方向真实偏移量与标定的文本行边界真值的差值修正误差,提高待检测图像预测的文字区域的准确性。
如图5所示,在一个实施例中,步骤S110之前还包括:
步骤S210,获取模型训练数据,模型训练数据包括预设尺寸比例的样本图像区域集合。
其中,样本图像区域集合是指参与模型训练的样本图像区域的集合,样本图像区域可以是通过随机在图像库中随机采集图像中的部分区域得到的。图像库中包括多个图像,随机采集图像库中的图像的部分区域,能够获取大量的训练数据,多次对模型进行训练。
具体地,在对模型进行训练时,将得到的样本图像区域的宽高比缩放到预设尺寸,如600,保证样本图像区域尺寸一致性,便于对特征进行提取分析。进一步地,可以设置每次模型训练数据的样本数目为128,正负样本的比例为1:1。正样本是指包含文字的样本图像区域,负样本是指不包含文字的样本图像区域。
步骤S220,对样本图像区域集合进行特征提取并输入至初始化神经网络模型,初始化神经网络模型通过预设均值和方差的高斯分布随机数初始化神经网 络模型得到。
其中,通过预设均值和方差的高斯分布随机数初始化神经网络模型,对神经网络模型进行优化得到初始化神经网络模型,如设置均值为0方差为0.001的高斯分布随机数对神经网络模型进行优化。
具体地,将神经网络模型与特征提取器一起训练,利用特征提取器对样本图像区域集合进行特征提取,并将提取到的特征输入至神经网络模型进行处理,能够得到样本图像区域的全局特征信息。对样本图像区域集合中的每个样本图像区域进行特征提取,并输入至初始化神经网络模型,以使初始化神经网络模型对提取到的特征进行处理,得到对应的序列信息,输出对应的特征矩阵。
步骤S230,获取初始化神经网络模型输出的特征矩阵,将特征矩阵通过全连接层映射到对应的样本图像区域得到对应的样本图像子区域。
具体地,根据初始化神经网络模型输出的特征矩阵通过全连接层映射到对应的样本图像区域,每个样本图像区域对应一个特征矩阵,特征矩阵映射到对应的样本图像区域,得到对应的样本图像子区域。
步骤S240,获取各个样本图像子区域对应的文本特征信息,并根据文本特征信息和预设文本聚类算法得到预测文本行。
具体地,可以通过分类或回归获取各个样本图像子区域对应的文本特征信息,如利用下述公式对预测各个预测样本图像子区域对应的中心点竖直方向偏移量进行训练。
v_c = (c_y − c_y^a) / h^a，v_h = log(h / h^a)
v_c* = (c_y* − c_y^a) / h^a，v_h* = log(h* / h^a)
其中，v_c表示文本方块中心点竖直分量的回归目标的预测值，c_y表示预测的文本方块中心点的竖直分量，c_y^a表示对应的预设锚点区域的中心坐标的竖直分量，h^a表示对应预设锚点区域的高度，v_h表示文本方块的高度回归目标的预测值，h表示预测的文本方块的高度，v_c*表示文本方块中心点竖直分量的回归目标的真值，c_y*表示方块中心点竖直分量的真值，v_h*表示方块的高度回归目标的真值，h*表示文本方块的高度的真值。
v_c是模型训练过程中预测得到的各个预测样本图像子区域中心点竖直方向的偏移量，v_c*是用来监督样本图像子区域中心点竖直分量预测的真值，通过v_c*中的各个参数对v_c中的各个参数进行监督训练，在训练过程中使得v_c的值尽可能接近v_c*的值，以使在对待检测图像进行检测时，能够预测中心点竖直方向的偏移量，且预测得到的偏移量较为准确。
在一个实施例中,根据下述公式对预测文本行水平边界偏移量进行训练。
O = (x_side − c_x^a) / w^a，O* = (x_side* − c_x^a) / w^a
其中，O表示预测的水平方向偏移量回归目标，x_side表示当前细分的文本方块相对于原始未切分文本方块的左侧偏移量的预测值，c_x^a表示对应锚点中心点水平分量，w^a表示当前锚点/文本候选区的宽度，为固定值，O*表示当前细分的文本方块相对于原未切分文本方块左侧偏移量的回归目标真值，x_side*表示当前细分的文本方块相对于原始未切分文本方块的左侧的偏移量的真值。
预测各个预测图像子区域组成的文本行的水平边界偏移量O，并通过标定的真值O*进行监督训练，通过不断的训练使得O的值尽可能的接近O*的值，以使在对待检测图像进行检测时，能够预测到较为准确的文本行水平边界偏移量。
具体地,根据上述训练过程,获取各个样本图像区域对应的文本特征信息,文本特征信息包括文本位置信息和文本置信度,根据各个样本图像区域对应的样本图像子区域的文本特征信息,预测得到各个图像样本区域对应的文本行区域,得到一组样本图像区域训练的数据,根据样本图像区域的真实数据调整文本检测模型参数。
步骤S250,重复进入步骤S210,根据预设势能项和预设权重衰减值对文本检测模型进行优化训练,根据目标优化函数得到目标文本检测模型。
具体地,势能项是维持模型稳定的一个参数,权重衰减值是防止过拟合的一个参数。重复对文本检测模型进行训练,如在完成一次训练后,再次随机获取对应的样本图像区域集合作为训练数据,设置初始学习率,学习率是指模型参数的更新系统,新模型参数需要旧模型参数乘以学习率计算得到,学习率可以根据经验设置,如设置为0.001,对模型训练90000次后将迭代至0.0001,然后再进行10000次迭代训练,更新模型参数。
根据预设势能项和权重衰减对模型进行SGD(Stochastic Gradient Descent,随机梯度下降)优化,如设置势能项为0.9,权重衰减为0.0005。设置预设势能项防止训练过程中的抖动,能够提高模型优化过程中的稳定性,避免出现在极端点上跳转。具体地,如设置模型优化的目标函数如下:
L(s_i, v_j, o_k) = (1/N_s)Σ_i L_s^cl(s_i, s_i*) + (θ_1/N_v)Σ_j L_v^re(v_j, v_j*) + (θ_2/N_o)Σ_k L_o^re(o_k, o_k*)
其中，L(s_i, v_j, o_k)表示全局优化目标函数，L_s^cl、L_v^re、L_o^re分别表示文本分类、文本定位、边界优化任务的损失函数，s_i表示第i个锚点被预测为文本的概率，s_i*表示第i个锚点是否为文本的真值，v_j表示第j个锚点竖直方向坐标预测值，v_j*表示第j个锚点竖直方向坐标的真值；o_k表示第k个边界锚点相对边界的水平偏移量预测值，o_k*表示第k个边界锚点相对边界的水平偏移量真值。θ_1和θ_2分别为文本定位任务、边界优化任务的损失权重。N_s、N_v、N_o分别表示每个训练批次中文本分类、文本定位、边界优化任务用到的锚点数目。
根据目标优化函数对文本检测模型进行优化,确定文本检测模型对应的各个参数,得到训练后的目标文本检测模型,对输入的待检测图像进行文本检测。
本实施例中,通过获取样本图像区域作为模型训练数据,使用文本检测模型对样本图像区域进行文本检测,不断重复训练过程,并预设势能项和衰减权重以及学习率,建立目标优化函数对文本检测模型进行优化,最终确定文本检测模型的参数,得到优化后的文本检测模型,用于对实际的待检测图像进行文本预测。通过大量的训练数据以及目标优化函数不断的训练和优化文本检测模型,并且将神经网络模型和特征提取器结合训练,对提取得到的特征进行进一步处理,获取样本图像区域的全局文本信息,提高了文本检测模型预测待检测图像中文本区域的准确性。
如图6所示,为一个实施例中,文本检测方法的原理架构图。首先,将使用50层的残差网络600对待检测图像进行特征提取,经过多层卷积网络特征提取得到res4f特征610,将res4f特征输入至双向长短期记忆网络LSTM620建立文本候选区序列,然后将文本候选区序列经过全连接层FC630进行特征映射,并根据映射结果预测2K垂直坐标偏移量,2K文本置信度以及1K边界优化值,其中K为res4f 上每个像素上的锚点数目。
通过预测竖直坐标偏移量和水平边界偏移量确定文本候选区位置信息,根据预测的文本置信度确定候选区域是否为文本区域,实现对待检测图像中文本区域的预测。
如图7所示,在一个具体实施例中,提供一种文本检测算法,包括以下内容:
步骤S301,获取待检测图像。
步骤S302,对待检测图像进行特征提取得到第一特征矩阵,将第一特征矩阵输入双向长短期记忆网络模型。
步骤S303,获取当前滑窗矩阵的当前位置,根据当前位置计算当前滑窗矩阵与第一特征矩阵的当前卷积结果,当前滑窗矩阵包括前向滑窗矩阵和后向滑窗矩阵。
步骤S304,利用激活函数根据当前卷积结果与当前滑窗矩阵的前一个位置对应的神经网络模型的内部状态值得到神经网络模型当前位置对应的内部状态值。
步骤S305,滑动当前滑窗矩阵得到下一个位置,进入步骤S303,直至当前滑窗矩阵遍历第一特征矩阵的元素。
步骤S306,将各个当前滑窗矩阵在不同位置对应的内部状态值进行处理生成当前特征矩阵,当前特征矩阵包括前向特征矩阵和后向特征矩阵。
步骤S307,将前向特征矩阵和后向特征矩阵拼接得到目标特征矩阵,输出目标特征矩阵至全连接层,全连接层根据预设宽度的锚点区域将目标特征矩阵的各个元素映射到待检测图像对应的预测图像子区域。
步骤S308,获取预测图像子区域的文本特征信息,文本特征信息包括文本置信度和文本位置信息。
步骤S309,根据文本置信度对各个预测图像子区域进行非极大值抑制,将文本置信度大于预设文本置信度的预测图像子区域作为候选文本子区域。
步骤S310,获取当前候选文本子区域对应的第一文本位置信息,根据第一文本位置信息获取与当前候选文本子区域的距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域。
步骤S311,将距离当前候选文本子区域最近的目标候选文本子区域作为相邻候选文本子区域。
步骤S312,获取当前候选文本子区域对应的下一个候选文本子区域作为当前候选文本子区域,进入步骤S310,直至遍历候选文本子区域。
步骤S313,将候选文本子区域与对应的相邻候选文本子区域连接成对应的预测文本行,对预测文本行进行边界修正,确定待检测图像对应的文本区域。
本实施例中,首先对待检测图像进行特征提取,然后将提取的特征输入至双向长短期记忆网络模型得到目标特征矩阵,通过全连接层将目标特征矩阵根据预设锚点区域映射到待检测图像对应的图像子区域,并根据图像子区域的文本位置信息和文本置信度确定候选文本子区域,并选取候选文本子区域的相邻子区域,将相邻的候选文本子区域相连生成预测文本行从而确定待检测图像的文本区域。首先对待检测图像进行特征提取然后再通过双向长短期记忆网络模型对提取的特征进行提取,降低了图像处理的维度,提高了计算效率,将得到的目标特征矩阵通过全连接层映射到待检测图像对应的位置得到对应的图像子区域,获取图像子区域的文本特征信息,实现对待检测图像的切分处理,通过各个预测图像子区域检测待检测图像的文本特征,并且预设锚点区域为固定宽度值,使得获取的预测图像子区域的宽度值固定,在较小的范围对文本进行检测,由于在小范围内文本通常变化比较小,提高了检测的精确度,通过文本聚类算法将相邻的预测图像子区域生成对应的文本行,由于对预测图像子区域进行了合并,因此,即使文字中存在空格,对相邻预测图像子区域进行合并后,能够将包含空格的字符合并成完整的字符,提高了文本检测的鲁棒性。
应该理解的是,虽然本申请各实施例中的各个步骤并不是必然按照步骤标号指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,各实施例中至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
如图8所示,在一个实施例中,提供一种文本检测装置包括:
获取模块810,用于获取待检测图像。
特征矩阵生成模块820,用于将待检测图像输入至神经网络模型,输出目标特征矩阵。
文本子区域获取模块830,用于将目标特征矩阵输入至全连接层,全连接层根据预设锚点区域将目标特征矩阵的各个元素映射到待检测图像对应的预测图像子区域。
文本区域确定模块840,用于获取预测图像子区域的文本特征信息,根据预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定待检测图像对应的文本区域。
本实施例中,文本检测装置将待检测图像输入至神经网络模型得到目标特征矩阵,通过全连接层将目标特征矩阵根据预设锚点区域映射到待检测图像对应的图像子区域,将图像子区域连接成预测文本行从而确定待检测图像的文本区域。通过神经网络模型得到待处理图像对应的目标特征矩阵,并通过全连接层将目标特征矩阵的各个元素映射到待检测图像对应的位置得到对应的图像子区域,获取图像子区域的文本特征信息,使用文本特征信息反映预测图像子区域的文本特征,实现对待检测图像的切分处理,通过各个预测图像子区域检测待检测图像的文本特征,实现在较小的范围对文本进行检测,由于在小范围内文本通常变化比较小,提高了检测的精确度,通过文本聚类算法将相邻的预测图像子区域生成对应的文本行,由于对预测图像子区域进行了合并,即使文字中存在空格,对相邻预测图像子区域进行合并后,能够将包含空格的字符合并成完整的字符,提高了文本检测的鲁棒性。
在一个实施例中,特征矩阵生成模块820还用于对待检测图像进行特征提取得到第一特征矩阵,第一特征矩阵中的元素为二维元素,将第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵,将前向特征矩阵和后向特征矩阵拼接得到目标特征矩阵。
如图9所示,在一个实施例中,特征矩阵生成模块820包括:
卷积模块821,用于获取当前滑窗矩阵的当前位置,根据当前位置计算当前 滑窗矩阵与第一特征矩阵的当前卷积结果,当前滑窗矩阵包括前向滑窗矩阵和后向滑窗矩阵。
更新模块822,用于利用激活函数根据当前卷积结果与当前滑窗矩阵的前一个位置对应的长短期记忆网络模型的内部状态值得到长短期记忆网络模型当前位置对应的内部状态值。
第一循环模块823,用于滑动当前滑窗矩阵得到下一个位置,进入获取当前滑窗矩阵的当前位置的步骤,直至当前滑窗矩阵遍历第一特征矩阵的元素。
生成模块824,用于将各个当前滑窗矩阵在不同位置对应的内部状态值进行处理生成当前特征矩阵。
如图10所示,在一个实施例中,预设锚点区域的宽度值为固定值,文本区域确定840模块包括:
水平位置确定模块841,用于根据预设锚点区域的宽度值和目标特征矩阵的各个元素对应的第一维度坐标获取各个预测图像子区域的水平位置。
竖直位置确定模块842,用于获取各个预测图像子区域的竖直方向预测偏移量,根据竖直方向预测偏移量、对应的预设锚点区域的高度值和中心坐标数值分量进行计算,分别得到各个预测图像子区域对应的预测高度值和中心点竖直方向实际偏移量。
文本位置信息确定模块843,用于根据水平位置、预测高度值和中心点竖直方向实际偏移量确定各个预测图像子区域的文本位置信息。
如图11所示,在一个实施例中,文本特征信息包括文本位置信息。文本区域确定模块840包括:
信息获取模块840A,用于将各个预测图像子区域作为候选文本子区域,获取当前候选文本子区域对应的第一文本位置信息。
相邻区域确定模块840B,用于根据第一文本位置信息获取与当前候选文本子区域的距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域,将距离当前候选文本子区域最近的目标候选文本子区域作为相邻候选文本子区域。
第二循环模块840C,用于获取当前候选文本子区域对应的下一个候选文本 子区域作为当前候选文本子区域,进入获取当前候选文本子区域对应的第一文本位置信息的步骤,直至遍历候选文本子区域。
文本行生成模块840D,用于将候选文本子区域与对应的相邻候选文本子区域连接成对应的预测文本行。
在一个实施例中,文本特征信息包括文本置信度,信息获取模块840A还用于获取各个预测图像子区域对应的文本置信度,根据文本置信度对各个预测图像子区域进行非极大值抑制,得到文本置信度大于预设文本置信度的预测图像子区域作为候选文本子区域。
如图12所示,在一个实施例中,文本检测装置还包括:
修正模块850,用于获取各个预测文本行对应的预测水平方向偏移量,根据预测水平边界偏移量修正预测文本行的水平边界。
如图13所示,在一个实施例中,文本检测装置还包括:
训练数据获取模块910,用于获取模型训练数据,模型训练数据包括预设尺寸比例的样本图像区域集合。
训练模块920,用于对样本图像区域集合进行特征提取并输入至初始化神经网络模型,初始化神经网络模型通过预设均值和方差的高斯分布随机数初始化神经网络模型得到,获取初始化神经网络模型输出的特征矩阵,将特征矩阵通过全连接层映射到对应的样本图像区域得到对应的样本图像子区域,获取各个样本图像子区域对应的文本特征信息,并根据文本特征信息和预设文本聚类算法得到预测文本行。
优化模块930,用于重复进入获取模型训练数据的步骤,根据预设势能项和预设权重衰减值对文本检测模型进行优化训练,根据目标优化函数得到目标文本检测模型。
图14示出了一个实施例中计算机设备的内部结构图。该计算机设备具体可以是终端。如图14所示,该计算机设备包括该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、输入装置和显示屏。其中,存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理 器实现文本检测方法。该内存储器中也可储存有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器执行文本检测方法。计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
图15示出了一个实施例中计算机设备的内部结构图。该计算机设备具体可以是服务器。如图15所示,该计算机设备包括该计算机设备包括通过系统总线连接的处理器、存储器以及网络接口。其中,存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器实现文本检测方法。该内存储器中也可储存有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器执行文本检测方法。
本领域技术人员可以理解,图14、图15中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,本申请提供的文本检测装置可以实现为一种计算机可读指令的形式,计算机可读指令可在如图14以及图15所示的计算机设备上运行,计算机设备的非易失性存储介质可存储组成该文本检测装置的各个程序模块,比如图8中的获取模块810、特征矩阵生成模块820、文本子区域获取模块830及文本区域确定模块840。各个程序模块中包括计算机可读指令,计算机可读指令用于使计算机设备执行本说明书中描述的本申请各个实施例的文本检测方法中的步骤,计算机设备中的处理器能够调用计算机设备的非易失性存储介质中存储的文本检测装置的各个程序模块,运行对应的可读指令,实现本说明书中文本检测装置的各个模块对应的功能。例如,计算机设备可以通过如图8所示的文本检测装置中的获取模块810获取待检测图像,通过特征矩阵生成模块820将待检测图像输入至神经网络模型,输出目标特征矩阵,通过文本子区域获取模块 830将目标特征矩阵输入至全连接层,全连接层根据预设锚点区域将目标特征矩阵的各个元素映射到待检测图像对应的预测图像子区域,并通过文本区域确定模块获取预测图像子区域的文本特征信息,根据预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定待检测图像对应的文本区域。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机指令来指令相关的硬件来完成,所述的计算机指令可存储于一非易失性计算机可读取存储介质中,该计算机指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (24)

  1. 一种文本检测方法,所述方法包括:
    计算机设备获取待检测图像;
    所述计算机设备将所述待检测图像输入至神经网络模型,输出目标特征矩阵;
    所述计算机设备将所述目标特征矩阵输入至全连接层,所述全连接层根据预设锚点区域将所述目标特征矩阵的各个元素映射到所述待检测图像对应的预测图像子区域;及
    所述计算机设备获取所述预测图像子区域的文本特征信息,根据所述预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定所述待检测图像对应的文本区域。
  2. 根据权利要求1所述的方法,其特征在于,所述计算机设备将所述待检测图像输入至神经网络模型,输出目标特征矩阵包括:
    所述计算机设备对所述待检测图像进行特征提取得到第一特征矩阵,所述第一特征矩阵中的元素为二维元素;
    所述计算机设备将所述第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵;
    所述计算机设备将所述前向特征矩阵和后向特征矩阵拼接得到所述目标特征矩阵。
  3. 根据权利要求2所述的方法,其特征在于,所述计算机设备将所述第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵,包括:
    所述计算机设备获取当前滑窗矩阵的当前位置,根据当前位置计算所述当前滑窗矩阵与所述第一特征矩阵的当前卷积结果,所述当前滑窗矩阵包括前向滑窗矩阵和后向滑窗矩阵;
    所述计算机设备利用激活函数根据所述当前卷积结果与当前滑窗矩阵的 前一个位置对应的长短期记忆网络模型的内部状态值得到所述长短期记忆网络模型当前位置对应的内部状态值;
    所述计算机设备滑动当前滑窗矩阵得到下一个位置,进入所述获取当前滑窗矩阵的当前位置的步骤,直至所述当前滑窗矩阵遍历所述第一特征矩阵的元素;
    所述计算机设备将各个当前滑窗矩阵在不同位置对应的内部状态值进行处理生成当前特征矩阵。
  4. 根据权利要求1所述的方法,其特征在于,所述预设锚点区域的宽度值为固定值;所述计算机设备获取所述预测图像子区域的文本特征信息包括:
    所述计算机设备根据所述预设锚点区域的宽度值和所述目标特征矩阵的各个元素对应的第一维度坐标获取各个预测图像子区域的水平位置;
    所述计算机设备获取各个预测图像子区域的竖直方向预测偏移量,根据所述竖直方向预测偏移量、对应的预设锚点区域的高度值和中心坐标数值分量进行计算,分别得到各个预测图像子区域对应的预测高度值和中心点竖直方向实际偏移量;
    所述计算机设备根据所述水平位置、预测高度值和中心点竖直方向实际偏移量确定各个预测图像子区域的文本位置信息。
  5. 根据权利要求1所述的方法,其特征在于,所述文本特征信息包括文本位置信息;所述根据所述预测图像子区域的文本特征信息和预设文本聚类算法将预测图像子区域连接成对应的预测文本行,包括:
    所述计算机设备将各个预测图像子区域作为候选文本子区域,获取当前候选文本子区域对应的第一文本位置信息;
    所述计算机设备根据所述第一文本位置信息获取与所述当前候选文本子区域的距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域,将距离所述当前候选文本子区域最近的所述目标候选文本子区域作为相邻候选文本子区域;
    所述计算机设备获取所述当前候选文本子区域对应的下一个候选文本子 区域作为当前候选文本子区域,进入所述获取当前候选文本子区域对应的第一文本位置信息的步骤,直至遍历候选文本子区域;
    所述计算机设备将候选文本子区域与对应的相邻候选文本子区域连接成对应的预测文本行。
  6. 根据权利要求5所述的方法,其特征在于,所述文本特征信息包括文本置信度;所述计算机设备将各个预测图像子区域作为候选文本子区域包括:
    所述计算机设备获取各个预测图像子区域对应的文本置信度;
    所述计算机设备根据所述文本置信度对各个预测图像子区域进行非极大值抑制,得到文本置信度大于预设文本置信度的预测图像子区域作为候选文本子区域。
  7. 根据权利要求1所述的方法,其特征在于,所述根据所述预测图像子区域的文本特征信息和预设文本聚类算法将预测图像子区域连接成对应的预测文本行之后,还包括:
    所述计算机设备获取各个预测文本行对应的预测水平方向偏移量,根据所述预测水平边界偏移量修正所述预测文本行的水平边界。
  8. 根据权利要求1所述的方法,其特征在于,在所述计算机设备获取待检测图像之前还包括:
    所述计算机设备获取模型训练数据,所述模型训练数据包括预设尺寸比例的样本图像区域集合;
    所述计算机设备对所述样本图像区域集合进行特征提取并输入至初始化神经网络模型,所述初始化神经网络模型通过预设均值和方差的高斯分布随机数初始化神经网络模型得到;
    所述计算机设备获取初始化神经网络模型输出的特征矩阵,将特征矩阵通过全连接层映射到对应的样本图像区域得到对应的样本图像子区域;
    所述计算机设备获取各个样本图像子区域对应的文本特征信息,并根据文本特征信息和预设文本聚类算法得到预测文本行;
    所述计算机设备重复进入获取模型训练数据的步骤,根据预设势能项和 预设权重衰减值对文本检测模型进行优化训练,根据目标优化函数得到目标文本检测模型。
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:
    获取待检测图像;
    将所述待检测图像输入至神经网络模型,输出目标特征矩阵;
    将所述目标特征矩阵输入至全连接层,所述全连接层根据预设锚点区域将所述目标特征矩阵的各个元素映射到所述待检测图像对应的预测图像子区域;及
    获取所述预测图像子区域的文本特征信息,根据所述预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定所述待检测图像对应的文本区域。
  10. 根据权利要求9所述的计算机设备,其特征在于,所述将所述待检测图像输入至神经网络模型,输出目标特征矩阵包括:
    对所述待检测图像进行特征提取得到第一特征矩阵,所述第一特征矩阵中的元素为二维元素;
    将所述第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵;
    将所述前向特征矩阵和后向特征矩阵拼接得到所述目标特征矩阵。
  11. 根据权利要求10所述的计算机设备,其特征在于,所述将所述第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵包括:
    获取当前滑窗矩阵的当前位置,根据当前位置计算所述当前滑窗矩阵与所述第一特征矩阵的当前卷积结果,所述当前滑窗矩阵包括前向滑窗矩阵和后向滑窗矩阵;
    利用激活函数根据所述当前卷积结果与当前滑窗矩阵的前一个位置对应 的长短期记忆网络模型的内部状态值得到所述长短期记忆网络模型当前位置对应的内部状态值;
    滑动当前滑窗矩阵得到下一个位置,进入所述获取当前滑窗矩阵的当前位置的步骤,直至所述当前滑窗矩阵遍历所述第一特征矩阵的元素;
    将各个当前滑窗矩阵在不同位置对应的内部状态值进行处理生成当前特征矩阵。
  12. 根据权利要求9所述的计算机设备,其特征在于,所述预设锚点区域的宽度值为固定值;所述获取所述预测图像子区域的文本特征信息包括:
    根据所述预设锚点区域的宽度值和所述目标特征矩阵的各个元素对应的第一维度坐标获取各个预测图像子区域的水平位置;
    获取各个预测图像子区域的竖直方向预测偏移量,根据所述竖直方向预测偏移量、对应的预设锚点区域的高度值和中心坐标数值分量进行计算,分别得到各个预测图像子区域对应的预测高度值和中心点竖直方向实际偏移量;
    根据所述水平位置、预测高度值和中心点竖直方向实际偏移量确定各个预测图像子区域的文本位置信息。
  13. 根据权利要求9所述的计算机设备,其特征在于,所述文本特征信息包括文本位置信息;所述根据所述预测图像子区域的文本特征信息和预设文本聚类算法将预测图像子区域连接成对应的预测文本行包括:
    将各个预测图像子区域作为候选文本子区域,获取当前候选文本子区域对应的第一文本位置信息;
    根据所述第一文本位置信息获取与所述当前候选文本子区域的距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域,将距离所述当前候选文本子区域最近的所述目标候选文本子区域作为相邻候选文本子区域;
    获取所述当前候选文本子区域对应的下一个候选文本子区域作为当前候选文本子区域,进入所述获取当前候选文本子区域对应的第一文本位置信息 的步骤,直至遍历候选文本子区域;
    将候选文本子区域与对应的相邻候选文本子区域连接成对应的预测文本行。
  14. 根据权利要求13所述的计算机设备,其特征在于,所述文本特征信息包括文本置信度;所述将各个预测图像子区域作为候选文本子区域包括:
    获取各个预测图像子区域对应的文本置信度;
    根据所述文本置信度对各个预测图像子区域进行非极大值抑制,得到文本置信度大于预设文本置信度的预测图像子区域作为候选文本子区域。
  15. 根据权利要求9所述的计算机设备,其特征在于,所述根据所述预测图像子区域的文本特征信息和预设文本聚类算法将预测图像子区域连接成对应的预测文本行之后,所述计算机可读指令还使得所述处理器执行如下步骤:
    获取各个预测文本行对应的预测水平方向偏移量,根据所述预测水平边界偏移量修正所述预测文本行的水平边界。
  16. 根据权利要求9所述的计算机设备,其特征在于,在所述获取待检测图像之前,所述计算机可读指令还使得所述处理器执行如下步骤:
    获取模型训练数据,所述模型训练数据包括预设尺寸比例的样本图像区域集合;
    对所述样本图像区域集合进行特征提取并输入至初始化神经网络模型,所述初始化神经网络模型通过预设均值和方差的高斯分布随机数初始化神经网络模型得到;
    获取初始化神经网络模型输出的特征矩阵,将特征矩阵通过全连接层映射到对应的样本图像区域得到对应的样本图像子区域;
    获取各个样本图像子区域对应的文本特征信息,并根据文本特征信息和预设文本聚类算法得到预测文本行;
    重复进入获取模型训练数据的步骤,根据预设势能项和预设权重衰减值对文本检测模型进行优化训练,根据目标优化函数得到目标文本检测模型。
  17. 一个或多个存储有计算机可读指令的非易失性存储介质,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行如下步骤:
    获取待检测图像;
    将所述待检测图像输入至神经网络模型,输出目标特征矩阵;
    将所述目标特征矩阵输入至全连接层,所述全连接层根据预设锚点区域将所述目标特征矩阵的各个元素映射到所述待检测图像对应的预测图像子区域;及
    获取所述预测图像子区域的文本特征信息,根据所述预测图像子区域的文本特征信息通过文本聚类算法将预测图像子区域连接成对应的预测文本行,确定所述待检测图像对应的文本区域。
  18. 根据权利要求17所述的计算机存储介质,其特征在于,所述将所述待检测图像输入至神经网络模型,输出目标特征矩阵包括:
    对所述待检测图像进行特征提取得到第一特征矩阵,所述第一特征矩阵中的元素为二维元素;
    将所述第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵;
    将所述前向特征矩阵和后向特征矩阵拼接得到所述目标特征矩阵。
  19. 根据权利要求18所述的计算机存储介质,其特征在于,所述将所述第一特征矩阵输入双向长短期记忆网络模型得到前向特征矩阵和后向特征矩阵包括:
    获取当前滑窗矩阵的当前位置,根据当前位置计算所述当前滑窗矩阵与所述第一特征矩阵的当前卷积结果,所述当前滑窗矩阵包括前向滑窗矩阵和后向滑窗矩阵;
    利用激活函数根据所述当前卷积结果与当前滑窗矩阵的前一个位置对应的长短期记忆网络模型的内部状态值得到所述长短期记忆网络模型当前位置对应的内部状态值;
    滑动当前滑窗矩阵得到下一个位置,进入所述获取当前滑窗矩阵的当前位置的步骤,直至所述当前滑窗矩阵遍历所述第一特征矩阵的元素;
    将各个当前滑窗矩阵在不同位置对应的内部状态值进行处理生成当前特征矩阵。
  20. 根据权利要求17所述的计算机存储介质,其特征在于,所述预设锚点区域的宽度值为固定值;所述获取所述预测图像子区域的文本特征信息包括:
    根据所述预设锚点区域的宽度值和所述目标特征矩阵的各个元素对应的第一维度坐标获取各个预测图像子区域的水平位置;
    获取各个预测图像子区域的竖直方向预测偏移量,根据所述竖直方向预测偏移量、对应的预设锚点区域的高度值和中心坐标数值分量进行计算,分别得到各个预测图像子区域对应的预测高度值和中心点竖直方向实际偏移量;
    根据所述水平位置、预测高度值和中心点竖直方向实际偏移量确定各个预测图像子区域的文本位置信息。
  21. 根据权利要求17所述的计算机存储介质,其特征在于,所述文本特征信息包括文本位置信息;所述根据所述预测图像子区域的文本特征信息和预设文本聚类算法将预测图像子区域连接成对应的预测文本行包括:
    将各个预测图像子区域作为候选文本子区域,获取当前候选文本子区域对应的第一文本位置信息;
    根据所述第一文本位置信息获取与所述当前候选文本子区域的距离小于预设距离阈值且竖直方向重叠度大于预设重叠度的目标候选文本子区域,将距离所述当前候选文本子区域最近的所述目标候选文本子区域作为相邻候选文本子区域;
    获取所述当前候选文本子区域对应的下一个候选文本子区域作为当前候选文本子区域,进入所述获取当前候选文本子区域对应的第一文本位置信息的步骤,直至遍历候选文本子区域;
    将候选文本子区域与对应的相邻候选文本子区域连接成对应的预测文本行。
  22. 根据权利要求21所述的计算机存储介质,其特征在于,所述文本特征信息包括文本置信度;所述将各个预测图像子区域作为候选文本子区域包括:
    获取各个预测图像子区域对应的文本置信度;
    根据所述文本置信度对各个预测图像子区域进行非极大值抑制,得到文本置信度大于预设文本置信度的预测图像子区域作为候选文本子区域。
  23. 根据权利要求17所述的计算机存储介质,其特征在于,所述根据所述预测图像子区域的文本特征信息和预设文本聚类算法将预测图像子区域连接成对应的预测文本行之后,所述计算机可读指令还使得所述处理器执行如下步骤:
    获取各个预测文本行对应的预测水平方向偏移量,根据所述预测水平边界偏移量修正所述预测文本行的水平边界。
  24. 根据权利要求17所述的计算机存储介质,其特征在于,在所述获取待检测图像之前,所述计算机可读指令还使得所述处理器执行如下步骤:
    获取模型训练数据,所述模型训练数据包括预设尺寸比例的样本图像区域集合;
    对所述样本图像区域集合进行特征提取并输入至初始化神经网络模型,所述初始化神经网络模型通过预设均值和方差的高斯分布随机数初始化神经网络模型得到;
    获取初始化神经网络模型输出的特征矩阵,将特征矩阵通过全连接层映射到对应的样本图像区域得到对应的样本图像子区域;
    获取各个样本图像子区域对应的文本特征信息,并根据文本特征信息和预设文本聚类算法得到预测文本行;
    重复进入获取模型训练数据的步骤,根据预设势能项和预设权重衰减值对文本检测模型进行优化训练,根据目标优化函数得到目标文本检测模型。
PCT/CN2018/107032 2017-09-25 2018-09-21 文本检测方法、存储介质和计算机设备 WO2019057169A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/572,171 US11030471B2 (en) 2017-09-25 2019-09-16 Text detection method, storage medium, and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710874973.1A CN108304761A (zh) 2017-09-25 2017-09-25 文本检测方法、装置、存储介质和计算机设备
CN201710874973.1 2017-09-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/572,171 Continuation US11030471B2 (en) 2017-09-25 2019-09-16 Text detection method, storage medium, and computer device

Publications (1)

Publication Number Publication Date
WO2019057169A1 true WO2019057169A1 (zh) 2019-03-28

Family

ID=62869408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/107032 WO2019057169A1 (zh) 2017-09-25 2018-09-21 文本检测方法、存储介质和计算机设备

Country Status (3)

Country Link
US (1) US11030471B2 (zh)
CN (1) CN108304761A (zh)
WO (1) WO2019057169A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291754A (zh) * 2020-01-22 2020-06-16 广州图匠数据科技有限公司 一种文本级联检测方法、装置及存储介质
CN111291672A (zh) * 2020-01-22 2020-06-16 广州图匠数据科技有限公司 一种联合图像文本识别和模糊判断方法、装置及存储介质
CN111508019A (zh) * 2020-03-11 2020-08-07 上海商汤智能科技有限公司 目标检测方法及其模型的训练方法及相关装置、设备
CN112101344A (zh) * 2020-08-25 2020-12-18 腾讯科技(深圳)有限公司 一种视频文本跟踪方法及装置
CN112329849A (zh) * 2020-11-04 2021-02-05 中冶赛迪重庆信息技术有限公司 基于机器视觉的废钢料场卸料状态识别方法、介质及终端

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (zh) 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 文本检测方法、装置、存储介质和计算机设备
CN110796129A (zh) * 2018-08-03 2020-02-14 珠海格力电器股份有限公司 一种文本行区域检测方法及装置
CN111144400B (zh) * 2018-11-06 2024-03-29 北京金山云网络技术有限公司 身份证信息的识别方法、装置、终端设备及存储介质
CN111222589B (zh) * 2018-11-27 2023-07-18 中国移动通信集团辽宁有限公司 图像文本识别方法、装置、设备及计算机存储介质
KR20200067631A (ko) * 2018-12-04 2020-06-12 삼성전자주식회사 영상 처리 장치 및 그 동작방법
CN109711406A (zh) * 2018-12-25 2019-05-03 中南大学 一种基于多尺度旋转锚点机制的多方向图像文本检测方法
CN109740482A (zh) * 2018-12-26 2019-05-10 北京科技大学 一种图像文本识别方法和装置
CN109886264A (zh) * 2019-01-08 2019-06-14 深圳禾思众成科技有限公司 一种文字检测方法、设备及计算机可读存储介质
CN109886330B (zh) * 2019-02-18 2020-11-27 腾讯科技(深圳)有限公司 文本检测方法、装置、计算机可读存储介质和计算机设备
CN110046616B (zh) * 2019-03-04 2021-05-25 北京奇艺世纪科技有限公司 图像处理模型生成、图像处理方法、装置、终端设备及存储介质
CN110197179B (zh) * 2019-03-14 2020-11-10 北京三快在线科技有限公司 识别卡号的方法和装置、存储介质及电子设备
CN110163202A (zh) * 2019-04-03 2019-08-23 平安科技(深圳)有限公司 文字区域的定位方法、装置、终端设备及介质
CN111783756B (zh) * 2019-04-03 2024-04-16 北京市商汤科技开发有限公司 文本识别方法及装置、电子设备和存储介质
CN110428504B (zh) * 2019-07-12 2023-06-27 北京旷视科技有限公司 文本图像合成方法、装置、计算机设备和存储介质
CN111104846B (zh) * 2019-10-16 2022-08-30 平安科技(深圳)有限公司 数据检测方法、装置、计算机设备和存储介质
CN110852229A (zh) * 2019-11-04 2020-02-28 泰康保险集团股份有限公司 图像中文本区域的位置确定方法、装置、设备及存储介质
CN112861836B (zh) * 2019-11-28 2022-04-22 马上消费金融股份有限公司 文本图像处理方法、文本及卡证图像质量评价方法和装置
CN111310762A (zh) * 2020-03-16 2020-06-19 天津得迈科技有限公司 一种基于物联网的智能医疗票据识别方法
CN111461182B (zh) * 2020-03-18 2023-04-18 北京小米松果电子有限公司 图像处理方法、图像处理装置及存储介质
CN111401264A (zh) * 2020-03-19 2020-07-10 上海眼控科技股份有限公司 车辆目标检测方法、装置、计算机设备和存储介质
CN111582021A (zh) * 2020-03-26 2020-08-25 平安科技(深圳)有限公司 场景图像中的文本检测方法、装置及计算机设备
CN111444850B (zh) * 2020-03-27 2023-11-14 北京爱笔科技有限公司 一种图片检测的方法和相关装置
CN113536831A (zh) * 2020-04-13 2021-10-22 北京沃东天骏信息技术有限公司 基于图像识别的助读方法、装置、设备和计算机可读介质
CN111539309A (zh) * 2020-04-21 2020-08-14 广州云从鼎望科技有限公司 一种基于ocr的数据处理方法、系统、平台、设备及介质
CN111582265A (zh) * 2020-05-14 2020-08-25 上海商汤智能科技有限公司 一种文本检测方法及装置、电子设备和存储介质
CN111832616A (zh) * 2020-06-04 2020-10-27 中国科学院空天信息创新研究院 利用多类深度表示图的遥感图像飞机型号识别方法及系统
CN111666941B (zh) * 2020-06-12 2024-03-29 北京达佳互联信息技术有限公司 一种文本检测方法、装置及电子设备
CN111832491A (zh) * 2020-07-16 2020-10-27 Oppo广东移动通信有限公司 文本检测方法、装置及处理设备
CN112926372B (zh) * 2020-08-22 2023-03-10 清华大学 基于序列变形的场景文字检测方法及系统
CN112199526B (zh) * 2020-09-30 2023-03-14 抖音视界有限公司 一种多媒体内容发布的方法、装置、电子设备及存储介质
CN112363918B (zh) * 2020-11-02 2024-03-08 北京云聚智慧科技有限公司 用户界面ai自动化测试方法、装置、设备和存储介质
US20220147843A1 (en) * 2020-11-12 2022-05-12 Samsung Electronics Co., Ltd. On-device knowledge extraction from visually rich documents
CN112232305A (zh) * 2020-11-19 2021-01-15 中国银联股份有限公司 图像检测方法、装置、电子设备及介质
CN112686114A (zh) * 2020-12-23 2021-04-20 杭州海康威视数字技术股份有限公司 一种行为检测方法、装置及设备
CN112749978B (zh) * 2020-12-31 2024-02-06 百度在线网络技术(北京)有限公司 检测方法、装置、设备、存储介质以及程序产品
CN112926564A (zh) * 2021-02-25 2021-06-08 中国平安人寿保险股份有限公司 图片分析方法、系统、计算机设备和计算机可读存储介质
CN113112406B (zh) * 2021-04-12 2023-01-31 山东迈科显微生物科技有限公司 一种特征确定方法、装置、电子设备以及存储介质
CN112990204B (zh) * 2021-05-11 2021-08-24 北京世纪好未来教育科技有限公司 目标检测方法、装置、电子设备及存储介质
CN115410207B (zh) * 2021-05-28 2023-08-29 国家计算机网络与信息安全管理中心天津分中心 一种针对竖排文本的检测方法及装置
CN113887535B (zh) * 2021-12-03 2022-04-12 北京世纪好未来教育科技有限公司 模型训练方法、文本识别方法、装置、设备和介质
CN115631493B (zh) * 2022-11-04 2023-05-09 金蝶软件(中国)有限公司 文本区确定方法、系统及相关装置
CN115546790B (zh) * 2022-11-29 2023-04-07 深圳智能思创科技有限公司 文档版面分割方法、装置、设备及存储介质
CN116341640B (zh) * 2023-05-29 2023-08-11 深圳须弥云图空间科技有限公司 文本处理模型训练方法及装置
CN116740740B (zh) * 2023-08-11 2023-11-21 浙江太美医疗科技股份有限公司 同行文本判定方法、文档排序方法及其应用

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011279A1 (en) * 2015-07-07 2017-01-12 Xerox Corporation Latent embeddings for word images and their semantics
CN106897732A (zh) * 2017-01-06 2017-06-27 华中科技大学 一种基于连接文字段的自然图片中多方向文本检测方法
CN108304761A (zh) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 文本检测方法、装置、存储介质和计算机设备

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5555317A (en) * 1992-08-18 1996-09-10 Eastman Kodak Company Supervised training augmented polynomial method and apparatus for character recognition
US7164797B2 (en) * 2002-04-25 2007-01-16 Microsoft Corporation Clustering
CN1459761B (zh) * 2002-05-24 2010-04-21 清华大学 基于Gabor滤波器组的字符识别技术
US7570816B2 (en) * 2005-03-31 2009-08-04 Microsoft Corporation Systems and methods for detecting text
US8775341B1 (en) * 2010-10-26 2014-07-08 Michael Lamport Commons Intelligent control with hierarchical stacked neural networks
US9058644B2 (en) * 2013-03-13 2015-06-16 Amazon Technologies, Inc. Local image enhancement for text recognition
US8965127B2 (en) * 2013-03-14 2015-02-24 Konica Minolta Laboratory U.S.A., Inc. Method for segmenting text words in document images
CN104298982B (zh) * 2013-07-16 2019-03-08 深圳市腾讯计算机系统有限公司 一种文字识别方法及装置
US9245191B2 (en) * 2013-09-05 2016-01-26 Ebay, Inc. System and method for scene text recognition
WO2016054778A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Generic object detection in images
CN105868758B (zh) * 2015-01-21 2019-12-17 阿里巴巴集团控股有限公司 图像中文本区域检测方法、装置及电子设备
US10043231B2 (en) * 2015-06-30 2018-08-07 Oath Inc. Methods and systems for detecting and recognizing text from images
CN106599900B (zh) * 2015-10-20 2020-04-21 华中科技大学 一种识别图像中的字符串的方法和装置
CN106384112A (zh) * 2016-09-08 2017-02-08 西安电子科技大学 基于多通道多尺度与级联过滤器的快速图像文本检测方法
CN106570497A (zh) * 2016-10-08 2017-04-19 中国科学院深圳先进技术研究院 一种场景图像的文本检测方法和装置
CN108171104B (zh) * 2016-12-08 2022-05-10 腾讯科技(深圳)有限公司 一种文字检测方法及装置
US10430649B2 (en) * 2017-07-14 2019-10-01 Adobe Inc. Text region detection in digital images using image tag filtering

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011279A1 (en) * 2015-07-07 2017-01-12 Xerox Corporation Latent embeddings for word images and their semantics
CN106897732A (zh) * 2017-01-06 2017-06-27 华中科技大学 一种基于连接文字段的自然图片中多方向文本检测方法
CN108304761A (zh) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 文本检测方法、装置、存储介质和计算机设备

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291754A (zh) * 2020-01-22 2020-06-16 广州图匠数据科技有限公司 一种文本级联检测方法、装置及存储介质
CN111291672A (zh) * 2020-01-22 2020-06-16 广州图匠数据科技有限公司 一种联合图像文本识别和模糊判断方法、装置及存储介质
CN111291672B (zh) * 2020-01-22 2023-05-12 广州图匠数据科技有限公司 一种联合图像文本识别和模糊判断方法、装置及存储介质
CN111291754B (zh) * 2020-01-22 2023-05-12 广州图匠数据科技有限公司 一种文本级联检测方法、装置及存储介质
CN111508019A (zh) * 2020-03-11 2020-08-07 上海商汤智能科技有限公司 目标检测方法及其模型的训练方法及相关装置、设备
CN112101344A (zh) * 2020-08-25 2020-12-18 腾讯科技(深圳)有限公司 一种视频文本跟踪方法及装置
CN112329849A (zh) * 2020-11-04 2021-02-05 中冶赛迪重庆信息技术有限公司 基于机器视觉的废钢料场卸料状态识别方法、介质及终端

Also Published As

Publication number Publication date
CN108304761A (zh) 2018-07-20
US11030471B2 (en) 2021-06-08
US20200012876A1 (en) 2020-01-09

Similar Documents

Publication Publication Date Title
WO2019057169A1 (zh) 文本检测方法、存储介质和计算机设备
WO2022213879A1 (zh) 目标对象检测方法、装置、计算机设备和存储介质
US11798174B2 (en) Method, device, equipment and storage medium for locating tracked targets
WO2020238560A1 (zh) 视频目标跟踪方法、装置、计算机设备及存储介质
Yu et al. Vision-based concrete crack detection using a hybrid framework considering noise effect
CN106204522B (zh) 对单个图像的联合深度估计和语义标注
WO2020000390A1 (en) Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks
US20190347806A1 (en) Video object tracking
WO2022142450A1 (zh) 用于图像分割模型训练和图像分割的方法及装置
Zhang et al. Deep learning for detecting building façade elements from images considering prior knowledge
US11768876B2 (en) Method and device for visual question answering, computer apparatus and medium
CN107886082B (zh) 图像中数学公式检测方法、装置、计算机设备及存储介质
US20230055146A1 (en) Methods for recognizing small targets based on deep learning networks
US20220277514A1 (en) Reconstructing three-dimensional scenes portrayed in digital images utilizing point cloud machine-learning models
US11544498B2 (en) Training neural networks using consistency measures
WO2019117970A1 (en) Adaptive object tracking policy
US10438088B2 (en) Visual-saliency driven scene description
CN109544516B (zh) 图像检测方法及装置
CN113012169A (zh) 一种基于非局部注意力机制的全自动抠图方法
CN111080697B (zh) 检测目标对象方向的方法、装置、计算机设备和存储介质
CN109448018B (zh) 跟踪目标的定位方法、装置、设备及存储介质
CN117422934A (zh) 异常细胞检测方法、系统、计算机设备及存储介质
US20200394808A1 (en) Aligning digital images by selectively applying pixel-adjusted-gyroscope alignment and feature-based alignment models
US11961249B2 (en) Generating stereo-based dense depth images
Zou et al. YOLOv7‐EAS: A Small Target Detection of Camera Module Surface Based on Improved YOLOv7

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18858149

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18858149

Country of ref document: EP

Kind code of ref document: A1