WO2022105120A1 - Method and apparatus for detecting text from an image, computer device and storage medium - Google Patents

Method and apparatus for detecting text from an image, computer device and storage medium

Info

Publication number
WO2022105120A1
WO2022105120A1 (PCT/CN2021/090512)
Authority
WO
WIPO (PCT)
Prior art keywords
text
picture
coordinates
target detection
preset
Prior art date
Application number
PCT/CN2021/090512
Other languages
English (en)
Chinese (zh)
Inventor
左彬靖
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022105120A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for detecting text in pictures.
  • text detection methods based on target detection technology are used in more and more fields, such as Alipay code scanning, ID card recognition and so on.
  • the information in the picture can be extracted.
  • existing approaches include algorithms based on FPN (feature pyramid network);
  • pixel-level algorithms have relatively high accuracy, but the processing time of the model is long, which makes it difficult to meet the needs of industrialization.
  • picture text detection mainly extracts useful information from the picture, such as name, address, account information and other fields, so as to facilitate subsequent storage of these parameters and to provide data for a subsequent risk control system.
  • a picture may contain a lot of information, and a more complex picture may contain more than one hundred fields.
  • the purpose of the embodiments of the present application is to provide a method, apparatus, computer device and storage medium for detecting text in pictures, so as to solve the technical problem of low efficiency in detecting text in pictures.
  • the embodiment of the present application provides a method for detecting text in pictures, which adopts the following technical solutions:
  • when receiving the target detection picture, calculating the complexity of the target detection picture according to a preset detection model;
  • when the complexity is low, obtaining the feature vector of the target detection picture according to the first labeling model among preset labeling models, and calculating, according to the feature vector, the target text coordinates of the first text boxes in the target detection picture;
  • calculating the center coordinates of the first text boxes according to the target text coordinates, fusing the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determining the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
  • extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detected text of the target detection picture.
  • the embodiments of the present application also provide a picture and text detection device, which adopts the following technical solutions:
  • a detection module configured to calculate the complexity of the target detection picture according to a preset detection model when the target detection picture is received
  • a labeling module configured to, when the complexity is low, obtain the feature vector of the target detection picture according to the first labeling model among the preset labeling models, and calculate, according to the feature vector, the target text coordinates of the first text boxes in the target detection picture;
  • a confirmation module configured to calculate the center coordinates of the first text boxes according to the target text coordinates, fuse the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determine the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
  • an extraction module configured to extract the text information in the new text box and the fixed text boxes, and determine the text information as the detected text of the target detection picture.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • when receiving the target detection picture, calculating the complexity of the target detection picture according to a preset detection model;
  • when the complexity is low, obtaining the feature vector of the target detection picture according to the first labeling model among preset labeling models, and calculating, according to the feature vector, the target text coordinates of the first text boxes in the target detection picture;
  • calculating the center coordinates of the first text boxes according to the target text coordinates, fusing the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determining the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
  • extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detected text of the target detection picture.
  • an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the processor performs the following steps:
  • when receiving the target detection picture, calculating the complexity of the target detection picture according to a preset detection model;
  • when the complexity is low, obtaining the feature vector of the target detection picture according to the first labeling model among preset labeling models, and calculating, according to the feature vector, the target text coordinates of the first text boxes in the target detection picture;
  • calculating the center coordinates of the first text boxes according to the target text coordinates, fusing the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determining the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
  • extracting the text information in the new text box and the fixed text boxes, and determining the text information as the detected text of the target detection picture.
  • in the above solutions, the complexity of the target detection picture is calculated according to the preset detection model, so that a suitable labeling model can be selected for the target detection picture according to its complexity, enabling targeted detection of the picture;
  • the feature vector of the target detection picture is obtained according to the first labeling model among the preset labeling models, and the target text coordinates of the first text boxes in the target detection picture are calculated according to the feature vector, so that the text information of the target detection picture can be accurately located;
  • then, the center coordinates of the first text boxes are calculated according to the target text coordinates, the first text boxes whose center coordinates differ by no more than the preset error value are fused into a new text box, and the first text boxes whose center coordinates differ by more than the preset error value are determined as fixed text boxes, thereby avoiding wrong splitting of text in low-complexity pictures and improving the accuracy of picture text detection;
  • finally, the text information in the new text box and the fixed text boxes is extracted and determined as the detected text of the target detection picture, which realizes text detection for pictures of different complexity, reduces the cost of manual annotation, saves the response time of model processing, and further improves the efficiency and accuracy of picture text detection.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a picture text detection method according to the present application;
  • FIG. 3 is a schematic structural diagram of an embodiment of a picture text detection apparatus according to the present application;
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • Reference numerals: picture text detection apparatus 300, detection module 301, labeling module 302, confirmation module 303, and extraction module 304.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the image and text detection methods provided in the embodiments of the present application are generally executed by a server/terminal device, and correspondingly, the image and text detection apparatus is generally set in the server/terminal device.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the image and text detection method includes the following steps:
  • Step S201: when receiving the target detection picture, calculate the complexity of the target detection picture according to a preset detection model.
  • the target detection picture is a detection picture including target text
  • the complexity of the target detection picture is calculated according to a preset detection model
  • the preset detection model is a preset picture complexity detection model, such as a lightweight convolutional neural network discriminant model based on VGG16.
  • specifically, the target detection picture is input into the preset detection model, the length, width and number of channels of the target detection picture are processed by the convolution layer, pooling layer and fully connected layer of the preset detection model, and a detection result value of the target detection picture is output; the detection result value is then evaluated with a binary classification loss function, which yields the complexity of the current target detection picture.
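  • as a loose illustration of this step, the following is a minimal sketch of such a complexity classifier; the trimmed VGG-style layer sizes, the 224×224 input and the 0.5 preset value are assumptions for illustration, not taken from the patent:

```python
# Minimal sketch of the "preset detection model": a lightweight VGG16-style
# binary classifier whose sigmoid output p in (0, 1) is read as the picture
# complexity. Layer sizes, input resolution and threshold are assumptions.
import torch
import torch.nn as nn

class ComplexityDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # a heavily trimmed VGG-like stack of convolution and pooling blocks
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, 1),  # a single logit for the two-class decision
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); the picture's length, width
        # and channel number enter through the convolution/pooling stack
        return torch.sigmoid(self.classifier(self.features(x)))

model = ComplexityDetector()
picture = torch.rand(1, 3, 224, 224)   # a dummy target detection picture
p = model(picture).item()              # detection result value -> complexity
is_low_complexity = p <= 0.5           # 0.5 stands in for the preset value
```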
  • Step S202: when the complexity is low, obtain the feature vector of the target detection picture according to the first labeling model among the preset labeling models, and calculate, according to the feature vector, the target text coordinates of the first text boxes in the target detection picture.
  • the complexity can be divided into low complexity and high complexity according to a preset value: a complexity less than or equal to the preset value is low complexity, and a complexity greater than the preset value is high complexity.
  • the preset labeling model is a preset text-coordinate detection model and includes a first labeling model and a second labeling model; detecting a low-complexity target detection picture with the first labeling model yields the target text coordinates of the picture, and detecting a high-complexity target detection picture with the second labeling model yields the detected text coordinates of the picture.
  • the detected texts of the low-complexity and high-complexity target detection images can be obtained respectively.
  • the coordinates of the target text and the coordinates of the detected text are composed of the coordinates of the lower left corner, the lower right corner, the upper left corner and the upper right corner of each text box in the target detection picture.
  • specifically, a feature map of the target detection picture and preset detection feature boxes are acquired; the feature map and the detection feature boxes are processed by the first labeling model to obtain the feature vector of the target detection picture; the feature vector is then passed through the bidirectional long short-term memory network, the fully connected layer and the regression layer in the first labeling model, which output the target text coordinates of the current target detection picture.
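  • the description does not name a specific architecture, but the bidirectional LSTM, fully connected and regression arrangement it describes can be sketched as follows; all dimensions and the anchor count are assumptions:

```python
# Sketch of the first labeling model's head: a bidirectional LSTM runs over
# each row of a convolutional feature map, then fully connected and
# regression layers emit per-anchor text-box coordinates.
import torch
import torch.nn as nn

class FirstLabelingHead(nn.Module):
    def __init__(self, in_channels=512, hidden=128, anchors=10):
        super().__init__()
        self.bilstm = nn.LSTM(in_channels, hidden,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 256)
        self.regress = nn.Linear(256, anchors * 4)  # 4 coordinates per anchor

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (batch, channels, H, W) feature map of the detection picture
        b, c, h, w = fmap.shape
        seq = fmap.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one row = one sequence
        out, _ = self.bilstm(seq)                  # context along the text line
        coords = self.regress(torch.relu(self.fc(out)))
        return coords.reshape(b, h, w, -1, 4)      # per-anchor coordinates

head = FirstLabelingHead()
fmap = torch.rand(1, 512, 16, 32)  # e.g. the output of a conv backbone
print(head(fmap).shape)            # torch.Size([1, 16, 32, 10, 4])
```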
  • Step S203: calculate the center coordinates of the first text boxes in the target detection picture according to the target text coordinates, fuse the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determine the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes.
  • the first text box is a text box obtained by detecting the target detection picture according to the first labeling model;
  • the center coordinates are the mean coordinates of the first text boxes in each target detection picture. Calculate the x mean value and the y mean value of the target text coordinates of each first text box in the target detection image, and use the x mean value and the y mean value as the center coordinates of the corresponding first text box.
  • once the center coordinates corresponding to each first text box are obtained, the first text boxes whose center coordinates differ by no more than the preset error value are fused into a new text box.
  • the lower-left corner of the new text box takes the minimum x value and the minimum y value of the target text coordinates in the fused first text boxes; the upper-right corner takes their maximum x value and maximum y value; the lower-right corner takes the maximum x value and the minimum y value; and the upper-left corner takes the minimum x value and the maximum y value.
  • the first text boxes whose center coordinates differ by more than the preset error value are determined as fixed text boxes.
  • Step S204: extract the text information in the new text box and the fixed text boxes, and determine the text information as the detected text of the target detection picture.
  • specifically, the text information in the new text box and the fixed text boxes is extracted and arranged in the order of the text boxes, which yields the detected text of the target detection picture.
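  • a minimal sketch of this arrangement step, assuming a top-to-bottom, left-to-right reading order that the description does not spell out:

```python
# Arrange extracted text snippets in the order of their text boxes.
# The (y, x) sort key encodes an assumed top-to-bottom, left-to-right order.
def arrange_text(boxes_with_text):
    # boxes_with_text: list of ((x_min, y_min), text) pairs
    ordered = sorted(boxes_with_text, key=lambda item: (item[0][1], item[0][0]))
    return " ".join(text for _, text in ordered)

print(arrange_text([((120, 10), "world"), ((10, 10), "hello")]))  # hello world
```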
  • the above detection text can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • This embodiment realizes text detection for pictures of different complexity, reduces the cost of manual labeling, saves the response time of model processing, and further improves the efficiency and accuracy of text detection in pictures.
  • the preset error value includes a first error value and a second error value
  • the fusing of the first text boxes whose center coordinates differ by no more than the preset error value into a new text box includes:
  • merging the first text boxes whose first pixel difference is less than or equal to the first error value and whose second pixel difference is less than or equal to the second error value into a new text box.
  • the preset error value includes a first error value and a second error value.
  • specifically, the first pixel difference between the y-axis coordinates of two adjacent center coordinates, and the second pixel difference between the x-axis coordinates of the two center coordinates, are obtained in sequence.
  • the first pixel difference value is the pixel difference value between the y-axis coordinates of the two center point coordinates
  • the second pixel difference value is the pixel difference value between the x-axis coordinates of the two center point coordinates.
  • a new text box is obtained by fusing the first text boxes whose first pixel difference between center coordinates is less than or equal to the first error value and whose second pixel difference is less than or equal to the second error value.
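  • the fusion rule can be sketched as follows; the concrete error values, the scan order and the corner representation are illustrative assumptions:

```python
# Fuse first text boxes whose center coordinates differ by at most the first
# (y-axis) and second (x-axis) error values; build each new text box from the
# min/max x and y of the fused boxes, as described above.
from typing import List, Tuple

Box = List[Tuple[float, float]]  # four (x, y) corners of a text box

def center(box: Box) -> Tuple[float, float]:
    xs, ys = zip(*box)
    return sum(xs) / len(xs), sum(ys) / len(ys)  # x mean and y mean

def fuse(boxes: List[Box], first_error=5.0, second_error=40.0) -> List[Box]:
    boxes = sorted(boxes, key=center)           # compare adjacent centers in order
    groups: List[List[Box]] = [[boxes[0]]] if boxes else []
    for box in boxes[1:]:
        cx0, cy0 = center(groups[-1][-1])
        cx1, cy1 = center(box)
        if abs(cy1 - cy0) <= first_error and abs(cx1 - cx0) <= second_error:
            groups[-1].append(box)              # fuse into the new text box
        else:
            groups.append([box])                # left alone -> fixed text box
    fused = []
    for group in groups:
        xs = [x for b in group for x, _ in b]
        ys = [y for b in group for _, y in b]
        # corners: lower-left, lower-right, upper-left, upper-right
        fused.append([(min(xs), min(ys)), (max(xs), min(ys)),
                      (min(xs), max(ys)), (max(xs), max(ys))])
    return fused
```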
  • this embodiment combines texts with small offsets by fusing their text boxes, avoids wrong splitting of text during text detection on low-complexity pictures, and further improves the accuracy of text detection in pictures.
  • after the above step of calculating the complexity of the target detection picture according to the preset detection model, the method further includes:
  • when the complexity is high, obtaining, according to the second labeling model among the preset labeling models, the minimum picture corresponding to the target detection picture and the minimum text coordinates of the second text boxes in the minimum picture;
  • mapping the minimum text coordinates in parallel onto the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture, and calculating, according to the detected text coordinates, the detected text corresponding to the target detection picture.
  • the second text box is a text box obtained by detecting the target detection picture according to the second labeling model
  • the second labeling model is a pre-trained high-complexity labeling model.
  • the minimum picture corresponding to the target detection picture is obtained according to the second labeling model; the minimum picture is the picture obtained after the target detection picture is scaled down.
  • the second labeling model can perform pixel scaling on the target detection picture to obtain the minimum picture.
  • the second text box in the minimum picture is detected based on the second labeling model, thereby obtaining the minimum text coordinates corresponding to the second text box in the minimum picture.
  • once the minimum text coordinates are obtained, they are mapped onto the maximum picture corresponding to the target detection picture; that is, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio between the minimum picture and the maximum picture, which yields the detected text coordinates; obtaining the text content at the detected text coordinates then yields the detected text of the target detection picture.
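  • a compact sketch of this high-complexity path; the 0.4 ratio follows the example given below, while the detector stub and the corner format are assumptions:

```python
# Scale the picture down by the preset mapping ratio, detect second text
# boxes on the minimum picture, then enlarge all minimum text coordinates
# in parallel back onto the maximum picture.
from PIL import Image
import numpy as np

PRESET_MAPPING_RATIO = 0.4  # example ratio from this description

def to_minimum_picture(picture: Image.Image) -> Image.Image:
    w, h = picture.size
    return picture.resize((int(w * PRESET_MAPPING_RATIO),
                           int(h * PRESET_MAPPING_RATIO)))

def detect_on_minimum_picture(minimum: Image.Image) -> np.ndarray:
    # stand-in for the second labeling model: returns (N, 4, 2) corner
    # coordinates of the second text boxes found in the minimum picture
    return np.array([[[16.0, 3.0], [48.0, 3.0], [16.0, 8.0], [48.0, 8.0]]])

def detect_high_complexity(picture: Image.Image) -> np.ndarray:
    minimum = to_minimum_picture(picture)
    min_coords = detect_on_minimum_picture(minimum)
    return min_coords / PRESET_MAPPING_RATIO  # enlarge all coordinates at once
```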
  • This embodiment uses the second labeling model to perform text detection on pictures with high complexity, thereby realizing targeted detection of text in pictures with high complexity, and further improving the detection efficiency and accuracy of pictures with high complexity.
  • the above-mentioned parallel mapping of the minimum text coordinates to the maximum picture corresponding to the target detection picture, and obtaining the detected text coordinates of the target detection picture includes:
  • a preset mapping ratio is acquired, and the minimum text coordinates are enlarged in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection image.
  • when the text coordinates corresponding to a high-complexity target detection picture are obtained, the detected text coordinates can be obtained by acquiring a preset mapping ratio and mapping the minimum text coordinates onto the maximum picture in parallel according to the preset mapping ratio.
  • specifically, the preset mapping ratio is the preset ratio by which the second labeling model scales the target detection picture, and it ranges from 0 to 1; for example, the preset mapping ratio is 0.4.
  • once the preset mapping ratio is obtained, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio, which yields the detected text coordinates of the target detection picture.
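  • for example, with a preset mapping ratio of 0.4, a minimum text coordinate of (40, 8) on the minimum picture is enlarged in parallel to (40 / 0.4, 8 / 0.4) = (100, 20) on the maximum picture.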
  • in this way, the detected text coordinates of the target detection picture are acquired accurately, so that the text information of the target detection picture can be precisely located through the detected text coordinates, avoiding the text confusion that may occur when text detection is performed on high-complexity pictures.
  • the above calculation of the complexity of the target detection picture according to the preset detection model includes:
  • the detection result value is predicted according to a preset two-class loss function to obtain the complexity of the target detection picture.
  • the preset detection model includes a convolution layer, a pooling layer, and a fully connected layer.
  • the length, width and channel number of the target detection picture are obtained.
  • once the detection result value is obtained, it is evaluated with the preset binary classification loss function, which yields the complexity of the current target detection picture.
  • the complexity can be represented by p, and the range of p is between 0 and 1.
  • the larger p is, the smaller the text in the target detection picture, the smaller the spacing between words, and the higher the complexity of the picture; the smaller p is, the larger the text, the larger the spacing between words, and the lower the complexity of the picture.
  • the complexity of the target detection picture is calculated, so as to realize the classification and detection of the target detection picture according to the complexity, and further improve the detection efficiency of the target detection picture.
  • before the above step of acquiring the feature vector of the target detection picture according to the first labeling model among the preset labeling models, the method further includes:
  • the trained basic labeling model is verified according to the test pictures, and when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to a preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
  • a basic annotation model needs to be established in advance, and the basic annotation model is trained to obtain the preset annotation model.
  • the preset labeling model includes a first labeling model and a second labeling model.
  • the first labeling model is used for processing low-complexity target detection pictures
  • the second labeling model is used for processing high-complexity target detection pictures.
  • the first labeling model and the second labeling model have different network structures, but both the first labeling model and the second labeling model can be trained by the same training method.
  • an initial text picture is obtained, where the initial text picture is a plurality of pre-collected text pictures, and the initial text picture is divided into a training picture and a test picture.
  • the initial text coordinates of the training picture are detected based on a basic labeling model, where the basic labeling model may be the network structure of the first labeling model or the network structure of the second labeling model.
  • the initial text coordinates of the training picture are obtained, and at the same time, the training picture is labelled according to the preset labeling tool, and the labelled text coordinates of the training picture are obtained.
  • the basic labeling model is trained according to the initial text coordinates and the labeling text coordinates, that is, the loss function of the basic labeling model is calculated according to the labeling text coordinates and the initial text coordinates. When the loss function converges, the trained basic labeling model is obtained.
  • finally, the trained basic labeling model is tested according to the test pictures: if the similarity between the initial text coordinates detected by the trained basic labeling model and the labeled text coordinates corresponding to a test picture is greater than or equal to a preset similarity threshold, the trained basic labeling model is determined to have passed verification on that test picture; when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to the preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
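  • the training-and-verification flow can be sketched as follows; the model, the data iterables, the similarity measure and every threshold are assumptions:

```python
# Train a basic labeling model until its squared-difference loss converges,
# then accept it as the preset labeling model only if its verification pass
# rate on the test pictures reaches the preset pass rate.
import torch
import torch.nn as nn

def similarity(pred: torch.Tensor, target: torch.Tensor) -> float:
    # one possible measure: shrinks toward 0 as the coordinate error grows
    return float(1.0 / (1.0 + torch.mean((pred - target) ** 2)))

def train_and_verify(model: nn.Module, train_set, test_set,
                     sim_threshold=0.9, pass_rate=0.95,
                     eps=1e-4, max_epochs=100) -> bool:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()               # squared difference of coordinates
    previous = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for picture, labeled_coords in train_set:
            optimizer.zero_grad()
            loss = loss_fn(model(picture), labeled_coords)
            loss.backward()
            optimizer.step()
            total += loss.item()
        if abs(previous - total) < eps:  # the loss function has converged
            break
        previous = total
    model.eval()
    with torch.no_grad():
        passed = sum(similarity(model(p), c) >= sim_threshold
                     for p, c in test_set)
    return passed / len(test_set) >= pass_rate
```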
  • the basic labeling model is trained in advance, so that the preset labeling model obtained by training can accurately detect the text of the picture, save the labeling time of the picture and text detection, and improve the efficiency of the picture and text detection.
  • the above-mentioned calculation of the loss function of the basic annotation model according to the coordinates of the annotation text includes:
  • the training picture is labeled according to a preset labeling tool, and the labeled text coordinates of the training picture are obtained. Calculate the squared difference between the initial text coordinates and the labeled text coordinates, and then calculate the loss function of the basic labeling model according to the squared difference.
  • the calculation formula of the loss function of the basic labeling model is a mean of squared differences:
  • L = (1/N) · Σₖ (ŷₖ − yₖ)²
  • where ŷₖ is the k-th initial text coordinate and yₖ is the corresponding labeled text coordinate.
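  • a small numeric sketch of this squared-difference loss, with made-up coordinates:

```python
# Mean squared difference between the model's initial text coordinates and
# the labeling tool's labeled text coordinates; all values are invented.
import numpy as np

initial = np.array([[10.0, 12.0], [50.0, 12.0]])  # from the basic labeling model
labeled = np.array([[11.0, 12.0], [49.0, 13.0]])  # from the preset labeling tool
loss = np.mean((initial - labeled) ** 2)
print(loss)  # 0.75
```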
  • the training time of the basic labeling model is saved, and the training efficiency of the basic labeling model is improved.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a picture text detection apparatus, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
  • the picture text detection apparatus 300 in this embodiment includes: a detection module 301, a labeling module 302, a confirmation module 303, and an extraction module 304, wherein:
  • a detection module 301 configured to calculate the complexity of the target detection picture according to a preset detection model when the target detection picture is received;
  • the detection module 301 includes:
  • a first computing unit configured to input the target detection picture into the convolution layer of the preset detection model and output a detection result value through the pooling layer and the fully connected layer;
  • the second calculation unit is configured to predict the detection result value according to the preset binary classification loss function, and obtain the complexity of the target detection picture.
  • the target detection picture is a detection picture including target text
  • the complexity of the target detection picture is calculated according to a preset detection model
  • the preset detection model is a preset picture complexity detection model, such as a lightweight convolutional neural network discriminant model based on VGG16.
  • specifically, the target detection picture is input into the preset detection model, the length, width and number of channels of the target detection picture are processed by the convolution layer, pooling layer and fully connected layer of the preset detection model, and a detection result value of the target detection picture is output; the detection result value is then evaluated with a binary classification loss function, which yields the complexity of the current target detection picture.
  • a labeling module 302 configured to, when the complexity is low, obtain the feature vector of the target detection picture according to the first labeling model among the preset labeling models, and calculate, according to the feature vector, the target text coordinates of the first text boxes in the target detection picture;
  • the complexity can be divided into low complexity and high complexity according to a preset value: a complexity less than or equal to the preset value is low complexity, and a complexity greater than the preset value is high complexity.
  • the preset labeling model is a preset text-coordinate detection model and includes a first labeling model and a second labeling model; detecting a low-complexity target detection picture with the first labeling model yields the target text coordinates of the picture, and detecting a high-complexity target detection picture with the second labeling model yields the detected text coordinates of the picture.
  • the detected texts of the low-complexity and high-complexity target detection images can be obtained respectively.
  • the coordinates of the target text and the coordinates of the detected text are composed of the coordinates of the lower left corner, the lower right corner, the upper left corner and the upper right corner of each text box in the target detection picture.
  • specifically, a feature map of the target detection picture and preset detection feature boxes are acquired; the feature map and the detection feature boxes are processed by the first labeling model to obtain the feature vector of the target detection picture; the feature vector is then passed through the bidirectional long short-term memory network, the fully connected layer and the regression layer in the first labeling model, which output the target text coordinates of the current target detection picture.
  • a confirmation module 303 configured to calculate the center coordinates of the first text boxes according to the target text coordinates, fuse the first text boxes whose center coordinates differ by no more than a preset error value into a new text box, and determine the first text boxes whose center coordinates differ by more than the preset error value as fixed text boxes;
  • the preset error value includes a first error value and a second error value
  • the confirmation module 303 includes:
  • an acquisition unit configured to acquire the first pixel difference between the y-axis coordinates of two adjacent center coordinates, and the second pixel difference between the x-axis coordinates of the two center coordinates;
  • a confirmation unit configured to merge the first text boxes whose first pixel difference is less than or equal to the first error value and whose second pixel difference is less than or equal to the second error value into a new text box.
  • the first text box is a text box obtained by detecting the target detection picture according to the first labeling model;
  • the center coordinates are the mean coordinates of the first text boxes in each target detection picture. Calculate the x mean value and the y mean value of the target text coordinates of each first text box in the target detection image, and use the x mean value and the y mean value as the center coordinates of the corresponding first text box.
  • once the center coordinates corresponding to each first text box are obtained, the first text boxes whose center coordinates differ by no more than the preset error value are fused into a new text box.
  • the lower-left corner of the new text box takes the minimum x value and the minimum y value of the target text coordinates in the fused first text boxes; the upper-right corner takes their maximum x value and maximum y value; the lower-right corner takes the maximum x value and the minimum y value; and the upper-left corner takes the minimum x value and the maximum y value.
  • the first text boxes whose center coordinates differ by more than the preset error value are determined as fixed text boxes.
  • an extraction module 304 configured to extract the text information in the new text box and the fixed text boxes, and determine the text information as the detected text of the target detection picture.
  • specifically, the text information in the new text box and the fixed text boxes is extracted and arranged in the order of the text boxes, which yields the detected text of the target detection picture.
  • the above detection text can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • an obtaining module configured to, when the complexity is high, obtain, according to the second labeling model among the preset labeling models, the minimum picture corresponding to the target detection picture and the minimum text coordinates of the second text boxes in the minimum picture;
  • a mapping module configured to map the minimum text coordinates in parallel onto the maximum picture corresponding to the target detection picture to obtain the detected text coordinates of the target detection picture, and to calculate, according to the detected text coordinates, the detected text corresponding to the target detection picture.
  • the mapping module includes:
  • the mapping unit is configured to obtain a preset mapping ratio, and enlarge the minimum text coordinates in parallel according to the preset mapping ratio to obtain the detected text coordinates of the target detection image.
  • the minimum text coordinates of the second text box in the minimum picture corresponding to the target detection picture are obtained according to the second labeling model in the preset labeling model.
  • the second text box is a text box obtained by detecting the target detection picture according to the second labeling model
  • the second labeling model is a pre-trained high-complexity labeling model.
  • the minimum picture corresponding to the target detection picture is obtained according to the second labeling model; the minimum picture is the picture obtained after the target detection picture is scaled down.
  • the second labeling model can perform pixel scaling on the target detection picture to obtain the minimum picture.
  • the second text box in the minimum picture is detected based on the second labeling model, thereby obtaining the minimum text coordinates corresponding to the second text box in the minimum picture.
  • once the minimum text coordinates are obtained, they are mapped onto the maximum picture corresponding to the target detection picture; that is, all the obtained minimum text coordinates are simultaneously enlarged according to the preset mapping ratio between the minimum picture and the maximum picture, which yields the detected text coordinates; obtaining the text content at the detected text coordinates then yields the detected text of the target detection picture.
  • a division module configured to obtain initial text pictures, divide the initial text pictures into training pictures and test pictures, input the training pictures into a preset basic labeling model, and obtain the initial text coordinates of the training pictures;
  • a training module configured to calculate the loss function of the basic labeling model according to the coordinates of the labeling text, and when the loss function converges, determine that the basic labeling model is the trained basic labeling model;
  • a verification module configured to verify the trained basic labeling model according to the test pictures, and, when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to the preset pass rate, determine the trained basic labeling model to be the preset labeling model.
  • the training module includes:
  • a labeling unit configured to label the training pictures based on a preset labeling tool to obtain the labeled text coordinates of the training pictures;
  • the third calculation unit is configured to calculate the squared difference between the initial text coordinates and the labeled text coordinates, and calculate the loss function of the basic labeling model according to the squared difference.
  • a basic annotation model needs to be established in advance, and the basic annotation model is trained to obtain the preset annotation model.
  • the preset labeling model includes a first labeling model and a second labeling model.
  • the first labeling model is used for processing low-complexity target detection pictures
  • the second labeling model is used for processing high-complexity target detection pictures.
  • the first labeling model and the second labeling model have different network structures, but both the first labeling model and the second labeling model can be trained by the same training method. Specifically, an initial text picture is obtained, where the initial text picture is a plurality of pre-collected text pictures, and the initial text picture is divided into a training picture and a test picture.
  • the initial text coordinates of the training image are detected based on a basic labeling model, where the basic labeling model may be the network structure of the first labeling model or the network structure of the second labeling model.
  • the initial text coordinates of the training picture are obtained, and at the same time, the training picture is labelled according to the preset labeling tool, and the labelled text coordinates of the training picture are obtained.
  • the basic labeling model is trained according to the initial text coordinates and the labeling text coordinates, that is, the loss function of the basic labeling model is calculated according to the labeling text coordinates and the initial text coordinates. When the loss function converges, the trained basic labeling model is obtained.
  • finally, the trained basic labeling model is tested according to the test pictures: if the similarity between the initial text coordinates detected by the trained basic labeling model and the labeled text coordinates corresponding to a test picture is greater than or equal to the preset similarity threshold, the trained basic labeling model is determined to have passed verification on that test picture; when the verification pass rate of the trained basic labeling model on the test pictures is greater than or equal to the preset pass rate, the trained basic labeling model is determined to be the preset labeling model.
  • the image text detection device proposed in this embodiment realizes text detection on images of different complexity, reduces manual labeling costs, saves the response time of model processing, and further improves the efficiency and accuracy of image text detection.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 6 includes a memory 61, a processor 62, and a network interface 63 that communicate with each other through a system bus. It should be pointed out that only the computer device 6 with components 61-63 is shown in the figure, but it should be understood that implementing all of the shown components is not required, and more or fewer components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 61 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the memory 61 may be an internal storage unit of the computer device 6 , such as a hard disk or a memory of the computer device 6 .
  • the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 61 may also include both the internal storage unit of the computer device 6 and its external storage device.
  • the memory 61 is generally used to store the operating system and various application software installed on the computer device 6 , such as computer-readable instructions of a picture and text detection method.
  • the memory 61 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 62 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips. This processor 62 is typically used to control the overall operation of the computer device 6 . In this embodiment, the processor 62 is configured to execute computer-readable instructions or process data stored in the memory 61, for example, computer-readable instructions for executing the image and text detection method.
  • the network interface 63 may include a wireless network interface or a wired network interface, and the network interface 63 is generally used to establish a communication connection between the computer device 6 and other electronic devices.
  • the computer device proposed in this embodiment realizes text detection for pictures of different complexity, reduces manual labeling costs, saves response time for model processing, and further improves the efficiency and accuracy of picture text detection.
  • the present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to execute the steps of the above-mentioned picture text detection method.
  • the computer-readable storage medium proposed in this embodiment realizes text detection for pictures of different complexity, reduces manual labeling costs, saves the response time of model processing, and further improves the efficiency and accuracy of picture text detection.
  • the method of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

Method and apparatus for detecting text from an image, computer device and storage medium, relating to the field of artificial intelligence. The method comprises: when a target detection picture is received, calculating the complexity of the target detection picture according to a preset detection model; when the complexity is low, calculating target text coordinates of first text boxes in the target detection picture according to a first labeling model among preset labeling models; calculating center coordinates according to the target text coordinates, fusing the first text boxes whose center coordinates are less than or equal to a preset error value into a new text box, and determining the first text boxes whose center coordinates are greater than the preset error value as fixed text boxes; extracting text information from the new text box and the fixed text boxes, and determining the text information as the detected text. The method performs efficient text detection from images.
PCT/CN2021/090512 2020-11-17 2021-04-28 Method and apparatus for detecting text from an image, computer device and storage medium WO2022105120A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011286320.XA CN112395450B (zh) 2020-11-17 2020-11-17 图片文字检测方法、装置、计算机设备及存储介质
CN202011286320.X 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105120A1 true WO2022105120A1 (fr) 2022-05-27

Family

ID=74600891

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090512 WO2022105120A1 (fr) 2020-11-17 2021-04-28 Method and apparatus for detecting text from an image, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112395450B (fr)
WO (1) WO2022105120A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395450B (zh) * 2020-11-17 2024-03-19 平安科技(深圳)有限公司 图片文字检测方法、装置、计算机设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615252A (zh) * 2008-06-25 2009-12-30 中国科学院自动化研究所 一种自适应图像文本信息提取方法
CN111340139A (zh) * 2020-03-27 2020-06-26 中国科学院微电子研究所 一种图像内容复杂度的判别方法及装置
CN111612003A (zh) * 2019-02-22 2020-09-01 北京京东尚科信息技术有限公司 一种提取图片中的文本的方法和装置
CN112395450A (zh) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 图片文字检测方法、装置、计算机设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9262699B2 (en) * 2012-07-19 2016-02-16 Qualcomm Incorporated Method of handling complex variants of words through prefix-tree based decoding for Devanagiri OCR
CN109685055B (zh) * 2018-12-26 2021-11-12 北京金山数字娱乐科技有限公司 一种图像中文本区域的检测方法及装置
CN110046616B (zh) * 2019-03-04 2021-05-25 北京奇艺世纪科技有限公司 图像处理模型生成、图像处理方法、装置、终端设备及存储介质
WO2020223859A1 (fr) * 2019-05-05 2020-11-12 华为技术有限公司 Procédé de détection de texte cursif, appareil et dispositif

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615252A (zh) * 2008-06-25 2009-12-30 中国科学院自动化研究所 一种自适应图像文本信息提取方法
CN111612003A (zh) * 2019-02-22 2020-09-01 北京京东尚科信息技术有限公司 一种提取图片中的文本的方法和装置
CN111340139A (zh) * 2020-03-27 2020-06-26 中国科学院微电子研究所 一种图像内容复杂度的判别方法及装置
CN112395450A (zh) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 图片文字检测方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN112395450A (zh) 2021-02-23
CN112395450B (zh) 2024-03-19

Similar Documents

Publication Publication Date Title
WO2022142014A1 (fr) Procédé de classification de texte sur la base d'une fusion d'informations multimodales et dispositif associé correspondant
US20220253631A1 (en) Image processing method, electronic device and storage medium
WO2022174491A1 (fr) Procédé et appareil fondés sur l'intelligence artificielle pour le contrôle qualité des dossiers médicaux, dispositif informatique et support de stockage
US10311288B1 (en) Determining identity of a person in a digital image
US11861919B2 (en) Text recognition method and device, and electronic device
WO2023035531A1 (fr) Procédé de reconstruction à super-résolution pour image de texte et dispositif associé
KR20210090576A (ko) 품질을 관리하는 방법, 장치, 기기, 저장매체 및 프로그램
CN113780098B (zh) 文字识别方法、装置、电子设备以及存储介质
WO2022142032A1 (fr) Procédé et appareil de vérification de signature manuscrite, dispositif informatique et support de stockage
WO2022105119A1 (fr) Procédé de génération de corpus d'apprentissage pour un modèle de reconnaissance d'intention, et dispositif associé
WO2023280106A1 (fr) Procédé et appareil d'acquisition d'informations, dispositif et support
CN112330331A (zh) 基于人脸识别的身份验证方法、装置、设备及存储介质
CN112395834B (zh) 基于图片输入的脑图生成方法、装置、设备及存储介质
WO2022105120A1 (fr) Procédé et appareil de détection de texte à partir d'une image, dispositif informatique et support de mémoire
WO2022001233A1 (fr) Procédé de pré-étiquetage basé sur un apprentissage par transfert hiérarchique et dispositif associé
CN112508005B (zh) 用于处理图像的方法、装置、设备以及存储介质
CN112396048B (zh) 图片信息提取方法、装置、计算机设备及存储介质
CN113988223A (zh) 证件图像识别方法、装置、计算机设备及存储介质
CN113742485A (zh) 一种处理文本的方法和装置
CN113837194A (zh) 图像处理方法、图像处理装置、电子设备以及存储介质
CN112651399A (zh) 检测倾斜图像中同行文字的方法及其相关设备
CN116774973A (zh) 数据渲染方法、装置、计算机设备及存储介质
CN115761778A (zh) 一种文献重构方法、装置、设备和存储介质
CN116030375A (zh) 视频特征提取、模型训练方法、装置、设备及存储介质
WO2021151274A1 (fr) Procédé et appareil de traitement de fichier image, dispositif électronique et support d'enregistrement lisible par ordinateur

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893276

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893276

Country of ref document: EP

Kind code of ref document: A1