WO2022262239A1 - Text identification method, apparatus and device, and storage medium - Google Patents

Text identification method, apparatus and device, and storage medium Download PDF

Info

Publication number
WO2022262239A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
picture
convolution
feature
text picture
Prior art date
Application number
PCT/CN2021/139972
Other languages
French (fr)
Chinese (zh)
Inventor
赵坤
杨争艳
吴嘉嘉
殷兵
胡金水
刘聪
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司
Publication of WO2022262239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the technical field of natural language processing, and more specifically, to a text recognition method, apparatus, device and storage medium.
  • text recognition is increasingly widely used in everyday life, for example road sign recognition in autonomous driving, photo translation, and document scanning and recognition.
  • the present application is proposed to provide a text recognition method, apparatus, device and storage medium, so as to accurately perform text recognition on text pictures to be recognized whose directions are diverse.
  • the specific scheme is as follows:
  • a text recognition method comprising:
  • said extracting image features in at least two different directions from said text picture includes:
  • using the convolutional network to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • said utilizing said convolutional network to extract image features in at least two different directions of said text picture includes:
  • the convolution kernel of each convolutional layer includes the original convolution kernel and the convolution kernel obtained from it by at least one rotation;
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • said utilizing said convolutional network to extract image features in at least two different directions of said text picture includes:
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • the at least two feature submaps include:
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation;
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the rotated convolution kernel.
  • the identifying the text content contained in the text picture based on the extracted image features in the at least two different directions includes:
  • the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained by using sample picture training data labeled with text content recognition results.
  • said acquiring the text picture to be recognized includes:
  • if it is detected that the original text picture is tilted relative to the horizontal direction, the original text picture is rotated to the horizontal direction and used as the text picture to be recognized.
  • the method further includes:
  • if it is determined that the aspect ratio exceeds the set threshold, the original text picture in the horizontal direction is rotated by 90 degrees and used as the text picture to be recognized.
  • before inputting the text picture into the pre-built convolutional network, the method further includes:
  • then said inputting the text picture into the pre-built convolutional network includes:
  • of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with higher confidence is taken as the final recognition result.
  • a text recognition device comprising:
  • a picture acquisition unit configured to acquire a text picture to be recognized, where the text picture is an image area where the text to be recognized is located;
  • a feature extraction unit configured to extract image features in at least two different directions from the text picture
  • a text content identification unit configured to identify the text content included in the text picture based on the extracted image features in the at least two different directions.
  • a text recognition device comprising: a memory and a processor
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the above-mentioned text recognition method.
  • a storage medium on which a computer program is stored, and when the computer program is executed by a processor, each step of the above-mentioned text recognition method is realized.
  • the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, improving the accuracy of text recognition.
  • Figure 1 illustrates schematic diagrams of several text images distributed in different directions
  • FIG. 2 is a schematic flow chart of a text recognition method provided in an embodiment of the present application.
  • Fig. 3 illustrates a schematic diagram of a process of rotating an original text image to a horizontal direction
  • Fig. 4 illustrates a schematic diagram of a process of rotating a text image to be placed horizontally
  • Figure 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share a rotation convolution kernel
  • Fig. 6 illustrates a schematic diagram of a recognition network architecture with an encoder-decoder structure
  • FIG. 7 illustrates a schematic diagram of a process of performing a rotation operation on a text image with text flipping
  • FIG. 8 is a schematic structural diagram of an identification processing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application.
  • the solution of this application can be implemented based on a terminal capable of data processing, and the terminal can be a mobile phone, a computer, a server, a cloud, and the like.
  • the text recognition method of the present application may include the following steps:
  • Step S100 acquiring a text image to be recognized.
  • the picture of the text to be recognized is the image area where the text to be recognized is located.
  • the text picture that needs to be recognized can be obtained directly, or the text region detection can be performed on the original picture containing the text to be recognized to obtain the image region where the text to be recognized is located.
  • the text picture acquired in this step may be a text line picture, that is, an image area where a line of text is located.
  • Step S110 extracting image features in at least two different directions from the text image.
  • the direction of the text content of the text picture to be recognized is not fixed.
  • in order to accommodate the recognition of text content in various directions, the direction information captured during image feature extraction is strengthened, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features include feature information in multiple different directions of the text to be recognized in the text picture.
  • Step S120 based on the extracted image features in the at least two different directions, identify the text content contained in the text picture.
  • based on the extracted image features, the text content contained in the text picture can be recognized more accurately, improving the accuracy of text recognition.
  • the text recognition method provided in the embodiment of the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, improving the accuracy of text recognition.
  • the process of acquiring the text picture to be recognized in step S100 is further introduced below.
  • the original text picture to be recognized is first obtained, and the original text picture is a picture obtained through text region detection.
  • the original text picture may be a text line picture.
  • the shape of the original text picture may also be different.
  • the original text picture may be a rectangle, a parallelogram or another shape; typically, a rectangular original text picture is selected in this application.
  • the obtained original text picture may be inclined relative to the horizontal direction, that is, none of its sides is parallel to the horizontal direction. As shown in FIG. 3, the original text picture on the left of FIG. 3 is inclined relative to the horizontal direction.
  • the original text picture may be rotated to a horizontal direction as the text picture to be recognized.
  • the rotation direction of the original text picture can be counterclockwise or clockwise, as long as one side of the rotated text picture is parallel to the horizontal direction.
  • after this rotation, image features can be extracted along the four main directions defined by the horizontal and vertical central axes of the text, and the extracted image features are more convenient for subsequent text content recognition.
  • the text picture rotated in the horizontal direction may have two forms, one is that the text picture is placed vertically, and the other is that the text picture is placed horizontally.
  • the specific implementation process may include:
  • the aspect ratio of the original text image in the horizontal direction is further calculated.
  • the aspect ratio threshold may be preset, such as being set to 2 or other optional values.
  • if the calculated aspect ratio exceeds the set threshold, the text picture can be considered to be placed vertically, so the original text picture in the horizontal direction can be further rotated by 90 degrees and used as the text picture to be recognized; if the calculated aspect ratio does not exceed the set threshold, the text picture can be considered to be placed horizontally and can be used directly as the text picture to be recognized without rotation.
  • the above-mentioned process of rotating the original text image in the horizontal direction by 90 degrees may be performed clockwise or counterclockwise. As shown in FIG. 4 , for the text picture on the left side in FIG. 4 , it is placed vertically and can be rotated 90 degrees clockwise.
  • the process of extracting image features in at least two different directions from the text picture in step S110 is introduced below.
  • This embodiment introduces an optional solution for image feature extraction through convolutional networks. Specifically, this application can pre-train a convolutional network for extracting image features.
  • the convolutional network can adopt a Resnet29 network or another form of network with convolutional layers.
  • a convolutional network consists of several convolutional layers.
  • the convolution kernel of the convolution layer is set as a shared rotation convolution kernel.
  • the convolution kernel is the weight matrix, and the convolution operation is performed on the feature map output by the previous convolution layer through the convolution kernel.
  • the so-called shared rotation convolution kernel means keeping the parameters of the convolution kernel unchanged (that is, the weights in the weight matrix remain unchanged) and rotating the convolution kernel at least once in units of 90 degrees; the rotated convolution kernel and the convolution kernel before rotation are shared rotation convolution kernels of each other.
  • the feature map output by each convolutional layer is obtained by fusing at least two feature submaps, and the at least two feature submaps are obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • the above-mentioned at least two feature submaps may include:
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation;
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel after it has been rotated by 90, 180 and/or 270 degrees in a set direction.
  • the convolution kernel performs a convolution operation on the feature map output by the previous convolution layer before rotation, and a corresponding feature submap can be obtained.
  • on this basis, the convolution kernel can be rotated by 90, 180 and/or 270 degrees in the above-mentioned manner, and after each rotation the rotated convolution kernel is used to perform a convolution operation on the feature map output by the previous convolutional layer, yielding one additional feature submap. For example, rotating the convolution kernel once gives one additional feature submap, rotating it twice gives two additional feature submaps, and rotating it three times gives three additional feature submaps.
  • the feature submaps obtained before and after the rotation of the convolution kernel are fused to obtain the output of the current convolution layer.
  • FIG. 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share a rotation convolution kernel.
  • the convolution kernel may be rotated in this embodiment, for example, rotated 90 degrees, 180 degrees and 270 degrees counterclockwise, respectively.
  • each convolution kernel before and after rotation is used to perform a convolution operation on the feature map output by the previous layer to obtain feature information in four main directions.
  • the feature submaps obtained after the convolution operation of each convolution kernel are F_{i,0}, F_{i,90}, F_{i,180} and F_{i,270}, respectively.
  • the output of the current layer is the fusion F_i of the above four feature submaps.
  • the calculation formula of each feature is as follows:
  • F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
  • cat() represents the concatenation (splicing) operation of features.
  • the size of the fused output feature F_i is 4C'*H'*W'.
  • C', H' and W' are, respectively, the number of channels, the height and the width of the features of the current convolutional layer.
  • the fused output feature F_i is input to the next convolutional layer for processing, and so on.
  • the features extracted by each convolutional layer contain different direction information, which strengthens the directionality of the extracted image features.
  • the features output by the last convolutional layer are used as the image feature F extracted by the convolutional network.
  • when extracting image features in at least two different directions from a text picture, the text picture can be input into the convolutional network.
  • the convolutional network is used to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is obtained by fusing at least two feature submaps, and the at least two feature submaps are obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • two optional implementation architectures of the above-mentioned convolutional network and an implementation process of image feature extraction using the convolutional network are further introduced.
  • the convolution kernel of each convolution layer in the convolution network may include an original convolution kernel and a convolution kernel after at least one rotation of the original convolution kernel.
  • the process of using a convolutional network to extract image features in at least two different directions of the text picture may include:
  • since the convolution kernel of each convolutional layer includes the original convolution kernel and the convolution kernel after at least one rotation, the feature submaps extracted respectively by the original convolution kernel and by its rotated convolution kernels can be obtained.
  • the fusion of the feature submaps extracted by the original convolution kernel and its rotated convolution kernels can follow the earlier introduction, for example, the feature submaps are concatenated along the channel dimension.
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • the convolution kernel of each convolution layer in the convolutional network includes the original convolution kernel, but does not include the rotated convolution kernel.
  • the process of using a convolutional network to extract image features in at least two different directions of the text picture may include:
  • the feature map output by the last convolutional layer of the convolutional network is used as the image feature of the text picture.
  • in the former architecture, the convolutional network is pre-configured with the convolution kernels both before and after rotation, so that each convolution kernel can be used directly for feature extraction when extracting image features; in the latter architecture, only the convolution kernel before rotation is configured, so when extracting image features the convolution kernel needs to be rotated at least once before the kernels before and after rotation can be used to extract features. Both implementations can realize the extraction of multi-directional image features, and technicians can choose between them according to actual needs.
  • the embodiment of the present application further introduces step S120, the process of identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
  • a neural network model can be selected to handle the text recognition task, that is, a recognition network can be pre-trained, and the recognition network and the convolutional network together form a text recognition model. Specifically, the output of the convolutional network is used as the input of the recognition network, and the convolutional network and the recognition network are jointly trained.
  • the recognition network can choose a variety of neural network architectures, for example, an Encoder-Decoder encoding and decoding architecture can be used, as shown in FIG. 6 .
  • the encoder can adopt a bidirectional LSTM (Long Short-Term Memory) structure, which takes the image feature F output by the convolutional network in the previous step as input and outputs the hidden state h_i of each frame of the encoder.
  • the decoder can adopt a GRU (Gated Recurrent Unit) or LSTM structure.
  • an attention mechanism can be used to calculate the correlation between the decoder hidden state s_t and the hidden state h_i of each frame of the encoder, so as to obtain the context feature vector c_t; in the corresponding calculation, o represents the dot multiplication operation and T represents the length of the encoder (a sketch of one concrete form of this encoder-decoder computation is given after this list).
  • during the above rotation processing, the text in the final text picture may also end up reversed (flipped).
  • to handle this, the text picture is used as a forward text picture, and the forward text picture is rotated by 180 degrees to obtain a reverse text picture.
  • of the text content recognized from the forward text picture and the text content recognized from the reverse text picture, the one with higher confidence is taken as the final recognition result.
  • the text recognition device provided by the embodiment of the present application is described below; the text recognition device described below and the text recognition method described above may be referred to in correspondence with each other.
  • FIG. 8 is a schematic structural diagram of a text recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • the picture acquisition unit 11 is used to acquire the text picture to be recognized, and the text picture is the image area where the text to be recognized is located;
  • a feature extraction unit 12 configured to extract image features in at least two different directions for the text picture
  • the text content identification unit 13 is configured to identify the text content included in the text picture based on the extracted image features in the at least two different directions.
  • the above-mentioned feature extraction unit may include:
  • a convolutional network processing unit configured to input the text picture into a pre-built convolutional network and to use the convolutional network to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is obtained by fusing at least two feature submaps, and the at least two feature submaps are obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
  • the embodiment of the present application provides two optional implementation structures of the convolutional network processing unit, which are as follows:
  • the first type, the convolutional network processing unit includes:
  • the first convolution operation unit is used to use the convolution kernel of each convolution layer in the convolution network to perform a convolution operation on the feature map output by the previous convolution layer to obtain the feature submap extracted by each convolution kernel.
  • the convolution kernel of each convolution layer includes the original convolution kernel and its convolution kernel after at least one rotation;
  • the first feature fusion unit is used to fuse the feature submaps extracted by the original convolution kernel and the rotated convolution kernels, and input the fused feature map into the next convolution layer;
  • the first convolutional output unit is used to use the feature map output by the last convolutional layer of the convolutional network as the image feature of the text picture.
  • the second type, the convolutional network processing unit includes:
  • a convolution kernel rotation unit used to rotate the convolution kernel of each convolution layer in the convolution network at least once
  • the second convolution operation unit is used to perform convolution operation on the feature map output by the previous convolution layer by using the convolution kernel before and after rotation, and obtain the feature submap extracted by each convolution kernel before and after rotation ;
  • the second feature fusion unit is used to fuse the feature submaps extracted by the convolution kernel and the rotated convolution kernels, and input the fused feature map into the next convolution layer;
  • the second convolution output unit is used to use the feature map output by the last convolution layer of the convolution network as the image feature of the text picture.
  • the above at least two feature subgraphs may include:
  • the feature submap obtained by convolving the feature map output by the previous convolution layer with the same convolution kernel before rotation;
  • the feature submap obtained by convolving the feature map output by the previous convolutional layer with the rotated convolution kernel.
  • the above-mentioned text content identification unit may include:
  • a recognition network processing unit configured to input the extracted image features in at least two different directions into a pre-built recognition network to obtain the text content contained in the text picture output by the recognition network; wherein, the recognition network and The convolutional network forms a text recognition model, and the text recognition model is trained by using sample image training data labeled with text content recognition results.
  • the above image acquisition unit may include:
  • An original picture acquisition unit configured to acquire an original text picture to be recognized, the original text picture being a rectangle;
  • the first rotating unit is configured to rotate the original text picture to the horizontal direction as the text picture to be recognized if it is detected that the original text picture is tilted relative to the horizontal direction.
  • the above picture acquisition unit may also include:
  • an aspect ratio calculation unit configured to calculate the aspect ratio of the original text image in the horizontal direction after the processing by the first rotation unit
  • the second rotation unit is configured to rotate the original text picture in the horizontal direction by 90 degrees as the text picture to be recognized if it is determined that the aspect ratio exceeds the set threshold.
  • the device of the present application may also include:
  • the third rotation unit is configured to use the text picture as a forward text picture and rotate the forward text picture by 180 degrees to obtain a reverse text picture before inputting the text picture into the pre-built convolutional network.
  • the above-mentioned convolutional network processing unit may include:
  • a forward and reverse text picture input unit, used to input the forward text picture and the reverse text picture respectively into the convolutional network in the text recognition model, to obtain the text content contained in the forward text picture and its confidence, and the text content contained in the reverse text picture and its confidence, as output by the text recognition model;
  • a confidence selection unit, used to take, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with higher confidence as the final recognition result.
  • FIG. 9 shows a block diagram of the hardware structure of the text recognition device.
  • the hardware structure of the text recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
  • the numbers of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 are each at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one magnetic disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used for:
  • the embodiment of the present application also provides a storage medium, which can store a program suitable for execution by a processor, and the program is used for:
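As an illustration of the encoder-decoder recognition network with attention described in the items above, the following is a minimal PyTorch sketch: a bidirectional LSTM encoder produces the per-frame hidden states h_i, and at each decoding step an attention weight over the T encoder frames yields a context vector c_t that is fed to a GRU decoder cell. The dot-product form of the attention score, the zero initialization, and all class and variable names are assumptions of this sketch rather than details taken from the patent.

```python
# Minimal sketch of an attention-based encoder-decoder recognition network:
# bidirectional LSTM encoder -> per-frame states h_i; GRU decoder cell with a
# context vector c_t computed by attention over the encoder states.
import torch
import torch.nn as nn

class AttentionRecognizer(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.decoder = nn.GRUCell(2 * hidden_dim + vocab_size, 2 * hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, image_features: torch.Tensor, max_steps: int = 25) -> torch.Tensor:
        # image_features: (batch, T, feat_dim), the frames of the feature map F.
        h, _ = self.encoder(image_features)           # h_i: (batch, T, 2*hidden_dim)
        s_t = h.new_zeros(h.size(0), h.size(2))       # decoder hidden state s_t
        y_t = h.new_zeros(h.size(0), self.classifier.out_features)
        outputs = []
        for _ in range(max_steps):
            # Attention: correlate s_t with every encoder state h_i, then form c_t.
            scores = torch.bmm(h, s_t.unsqueeze(2)).squeeze(2)   # (batch, T)
            alpha = torch.softmax(scores, dim=1)
            c_t = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # context vector c_t
            s_t = self.decoder(torch.cat([c_t, y_t], dim=1), s_t)
            y_t = self.classifier(s_t)                           # per-step character scores
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)            # (batch, max_steps, vocab_size)
```

In training, such a recognition network and the convolutional feature extractor would be trained jointly on sample pictures labeled with text content recognition results, as stated above; the sketch omits training details such as feeding the previous ground-truth character to the decoder.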

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

Disclosed in the present application are a text identification method, apparatus and device, and a storage medium. The present application comprises: obtaining a text picture corresponding to an image area where text to be identified is located, further extracting image features in at least two different directions from the text picture to be identified, and further identifying text content contained in the text picture on the basis of the extracted image features in the at least two different directions. Hence, for the text picture to be identified, in view of the diversification in text content directions, the extracted direction information is enhanced during image feature extraction in the present application, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features comprise feature information of the text in the text picture in a plurality of directions. On this basis, the text content comprised in the text picture can be identified more accurately from the extracted image features, and the accuracy of text identification is improved.

Description

Text recognition method, device, equipment and storage medium
This application claims priority to the Chinese patent application No. 202110666915.6, filed with the China Patent Office on June 16, 2021 and entitled "Text recognition method, device, equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, and more specifically, to a text recognition method, device, equipment and storage medium.
Background Art
With the development of text recognition technology, text recognition is increasingly widely used in everyday life, for example road sign recognition in autonomous driving, photo translation, and document scanning and recognition.
In real life, text regions in scene pictures are distributed in a variety of directions, as shown in Figure 1, including horizontal text, oblique text, vertical text and so on. The diverse directionality of the text pictures to be recognized brings greater challenges to text recognition, and how to accurately perform text recognition on such pictures has become a problem that urgently needs to be solved in the industry.
Summary of the Invention
In view of the above problems, the present application is proposed to provide a text recognition method, device, equipment and storage medium, so as to accurately perform text recognition on text pictures to be recognized whose directions are diverse. The specific scheme is as follows:
A text recognition method, comprising:
acquiring a text picture to be recognized, where the text picture is the image area where the text to be recognized is located;
extracting image features in at least two different directions from the text picture;
identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
Preferably, said extracting image features in at least two different directions from the text picture includes:
inputting the text picture into a pre-built convolutional network;
using the convolutional network to extract image features in at least two different directions of the text picture, wherein the feature map output by each convolutional layer in the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation and after at least one rotation.
Preferably, said using the convolutional network to extract image features in at least two different directions of the text picture includes:
performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernel of each convolutional layer in the convolutional network, to obtain the feature submap extracted by each convolution kernel, where the convolution kernel of each convolutional layer includes the original convolution kernel and the convolution kernel obtained from it by at least one rotation;
fusing the feature submaps extracted by the original convolution kernel and by each of its rotated convolution kernels, and inputting the fused feature map into the next convolutional layer;
taking the feature map output by the last convolutional layer of the convolutional network as the image feature of the text picture.
Preferably, said using the convolutional network to extract image features in at least two different directions of the text picture includes:
rotating the convolution kernel of each convolutional layer in the convolutional network at least once, and performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels before and after rotation, to obtain the feature submap extracted by each convolution kernel before and after rotation;
fusing the feature submaps extracted by the convolution kernel and by each of its rotated convolution kernels, and inputting the fused feature map into the next convolutional layer;
taking the feature map output by the last convolutional layer of the convolutional network as the image feature of the text picture.
Preferably, the at least two feature submaps include:
the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and
the feature submap obtained by convolving the feature map output by the previous convolutional layer with the same convolution kernel after it has been rotated by 90 degrees, 180 degrees and/or 270 degrees in a set direction.
Preferably, said identifying the text content contained in the text picture based on the extracted image features in the at least two different directions includes:
inputting the extracted image features in the at least two different directions into a pre-built recognition network, to obtain the text content contained in the text picture as output by the recognition network;
wherein the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained with sample picture training data labeled with text content recognition results.
Preferably, said acquiring the text picture to be recognized includes:
acquiring an original text picture to be recognized;
if it is detected that the original text picture is tilted relative to the horizontal direction, rotating the original text picture to the horizontal direction as the text picture to be recognized.
Preferably, after said rotating the original text picture to the horizontal direction, the method further includes:
calculating the aspect ratio of the original text picture in the horizontal direction;
if it is determined that the aspect ratio exceeds a set threshold, rotating the original text picture in the horizontal direction by 90 degrees as the text picture to be recognized.
Preferably, before inputting the text picture into the pre-built convolutional network, the method further includes:
taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
then said inputting the text picture into the pre-built convolutional network includes:
inputting the forward text picture and the reverse text picture respectively into the convolutional network in the text recognition model, to obtain the text content contained in the forward text picture and its confidence, and the text content contained in the reverse text picture and its confidence, as output by the text recognition model;
taking, of the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with higher confidence as the final recognition result.
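Only as an illustration of the forward/reverse strategy in the preceding paragraphs, the following Python sketch assumes a recognize(picture) helper that runs the text recognition model and returns the recognized text together with a confidence score; the helper and its return convention are assumptions of this sketch, not defined by the patent.

```python
# Illustrative sketch: recognize both the forward text picture and the picture
# rotated by 180 degrees, and keep the result with the higher confidence.
import numpy as np

def recognize_with_flip(picture: np.ndarray, recognize) -> str:
    forward_text, forward_conf = recognize(picture)
    reverse_picture = np.rot90(picture, k=2)   # rotate the forward picture by 180 degrees
    reverse_text, reverse_conf = recognize(reverse_picture)
    # Take the text content with the higher confidence as the final recognition result.
    return forward_text if forward_conf >= reverse_conf else reverse_text
```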
A text recognition apparatus, comprising:
a picture acquisition unit, configured to acquire a text picture to be recognized, where the text picture is the image area where the text to be recognized is located;
a feature extraction unit, configured to extract image features in at least two different directions from the text picture;
a text content recognition unit, configured to identify the text content contained in the text picture based on the extracted image features in the at least two different directions.
A text recognition device, comprising: a memory and a processor;
the memory is used to store a program;
the processor is configured to execute the program to implement each step of the text recognition method described above.
A storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements each step of the text recognition method described above.
With the above technical solution, the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, which improves the accuracy of text recognition.
Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the application. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Figure 1 illustrates schematic diagrams of several text pictures distributed in different directions;
Figure 2 is a schematic flow chart of a text recognition method provided in an embodiment of the present application;
Figure 3 illustrates a schematic diagram of a process of rotating an original text picture to the horizontal direction;
Figure 4 illustrates a schematic diagram of a process of rotating a text picture to a horizontal placement;
Figure 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share a rotation convolution kernel;
Figure 6 illustrates a schematic diagram of a recognition network architecture with an encoder-decoder structure;
Figure 7 illustrates a schematic diagram of a process of performing a rotation operation on a text picture whose text is flipped;
Figure 8 is a schematic structural diagram of a recognition processing apparatus provided by an embodiment of the present application;
Figure 9 is a schematic structural diagram of a text recognition device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Apparently, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The solution of the present application can be implemented on a terminal with data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud, and the like.
Next, with reference to Figure 2, the text recognition method of the present application may include the following steps:
Step S100: acquiring a text picture to be recognized.
Specifically, the text picture to be recognized is the image area where the text to be recognized is located. In this step, a text picture that needs text recognition can be obtained directly, or text region detection can be performed on an original picture containing the text to be recognized to obtain the image area where the text to be recognized is located.
Further, in order to facilitate text recognition, the text picture acquired in this step may be a text line picture, that is, the image area where one line of text is located.
Step S110: extracting image features in at least two different directions from the text picture.
Specifically, the direction of the text content in the text picture to be recognized is not fixed. In order to accommodate the recognition of text content in various directions, this step strengthens the direction information captured during image feature extraction; that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple different directions of the text to be recognized in the text picture. Of course, the image features can be extracted in many different ways, which is not strictly limited in this step.
Step S120: identifying the text content contained in the text picture based on the extracted image features in the at least two different directions.
Specifically, after the image features in at least two different directions of the text picture to be recognized are extracted, the text content contained in the text picture can be recognized more accurately based on the extracted image features, which improves the accuracy of text recognition.
The text recognition method provided in the embodiment of the present application obtains the text picture corresponding to the image area where the text to be recognized is located, extracts image features in at least two different directions from the text picture to be recognized, and then identifies the text content contained in the text picture based on the extracted image features in the at least two different directions. It can be seen that, for the text picture to be recognized, given the diverse directions of its text content, the present application strengthens the direction information captured during image feature extraction, that is, feature extraction is performed on the text picture from two or more different directions, so that the extracted image features contain feature information in multiple directions of the text to be recognized in the text picture. On this basis, the text content contained in the text picture can be identified more accurately based on the extracted image features, which improves the accuracy of text recognition.
In some embodiments of the present application, the process of acquiring the text picture to be recognized in the foregoing step S100 is further introduced.
In this embodiment, the original text picture to be recognized is obtained first, and the original text picture is a picture obtained through text region detection. In general, the original text picture may be a text line picture. Depending on the detection method used for text region detection, the shape of the original text picture may differ; for example, the original text picture may be a rectangle, a parallelogram or another shape. Typically, a rectangular original text picture can be selected in the present application.
Since the position, orientation and text direction of the text region may differ in different scenes, the obtained original text picture may be inclined relative to the horizontal direction, that is, none of its sides is parallel to the horizontal direction. As shown in Figure 3, the original text picture on the left of Figure 3 is inclined relative to the horizontal direction.
In order to better support the extraction of image features in multiple different directions from the text picture in subsequent steps, in this embodiment the original text picture may be rotated to the horizontal direction and used as the text picture to be recognized.
The original text picture may be rotated counterclockwise or clockwise, as long as one side of the rotated text picture is parallel to the horizontal direction.
It can be understood that, after the above rotation processing, when multiple convolution kernels are used to extract image features, the features can be extracted along the four main directions defined by the horizontal and vertical central axes of the text, and the extracted image features are more convenient for subsequent text content recognition.
Furthermore, the text picture rotated to the horizontal direction may take two forms: either the text picture is placed vertically, or the text picture is placed horizontally. A vertically placed text picture is more difficult to process in the subsequent network; at the same time, in order to keep the data uniform, in this embodiment all text pictures can be adjusted to horizontal placement. The specific implementation process may include:
after the original text picture is rotated to the horizontal direction as described above, the aspect ratio of the original text picture in the horizontal direction is further calculated.
Specifically, by calculating the aspect ratio, it can be judged whether the text picture is placed vertically or horizontally. For example, an aspect ratio threshold can be preset, such as 2 or another value. When the calculated aspect ratio exceeds the set threshold, the text picture can be considered to be placed vertically, so the original text picture in the horizontal direction can be further rotated by 90 degrees and used as the text picture to be recognized; if the calculated aspect ratio does not exceed the set threshold, the text picture can be considered to be placed horizontally and is used directly as the text picture to be recognized without rotation.
The above rotation of the original text picture in the horizontal direction by 90 degrees may be performed clockwise or counterclockwise. As shown in Figure 4, the text picture on the left of Figure 4 is placed vertically and can be rotated 90 degrees clockwise.
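The following is a minimal preprocessing sketch of the rotation steps described above, written in Python with OpenCV purely as an illustration: the function name, the use of OpenCV, and the assumption that the tilt angle of the detected text region is already known are not part of the patent; the aspect-ratio threshold of 2 follows the example given above, and the canvas is not expanded during the deskew rotation for brevity.

```python
# Illustrative sketch (not the patent's reference implementation): rotate a tilted
# text picture to the horizontal direction, then use the height/width ratio to
# decide whether a further 90-degree rotation is needed.
import cv2
import numpy as np

def normalize_text_picture(image: np.ndarray, tilt_angle_deg: float,
                           aspect_ratio_threshold: float = 2.0) -> np.ndarray:
    # Step 1: rotate the original text picture to the horizontal direction.
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), tilt_angle_deg, 1.0)
    horizontal = cv2.warpAffine(image, matrix, (w, h))

    # Step 2: if the height/width ratio exceeds the set threshold (e.g. 2), the
    # text picture is considered vertically placed and is rotated by 90 degrees.
    h, w = horizontal.shape[:2]
    if h / w > aspect_ratio_threshold:
        horizontal = cv2.rotate(horizontal, cv2.ROTATE_90_CLOCKWISE)
    return horizontal
```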
In some embodiments of the present application, the process of extracting image features in at least two different directions from the text picture in step S110 is introduced.
This embodiment introduces an optional scheme for image feature extraction through a convolutional network. Specifically, the present application can pre-train a convolutional network for extracting image features.
The convolutional network can adopt a Resnet29 network or another form of network with convolutional layers.
卷积网络包括若干个卷积层。为了实现对文本图片提取至少两个不同方向上的图像特征,本实施例中将卷积层的卷积核设置为共享旋转卷积核。这里,卷积核即权值矩阵,通过卷积核对前一卷积层输出的特征图进行卷积操作。所谓的共享旋转卷积核即,保持卷积核的参数不变(即权值矩阵内各权值不变),将卷积核以90度为单位进行至少一次旋转,旋转后的卷积核和旋转前的卷积核互为共享旋转卷积核。A convolutional network consists of several convolutional layers. In order to realize the extraction of image features in at least two different directions from the text picture, in this embodiment, the convolution kernel of the convolution layer is set as a shared rotation convolution kernel. Here, the convolution kernel is the weight matrix, and the convolution operation is performed on the feature map output by the previous convolution layer through the convolution kernel. The so-called shared rotation convolution kernel is to keep the parameters of the convolution kernel unchanged (that is, the weights in the weight matrix remain unchanged), and rotate the convolution kernel at least once in units of 90 degrees. The rotated convolution kernel The convolution kernel and the convolution kernel before rotation are shared rotation convolution kernels.
通过共享旋转卷积核,可以获取文本图片的不同方向上的特征信息。By sharing the rotation convolution kernel, feature information in different directions of the text image can be obtained.
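The weight sharing across rotations can be pictured with the short PyTorch-style snippet below (a sketch only; the tensor shape is an arbitrary example). All four orientations are views of one and the same parameter tensor, so no additional weights are introduced and gradients from every orientation flow back into the single shared kernel.

import torch

# One shared weight tensor of shape (out_channels, in_channels, 3, 3).
weight = torch.randn(8, 4, 3, 3, requires_grad=True)

# The same parameters viewed at 0/90/180/270 degrees over the spatial dimensions.
rotated_weights = [torch.rot90(weight, k=k, dims=(2, 3)) for k in range(4)]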
The feature map output by each convolutional layer is formed by fusing at least two feature submaps, where the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.

Specifically, the at least two feature submaps may include:

a feature submap obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and

feature submaps obtained by rotating the same convolution kernel by 90 degrees, 180 degrees and/or 270 degrees in a set direction (e.g., clockwise or counterclockwise) and then performing a convolution operation on the feature map output by the previous convolutional layer with the rotated kernel.

It can be understood that performing a convolution operation on the feature map output by the previous convolutional layer with the kernel before rotation yields one corresponding feature submap. On this basis, the kernel may be rotated by 90 degrees, 180 degrees and/or 270 degrees in the manner described above, and after each rotation a convolution operation is performed on the feature map output by the previous convolutional layer with the rotated kernel, yielding one additional feature submap. For example, rotating the kernel once yields one additional feature submap, rotating it twice yields two additional feature submaps, and rotating it three times yields three additional feature submaps.

The feature submaps obtained before and after rotation of the convolution kernel are fused to obtain the output of the current convolutional layer.

As shown in FIG. 5, FIG. 5 illustrates a schematic diagram of a feature extraction process in which two adjacent convolutional layers share rotation convolution kernels.

Define the output feature of the previous layer of the convolutional network as F_{i-1}, with size C*H*W, where C is the number of channels, H is the height and W is the width of the feature.

The feature F_{i-1} is taken as the input of the current convolutional layer, and the convolution kernel size is set to 3*3 (FIG. 5 is only an example; the kernel may also have other sizes, which is not strictly limited in the present application).

To enable the convolution kernel to capture text features in different directions, in this embodiment the kernel may be rotated, for example counterclockwise by 90 degrees, 180 degrees and 270 degrees, respectively. The kernels before and after rotation are then each used to perform a convolution operation on the feature map output by the previous layer, so as to obtain feature information in the four main directions. The feature submaps obtained after convolution with the respective kernels are F_{i,0}, F_{i,90}, F_{i,180} and F_{i,270}. The output of the current layer is the fusion F_i of these four feature submaps. Each feature is computed as follows:
F_{i,0} = conv_0(F_{i-1})

F_{i,90} = conv_90(F_{i-1})

F_{i,180} = conv_180(F_{i-1})

F_{i,270} = conv_270(F_{i-1})

F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
where cat() denotes the concatenation of features.

The fused output feature F_i has size 4C′*H′*W′, where C′, H′ and W′ are respectively the number of channels, the height and the width of the features of the current convolutional layer.

The fused output feature F_i is input to the next convolutional layer for processing, and so on; the features extracted by each convolutional layer thus contain information from different directions, which strengthens the directionality of the extracted image features. The features output by the last convolutional layer serve as the image features F extracted by the convolutional network.
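A minimal sketch of one such layer is given below, assuming a PyTorch-style implementation with a 3*3 kernel shared across the four orientations of FIG. 5; the ReLU nonlinearity and the weight initialization are assumptions added for the example and are not specified in the text.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedRotationConv(nn.Module):
    """One convolutional layer whose 3x3 kernel is shared across 0/90/180/270 degrees;
    the four feature submaps F_{i,0..270} are concatenated along the channel dimension."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_channels, in_channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, prev_feature_map: torch.Tensor) -> torch.Tensor:
        submaps = []
        for k in range(4):  # 0, 90, 180, 270 degrees, counterclockwise as in FIG. 5
            rotated_kernel = torch.rot90(self.weight, k=k, dims=(2, 3))
            submaps.append(F.conv2d(prev_feature_map, rotated_kernel, padding=1))
        # F_i = cat(F_{i,0}, F_{i,90}, F_{i,180}, F_{i,270})
        return torch.relu(torch.cat(submaps, dim=1))

For an input F_{i-1} of shape (N, C, H, W), the output has 4*out_channels channels, matching the 4C′*H′*W′ size noted above.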
Based on the convolutional network introduced above, when extracting image features in at least two different directions from the text picture, the text picture may be input into the convolutional network, and the convolutional network is used to extract image features of the text picture in at least two different directions, where the feature map output by each convolutional layer of the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.

It can be understood that each rotation of the convolution kernel allows image features in one direction to be extracted; with the rotation scheme illustrated in FIG. 5, image features in four directions can be extracted.

Of course, compared with performing no rotation of the convolution kernel at all, each additional rotation of the kernel allows image features in one more direction to be extracted, which in turn provides more accurate image features for subsequent text recognition and improves the accuracy of text recognition.

Some embodiments of the present application further describe two optional implementation architectures of the above convolutional network, as well as the process of extracting image features with the convolutional network.

In one optional manner, the convolution kernels of each convolutional layer of the convolutional network may include an original convolution kernel and the kernels obtained by rotating the original convolution kernel at least once.

On this basis, the process of extracting image features of the text picture in at least two different directions with the convolutional network may include:

S1. Performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels of each convolutional layer of the convolutional network, to obtain the feature submap extracted by each convolution kernel.

Since the convolution kernels of each convolutional layer include the original convolution kernel and its kernels after at least one rotation, the feature submaps respectively extracted by the original convolution kernel and by each of its rotated kernels can be obtained.

S2. Fusing the feature submaps extracted by the original convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer.

Specifically, the fusion of the feature submaps extracted by the original convolution kernel and its rotated kernels may follow the foregoing description, for example by concatenating the feature submaps along the channel dimension.

S3. Taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
In another optional manner, the convolution kernels of each convolutional layer of the convolutional network include only the original convolution kernels and do not include rotated kernels.

On this basis, the process of extracting image features of the text picture in at least two different directions with the convolutional network may include:

S1. Rotating the convolution kernel of each convolutional layer of the convolutional network at least once, and performing a convolution operation on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submap extracted by each kernel before and after rotation.

It can be understood that, when the convolutional layers of the convolutional network contain only the original convolution kernels, in order to extract image features from multiple different directions, the convolution kernel of each convolutional layer first needs to be rotated at least once during feature extraction, and then a convolution operation is performed on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submaps extracted by each kernel before and after rotation.

S2. Fusing the feature submaps extracted by the convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer.

S3. Taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.

Comparing the two convolutional network architectures, in the former the kernels before and after rotation are configured in advance, so that during image feature extraction each kernel can be used directly for feature extraction; in the latter only the kernels before rotation are configured, so that during image feature extraction the kernel first needs to be rotated at least once before the kernels before and after rotation can be used for feature extraction. Both implementations can extract multi-directional image features, and the choice may be made by a person skilled in the art according to actual needs.
On the basis of extracting image features in at least two different directions from the text picture with the convolutional network as described in the foregoing embodiments, the embodiments of the present application further describe step S120, namely the process of recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.

In this embodiment, a neural network model may be used to handle the text recognition task; that is, a recognition network may be trained in advance, and the recognition network and the convolutional network together form the text recognition model. Specifically, the output of the convolutional network serves as the input of the recognition network, and the convolutional network and the recognition network are trained jointly.

The text recognition model is trained with sample picture training data annotated with text content recognition results.
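The joint training can be sketched as follows; this is a minimal illustration under assumptions (a cross-entropy objective, the Adam optimizer, and the hypothetical helpers build_conv_net, build_recognition_net and dataloader), since the text only states that the two networks are trained jointly on annotated sample pictures.

import torch
import torch.nn as nn

# Hypothetical builders and data loader standing in for the networks described above;
# `dataloader` yields (picture, target_token_ids) pairs from the annotated samples.
conv_net, recognition_net = build_conv_net(), build_recognition_net()
params = list(conv_net.parameters()) + list(recognition_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.CrossEntropyLoss()

for picture, target in dataloader:
    image_features = conv_net(picture)        # multi-direction image features F
    logits = recognition_net(image_features)  # (batch, seq_len, vocab_size)
    loss = criterion(logits.flatten(0, 1), target.flatten())
    optimizer.zero_grad()
    loss.backward()                           # gradients flow through both networks jointly
    optimizer.step()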
On this basis, by inputting the text picture to be recognized into the convolutional network of the text recognition model, image features in at least two different directions can be extracted; the extracted image features are then input into the recognition network, and the recognition network outputs the text content contained in the text picture.

The recognition network may adopt any of various neural network architectures, for example an Encoder-Decoder architecture, as shown in FIG. 6.

The encoder may adopt a bidirectional LSTM (Long Short-Term Memory) structure, which takes the image features F output by the convolutional network in the previous step as input and outputs the hidden state h_i of each frame of the encoder.

The decoder may adopt a GRU (Gated Recurrent Unit) or LSTM structure. For the decoder hidden state s_t at the current time step, an attention mechanism may be used to compute the correlation between s_t and each encoder frame hidden state h_i, so as to obtain the context feature vector c_t. The computation is as follows:
e ti=o(s t,h i) e ti =o(s t ,h i )
Figure PCTCN2021139972-appb-000001
Figure PCTCN2021139972-appb-000001
Figure PCTCN2021139972-appb-000002
Figure PCTCN2021139972-appb-000002
where o denotes the dot-product operation and T denotes the encoder length.

Finally, the text prediction y_t of the decoder at the current time step is obtained from the current hidden state s_t and the context feature vector c_t, which are passed together through a linear classification layer W.
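A compact sketch of the attention step is given below; the dot-product score follows the o(s_t, h_i) form above, while the classification over the concatenated [s_t; c_t] (the helper linear_W) is an assumption about how the linear layer W is applied.

import torch

def attention_context(s_t: torch.Tensor, encoder_states: torch.Tensor) -> torch.Tensor:
    """s_t: decoder hidden state of shape (hidden,);
    encoder_states: frames h_1..h_T of shape (T, hidden)."""
    scores = encoder_states @ s_t          # e_{t,i} = s_t . h_i
    alphas = torch.softmax(scores, dim=0)  # alpha_{t,i} = exp(e_{t,i}) / sum_j exp(e_{t,j})
    return alphas @ encoder_states         # c_t = sum_i alpha_{t,i} * h_i

# The prediction y_t would then be, for example (linear_W is an assumed nn.Linear layer):
# y_t = torch.softmax(linear_W(torch.cat([s_t, c_t])), dim=-1)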
In some embodiments of the present application, another implementation of the text recognition method is described.

The acquired text picture to be recognized may suffer from flipped text; as shown in FIG. 7, the text in the upper picture of FIG. 7 is flipped.

In addition, with reference to the text picture rotation processes corresponding to FIG. 3 and FIG. 4, the final text picture may also end up with flipped text as a result of the rotation.

If a text picture with flipped text is input into the text recognition model for recognition, the recognized text content may be inaccurate, or the word order of the recognized text content may be reversed.

Therefore, in this embodiment, before the text picture is input into the pre-built convolutional network, the following processing steps are further added:

The text picture is taken as a forward text picture, and the forward text picture is rotated by 180 degrees to obtain a reverse text picture.

On this basis, the forward text picture and the reverse text picture are separately input into the convolutional network of the text recognition model, so as to obtain the text content contained in the forward text picture and its confidence, as well as the text content contained in the reverse text picture and its confidence, both output by the text recognition model.

Between the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence is taken as the final recognition result.

By separately inputting the forward and reverse text pictures into the text recognition model and selecting the recognized text content with the higher confidence as the final recognition result, text pictures in different orientations can be handled and the final recognition result is more accurate.
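A sketch of this forward/reverse selection is shown below; recognize() stands in for a full pass through the text recognition model, and its returned (text, confidence) pair is an assumption made for the example.

import numpy as np

def recognize_with_flip(text_picture: np.ndarray, recognize) -> str:
    """Run the model on the forward picture and its 180-degree rotation,
    and keep the transcription with the higher confidence."""
    forward_picture = text_picture
    reverse_picture = np.rot90(text_picture, k=2)  # rotate 180 degrees
    forward_text, forward_conf = recognize(forward_picture)
    reverse_text, reverse_conf = recognize(reverse_picture)
    return forward_text if forward_conf >= reverse_conf else reverse_text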
The text recognition apparatus provided by the embodiments of the present application is described below; the text recognition apparatus described below and the text recognition method described above may be referred to in correspondence with each other.

Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a text recognition apparatus disclosed in an embodiment of the present application.

As shown in FIG. 8, the apparatus may include:

a picture acquisition unit 11, configured to acquire a text picture to be recognized, the text picture being the image area where the text to be recognized is located;

a feature extraction unit 12, configured to extract image features in at least two different directions from the text picture; and

a text content recognition unit 13, configured to recognize the text content contained in the text picture based on the extracted image features in the at least two different directions.

Optionally, the above feature extraction unit may include:

a convolutional network processing unit, configured to input the text picture into a pre-built convolutional network and to extract, with the convolutional network, image features of the text picture in at least two different directions, where the feature map output by each convolutional layer of the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.
Optionally, the embodiments of the present application provide two optional implementation structures of the convolutional network processing unit, as follows.

In the first structure, the convolutional network processing unit includes:

a first convolution operation unit, configured to perform a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels of each convolutional layer of the convolutional network, to obtain the feature submap extracted by each convolution kernel, where the convolution kernels of each convolutional layer include an original convolution kernel and its kernels after at least one rotation;

a first feature fusion unit, configured to fuse the feature submaps extracted by the original convolution kernel and by each of its rotated kernels, and to input the fused feature map into the next convolutional layer; and

a first convolution output unit, configured to take the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.

In the second structure, the convolutional network processing unit includes:

a convolution kernel rotation unit, configured to rotate the convolution kernel of each convolutional layer of the convolutional network at least once;

a second convolution operation unit, configured to perform a convolution operation on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submap extracted by each kernel before and after rotation;

a second feature fusion unit, configured to fuse the feature submaps extracted by the convolution kernel and by each of its rotated kernels, and to input the fused feature map into the next convolutional layer; and

a second convolution output unit, configured to take the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
Optionally, the above at least two feature submaps may include:

a feature submap obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and

feature submaps obtained by rotating the same convolution kernel by 90 degrees, 180 degrees and/or 270 degrees in a set direction and then performing a convolution operation on the feature map output by the previous convolutional layer with the rotated kernel.

Optionally, the above text content recognition unit may include:

a recognition network processing unit, configured to input the extracted image features in the at least two different directions into a pre-built recognition network to obtain the text content contained in the text picture as output by the recognition network, where the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained with sample picture training data annotated with text content recognition results.
Optionally, the above picture acquisition unit may include:

an original picture acquisition unit, configured to acquire an original text picture to be recognized, the original text picture being rectangular; and

a first rotation unit, configured to rotate the original text picture to the horizontal direction as the text picture to be recognized if it is detected that the original text picture is tilted relative to the horizontal direction.

Further optionally, the above picture acquisition unit may also include:

an aspect ratio calculation unit, configured to calculate the aspect ratio of the horizontally oriented original text picture after the processing by the first rotation unit; and

a second rotation unit, configured to rotate the horizontally oriented original text picture by 90 degrees as the text picture to be recognized if it is determined that the aspect ratio exceeds a set threshold.
Optionally, the apparatus of the present application may further include:

a third rotation unit, configured to, before the text picture is input into the pre-built convolutional network, take the text picture as a forward text picture and rotate the forward text picture by 180 degrees to obtain a reverse text picture. On this basis, the above convolutional network processing unit may include:

a forward/reverse text picture input unit, configured to separately input the forward text picture and the reverse text picture into the convolutional network of the text recognition model, to obtain the text content contained in the forward text picture and its confidence, as well as the text content contained in the reverse text picture and its confidence, both output by the text recognition model; and

a confidence selection unit, configured to take, between the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
The text recognition apparatus provided by the embodiments of the present application may be applied to a text recognition device, such as a terminal: a mobile phone, a computer, etc. Optionally, FIG. 9 shows a block diagram of the hardware structure of the text recognition device. Referring to FIG. 9, the hardware structure of the text recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4.

In the embodiments of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4.

The processor 1 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.

The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory.

The memory stores a program, and the processor may call the program stored in the memory, the program being used for:

acquiring a text picture to be recognized, the text picture being the image area where the text to be recognized is located;

extracting image features in at least two different directions from the text picture; and

recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.

Optionally, for the refined and extended functions of the program, reference may be made to the description above.
An embodiment of the present application further provides a storage medium, which may store a program suitable for execution by a processor, the program being used for:

acquiring a text picture to be recognized, the text picture being the image area where the text to be recognized is located;

extracting image features in at least two different directions from the text picture; and

recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.

Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, the embodiments may be combined with one another as needed, and for the same or similar parts reference may be made between them.

The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

  1. A text recognition method, characterized in that it comprises:
    acquiring a text picture to be recognized, the text picture being the image area where the text to be recognized is located;
    extracting image features in at least two different directions from the text picture; and
    recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions.
  2. The method according to claim 1, characterized in that the extracting image features in at least two different directions from the text picture comprises:
    inputting the text picture into a pre-built convolutional network; and
    extracting image features of the text picture in at least two different directions with the convolutional network, wherein the feature map output by each convolutional layer of the convolutional network is formed by fusing at least two feature submaps, and the at least two feature submaps are obtained by performing, with the same convolution kernel before rotation and after at least one rotation, a convolution operation on the feature map output by the previous convolutional layer.
  3. The method according to claim 2, characterized in that the extracting image features of the text picture in at least two different directions with the convolutional network comprises:
    performing a convolution operation on the feature map output by the previous convolutional layer with the convolution kernels of each convolutional layer of the convolutional network, to obtain the feature submap extracted by each convolution kernel, the convolution kernels of each convolutional layer comprising an original convolution kernel and its kernels after at least one rotation;
    fusing the feature submaps extracted by the original convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer; and
    taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
  4. The method according to claim 2, characterized in that the extracting image features of the text picture in at least two different directions with the convolutional network comprises:
    rotating the convolution kernel of each convolutional layer of the convolutional network at least once, and performing a convolution operation on the feature map output by the previous convolutional layer with the kernels before and after rotation, to obtain the feature submap extracted by each kernel before and after rotation;
    fusing the feature submaps extracted by the convolution kernel and by each of its rotated kernels, and inputting the fused feature map into the next convolutional layer; and
    taking the feature map output by the last convolutional layer of the convolutional network as the image features of the text picture.
  5. The method according to claim 2, characterized in that the at least two feature submaps comprise:
    a feature submap obtained by performing a convolution operation on the feature map output by the previous convolutional layer with the same convolution kernel before rotation; and
    feature submaps obtained by rotating the same convolution kernel by 90 degrees, 180 degrees and/or 270 degrees in a set direction and then performing a convolution operation on the feature map output by the previous convolutional layer with the rotated kernel.
  6. The method according to claim 2, characterized in that the recognizing the text content contained in the text picture based on the extracted image features in the at least two different directions comprises:
    inputting the extracted image features in the at least two different directions into a pre-built recognition network, to obtain the text content contained in the text picture as output by the recognition network,
    wherein the recognition network and the convolutional network form a text recognition model, and the text recognition model is trained with sample picture training data annotated with text content recognition results.
  7. The method according to any one of claims 1 to 6, characterized in that the acquiring a text picture to be recognized comprises:
    acquiring an original text picture to be recognized; and
    if it is detected that the original text picture is tilted relative to the horizontal direction, rotating the original text picture to the horizontal direction as the text picture to be recognized.
  8. The method according to claim 7, characterized in that, after the rotating the original text picture to the horizontal direction, the method further comprises:
    calculating the aspect ratio of the horizontally oriented original text picture; and
    if it is determined that the aspect ratio exceeds a set threshold, rotating the horizontally oriented original text picture by 90 degrees as the text picture to be recognized.
  9. The method according to claim 6, characterized in that, before the inputting the text picture into a pre-built convolutional network, the method further comprises:
    taking the text picture as a forward text picture, and rotating the forward text picture by 180 degrees to obtain a reverse text picture;
    and the inputting the text picture into a pre-built convolutional network comprises:
    separately inputting the forward text picture and the reverse text picture into the convolutional network of the text recognition model, to obtain the text content contained in the forward text picture and its confidence, as well as the text content contained in the reverse text picture and its confidence, both output by the text recognition model; and
    taking, between the text content contained in the forward text picture and the text content contained in the reverse text picture, the one with the higher confidence as the final recognition result.
  10. A text recognition apparatus, characterized in that it comprises:
    a picture acquisition unit, configured to acquire a text picture to be recognized, the text picture being the image area where the text to be recognized is located;
    a feature extraction unit, configured to extract image features in at least two different directions from the text picture; and
    a text content recognition unit, configured to recognize the text content contained in the text picture based on the extracted image features in the at least two different directions.
  11. A text recognition device, characterized in that it comprises: a memory and a processor;
    the memory being configured to store a program; and
    the processor being configured to execute the program to implement the steps of the text recognition method according to any one of claims 1 to 9.
  12. A storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the text recognition method according to any one of claims 1 to 9 are implemented.
PCT/CN2021/139972 2021-06-16 2021-12-21 Text identification method, apparatus and device, and storage medium WO2022262239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110666915.6 2021-06-16
CN202110666915.6A CN113392825B (en) 2021-06-16 2021-06-16 Text recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022262239A1 true WO2022262239A1 (en) 2022-12-22

Family

ID=77621485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/139972 WO2022262239A1 (en) 2021-06-16 2021-12-21 Text identification method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN113392825B (en)
WO (1) WO2022262239A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392825B (en) * 2021-06-16 2024-04-30 中国科学技术大学 Text recognition method, device, equipment and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222589B (en) * 2018-11-27 2023-07-18 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111783756B (en) * 2019-04-03 2024-04-16 北京市商汤科技开发有限公司 Text recognition method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320961A (en) * 2015-10-16 2016-02-10 重庆邮电大学 Handwriting numeral recognition method based on convolutional neural network and support vector machine
CN110659633A (en) * 2019-08-15 2020-01-07 坎德拉(深圳)科技创新有限公司 Image text information recognition method and device and storage medium
CN111400497A (en) * 2020-03-19 2020-07-10 北京远鉴信息技术有限公司 Text recognition method and device, storage medium and electronic equipment
CN112101351A (en) * 2020-09-07 2020-12-18 凌云光技术股份有限公司 Projection-based text line rotation correction method and device
AU2021100391A4 (en) * 2021-01-22 2021-04-15 GRG Banking Equipment Co.,Ltd Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism
CN113392825A (en) * 2021-06-16 2021-09-14 科大讯飞股份有限公司 Text recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113392825B (en) 2024-04-30
CN113392825A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN109977956B (en) Image processing method and device, electronic equipment and storage medium
CN109146892B (en) Image clipping method and device based on aesthetics
US10776671B2 (en) Joint blur map estimation and blur desirability classification from an image
TWI766855B (en) A character recognition method and device
US10134165B2 (en) Image distractor detection and processing
WO2019174130A1 (en) Bill recognition method, server, and computer readable storage medium
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
US20220230282A1 (en) Image processing method, image processing apparatus, electronic device and computer-readable storage medium
CN108304775A (en) Remote sensing images recognition methods, device, storage medium and electronic equipment
CN111091123A (en) Text region detection method and equipment
JP2013522971A (en) Image feature detection based on the application of multiple feature detectors
WO2020097909A1 (en) Text detection method and apparatus, and storage medium
CN111539412B (en) Image analysis method, system, device and medium based on OCR
CN110533039A (en) A kind of true-false detection method of license plate, device and equipment
CN109271910A (en) A kind of Text region, character translation method and apparatus
WO2022002262A1 (en) Character sequence recognition method and apparatus based on computer vision, and device and medium
WO2022262239A1 (en) Text identification method, apparatus and device, and storage medium
WO2019148923A1 (en) Method and apparatus for searching for images with image, electronic device, and storage medium
CN111368632A (en) Signature identification method and device
WO2023178930A1 (en) Image recognition method and apparatus, training method and apparatus, system, and storage medium
KR20190080388A (en) Photo Horizon Correction Method based on convolutional neural network and residual network structure
US8270731B2 (en) Image classification using range information
US9665963B1 (en) Dynamic collage layout generation
CN114494775A (en) Video segmentation method, device, equipment and storage medium
CN116977336A (en) Camera defect detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945815

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE