WO2023045721A1 - Image language identification method and related device thereof - Google Patents

Image language identification method and related device thereof

Info

Publication number
WO2023045721A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
model
language
feature
text
Prior art date
Application number
PCT/CN2022/116011
Other languages
French (fr)
Chinese (zh)
Inventor
毛晓飞
黄灿
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023045721A1 publication Critical patent/WO2023045721A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present application relates to the technical field of image processing, in particular to an image language recognition method and related equipment.
  • the present application provides an image language recognition method and related equipment, which can accurately identify the language to which an image data belongs.
  • An embodiment of the present application provides an image language recognition method, the method comprising:
  • N text images to be used are extracted from the image to be processed; wherein, N is a positive integer;
  • the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used are determined; wherein, n is a positive integer, n ≤ N;
  • the image extraction feature of the nth text image to be used is determined according to the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used; wherein, n is a positive integer, n ≤ N;
  • the language recognition result of the image to be processed is determined according to the image extraction features of the N text images to be used.
  • the visual extraction features include at least one of character density features, color distribution features, and image position features.
  • the determination process of the character density feature of the nth text image to be used includes:
  • the determination process of the image location feature of the nth text image to be used includes:
  • the process of determining the language extraction feature of the nth text image to be used includes:
  • the visual extraction features include character density features, color distribution features, and image position features
  • determining the image extraction feature of the nth text image to be used includes:
  • the language extraction feature of the nth text image to be used, the character density feature of the nth text image to be used, the color distribution feature of the nth text image to be used, and the image position feature of the nth text image to be used are spliced to obtain the image extraction feature of the nth text image to be used.
  • the determining the language recognition result of the image to be processed according to the image extraction features of the N text images to be used includes:
  • the construction process of the image language recognition model includes:
  • the model to be trained is updated, and the step of inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained is continued.
  • the model to be trained includes a language feature extraction network, a density feature extraction network, a color feature extraction network, a location feature extraction network, a feature splicing network, and an image language recognition network; wherein, the input data of the image language recognition network includes the output data of the feature splicing network; the input data of the feature splicing network includes the output data of the language feature extraction network, the output data of the density feature extraction network, the output data of the color feature extraction network, and the output data of the location feature extraction network;
  • determining the image language recognition model includes:
  • the image language recognition network in the model to be trained is determined as the image language recognition model.
  • before inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, the process of constructing the image language recognition model further includes:
  • the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network are initialized.
  • the embodiment of the present application also provides an image language recognition device, including:
  • An image extraction unit configured to extract N text images to be used from the image to be processed according to the text detection result of the image to be processed after acquiring the image to be processed; wherein, N is a positive integer;
  • a feature determination unit configured to determine the language extraction features of the nth text image to be used and the visual extraction features of the nth text image to be used; wherein, n is a positive integer, n ≤ N;
  • a feature processing unit configured to determine the image extraction features of the nth text image to be used according to the language extraction features of the nth text image to be used and the visual extraction features of the nth text image to be used; wherein, n is a positive integer, n ≤ N;
  • the language recognition unit is configured to determine the language recognition result of the image to be processed according to the image extraction features of the N text images to be used.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store computer programs
  • the processor is configured to execute any implementation of the image language recognition method provided in the embodiments of the present application according to the computer program.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation of the image language recognition method provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the image language recognition method provided in the embodiment of the present application.
  • the embodiment of the present application has at least the following advantages:
  • After acquiring the image to be processed, N text images to be used are first extracted from the image to be processed according to the text detection result of the image to be processed; then the language extraction feature and the visual extraction feature of the nth text image to be used are determined, and the image extraction feature of the nth text image to be used is determined from them; finally, the language recognition result of the image to be processed is determined according to the image extraction features of the N text images to be used.
  • Fig. 1 is a schematic diagram of a kind of image data provided by the embodiment of the present application.
  • FIG. 2 is a flow chart of a method for recognizing an image language provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a text area provided by an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a language feature extraction model provided by the embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a density feature extraction model provided in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a color feature extraction model provided in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an image language recognition model provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an image language recognition device provided by an embodiment of the present application.
  • Because the image data shown in FIG. 1 carries character information belonging to Vietnamese and character information belonging to English at the same time, and the amount of character information belonging to Vietnamese is far greater than the amount of character information belonging to English, it can be determined that the language of the image data is Vietnamese.
  • the embodiment of the present application provides an image language recognition method, the method includes: after acquiring the image to be processed, first extracting N text images to be used from the image to be processed according to the text detection result of the image to be processed; then determining the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used, where n is a positive integer, n ≤ N, and N is a positive integer; then determining the image extraction feature of the nth text image to be used according to the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used; and finally determining the language recognition result of the image to be processed according to the image extraction features of the N text images to be used, so that the language recognition result can accurately represent the language to which the image to be processed belongs.
  • the embodiment of the present application does not limit the subject of execution of the image language recognition method.
  • the image language recognition method provided in the embodiment of the present application can be applied to data processing devices such as terminal devices or servers.
  • the terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), or a tablet computer.
  • the server can be an independent server, a cluster server or a cloud server.
  • this figure is a flow chart of a method for recognizing an image language provided by an embodiment of the present application.
  • the image language recognition method provided in the embodiment of this application includes S1-S5:
  • S1: N text images to be used are extracted from the image to be processed according to the text detection result of the image to be processed.
  • N is a positive integer.
  • the "image to be processed" refers to the image data (for example, the image data shown in Fig. 1) that needs image language recognition processing; and the "image to be processed" includes character information in at least one language.
  • the "text detection result of the image to be processed” is used to indicate the location of at least one text region in the image to be processed.
  • the text detection result of the image to be processed may include the position description data of the first text area, the position description data of the second text area, ..., and the position description data of the fifth text area.
  • the "position description data of the first text area" is used to indicate the position of the first text area in the image data shown in Figure 1; the "position description data of the second text area" is used to indicate the position of the second text area in the image data shown in Figure 1; and so on, up to the "position description data of the fifth text area", which is used to indicate the position of the fifth text area in the image data shown in Figure 1.
  • the embodiment of the present application does not limit the determination process of the above-mentioned "text detection result of the image to be processed"; for example, a pre-built text detection model may be used to perform text detection processing on the image to be processed to obtain the text detection result of the image to be processed.
  • the "text detection model" is used to perform text position detection processing on the input data of the text detection model; and the embodiment of the present application does not limit the "text detection model"; for example, it can be implemented by any machine learning model (for example, a deep learning model based on a convolutional neural network, etc.).
  • the above “text detection model” can be constructed according to the first sample image and the actual text position of the first sample image.
  • the "actual text position of the first sample image" is used to indicate the actual positions of all text regions in the first sample image; and the embodiment of the present application does not limit the method of obtaining the "actual text position of the first sample image"; for example, it can be implemented by manual labeling.
  • the nth text image to be used is used to represent the image information carried by the nth text area in the image to be processed; and this embodiment of the present application does not limit the determination process of the "nth text image to be used". For example, when the above "text detection result of the image to be processed" includes the position description data of the nth text area, image interception processing can be performed on the image to be processed according to the position description data of the nth text area to obtain the nth text image to be used, so that the nth text image to be used includes the nth text area, as in the sketch below.
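  • As an illustration of this image interception processing, the following sketch (a hypothetical helper, assuming the position description data of each text area takes the form of an (x, y, width, height) pixel rectangle, which the text does not specify) crops the N text images to be used out of the image to be processed:

```python
import numpy as np

def crop_text_images(image: np.ndarray,
                     boxes: list[tuple[int, int, int, int]]) -> list[np.ndarray]:
    """Crop one text image to be used per detected text area.

    `image` is the image to be processed as an (H, W, C) array; `boxes` holds
    the position description data of the N text areas as (x, y, w, h) pixel
    rectangles (an assumed format).
    """
    text_images = []
    for x, y, w, h in boxes:
        # Image interception: keep only the pixels inside the nth text area.
        text_images.append(image[y:y + h, x:x + w].copy())
    return text_images
```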
  • S2: Determine the language extraction feature of the nth text image to be used; wherein, n is a positive integer, n ≤ N.
  • the "language extraction feature of the nth text image to be used" is used to represent the language information carried in the nth text region in the image to be processed.
  • the embodiment of the present application does not limit the implementation of S2.
  • it may specifically include: inputting the nth text image to be used into a pre-built language feature extraction model, and obtaining the language extraction feature of the nth text image to be used output by the language feature extraction model.
  • the language feature extraction model is used to perform language feature extraction processing on the input data of the language feature extraction model; and the embodiment of the present application does not limit the "language feature extraction model"; for example, it can be implemented by any machine learning model (for example, a deep learning model based on a self-attention neural network, etc.).
  • the above “language feature extraction model” may be constructed according to the first text image and the actual language features of the first text image.
  • the "actual language feature of the first text image” is used to indicate the language information actually carried by the first text image; and this embodiment of the present application does not limit the acquisition method of the "actual language feature of the first text image", for example, It can be implemented by manual labeling.
  • the embodiment of the present application does not limit the construction process of the above "language feature extraction model", for example, any existing or future machine learning model construction method can be used for implementation.
  • the model construction method shown in the second method embodiment can be used for implementation.
  • the embodiment of the present application also provides a possible implementation of the "language feature extraction model", which may specifically include: an image feature extraction layer, a position coding layer, a first feature fusion layer, a first feature encoding layer, and a first linear processing layer.
  • the input data of the first linear processing layer includes the output data of the first feature coding layer;
  • the input data of the first feature coding layer includes the output data of the first feature fusion layer;
  • the input data of the first feature fusion layer includes the output data of the image feature extraction layer and the output data of the position encoding layer.
  • the image feature extraction layer is used to perform image feature extraction processing for text image data (for example, the nth text image to be used); and the embodiment of the present application does not limit the implementation of the "image feature extraction layer"; for example, it can be specifically implemented using the convolutional neural network (Convolutional Neural Networks, CNN) shown in FIG. 4.
  • the position coding layer is used to perform position coding processing for text image data; and the embodiment of the present application does not limit the implementation of the "position coding layer"; for example, it can be implemented using any position coding processing method (for example, the Positional Encoding module in the transformer model).
  • first feature fusion layer is used to perform feature fusion processing (for example, the summation processing shown in Figure 4) for the input data of the first feature fusion layer; and the embodiment of the present application does not limit the "first feature fusion layer", for example, it can be implemented using any feature fusion processing method (for example, the feature fusion processing method involved in the transformer model).
  • the first feature encoding layer is used to perform encoding processing on the input data of the first feature encoding layer; and the embodiment of the present application does not limit the "first feature encoding layer"; for example, it can be implemented by L1 first encoding networks (e.g., the Encoder module in the transformer model), where L1 is a positive integer.
  • the first linear processing layer is used to perform linear processing on the input data of the first linear processing layer; and the embodiment of the present application does not limit the implementation of the "first linear processing layer"; for example, it can be implemented using any linear processing method (for example, the linear module in the transformer model).
  • In FIG. 4, CNN is used to represent the above-mentioned "image feature extraction layer";
  • Positional Encoding is used to represent the above-mentioned "position encoding layer";
  • "+" is used to represent the above-mentioned "first feature fusion layer";
  • Multi-head attention refers to the multi-head self-attention network;
  • Add & Norm refers to feature addition processing and feature normalization processing;
  • Feed forward refers to the feedforward neural network;
  • L1 indicates the number of first encoding networks.
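  • The following is a minimal PyTorch sketch of a "language feature extraction model" with the structure described above and in FIG. 4. The feature dimension (512), the number of attention heads, the CNN backbone, the pooling, and the use of a learned position embedding in place of the Positional Encoding module are all assumptions for illustration, not details fixed by the text:

```python
import torch
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    """CNN -> position encoding -> L1 encoder blocks -> linear, as in FIG. 4."""

    def __init__(self, d_model: int = 512, l1: int = 4, max_len: int = 256):
        super().__init__()
        # Image feature extraction layer (a small CNN stand-in).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, max_len)),  # collapse height, keep a token axis
        )
        # Position coding layer (learned embedding as a simple stand-in).
        self.pos_embedding = nn.Parameter(torch.zeros(1, max_len, d_model))
        # First feature encoding layer: L1 stacked encoder blocks.
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=l1)
        # First linear processing layer producing one language extraction feature.
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, text_image: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(text_image)              # (B, d_model, 1, max_len)
        feats = feats.squeeze(2).transpose(1, 2)  # (B, max_len, d_model)
        feats = feats + self.pos_embedding        # first feature fusion layer: summation
        feats = self.encoder(feats)
        return self.linear(feats.mean(dim=1))     # (B, d_model) language extraction feature
```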
  • S3: Determine the visual extraction feature of the nth text image to be used; wherein, n is a positive integer, n ≤ N.
  • the "visual extraction feature of the nth text image to be used" is used to represent the image feature information carried by the nth text area in the image to be processed (for example, character density, color distribution, position distribution in the image to be processed, etc.).
  • the embodiment of the present application does not limit the above "visual extraction features of the nth text image to be used"; for example, they may specifically include at least one of the character density feature of the nth text image to be used, the color distribution feature of the nth text image to be used, and the image position feature of the nth text image to be used.
  • the "character density feature of the nth text image to be used" is used to represent the distribution density of characters in the nth text image to be used; and the embodiment of the present application does not limit the determination process of the "character density feature of the nth text image to be used"; for example, it may specifically include: inputting the nth text image to be used into a pre-built density feature extraction model, and obtaining the character density feature of the nth text image to be used output by the density feature extraction model.
  • the above "density feature extraction model" is used to perform character density feature extraction processing on the input data of the density feature extraction model; and the embodiment of the present application does not limit the "density feature extraction model"; for example, it can be implemented by any machine learning model (for example, a deep learning model based on a self-attention neural network, etc.).
  • the above “density feature extraction model” may be constructed according to the second text image and the actual density features of the second text image.
  • the "actual density feature of the second text image” is used to represent the actual character distribution density in the second text image; and this embodiment of the present application does not limit the acquisition method of the "actual density feature of the second text image", for example, It can be implemented by manual labeling.
  • the embodiment of the present application does not limit the construction process of the above "density feature extraction model", for example, any existing or future machine learning model construction method can be used for implementation.
  • the model construction method shown in the second method embodiment can be used for implementation.
  • this embodiment of the present application does not limit the relationship between the above-mentioned "second text image" and the above-mentioned "first text image"; the two may refer to the same text image data, or may refer to different text image data.
  • the embodiment of the present application also provides a possible implementation of the "density feature extraction model", which may specifically include: an image feature extraction layer and L2 second encoding networks.
  • L2 is a positive integer, and the embodiment of the present application does not limit L2; for example, as shown in FIG. 5, L2 may specifically be 4.
  • the embodiment of the present application does not limit the above “second encoding network”, and any existing or future encoding network (for example, the Encoder module in the transformer model, the Encoder module in the conformer model, etc.) can be used for implementation.
  • the above "color distribution feature of the nth text image to be used" is used to represent the color distribution state in the nth text image to be used (in particular, the difference between the character color and the background color); and the embodiment of the present application does not limit the determination process of the "color distribution feature of the nth text image to be used"; for example, it may specifically include: inputting the nth text image to be used into a pre-built color feature extraction model to obtain the color distribution feature of the nth text image to be used output by the color feature extraction model.
  • the color feature extraction model is used to perform color distribution feature extraction processing on the input data of the color feature extraction model; and the embodiment of the present application does not limit the "color feature extraction model"; for example, it can be implemented by any machine learning model (for example, a deep learning model based on a convolutional neural network, etc.).
  • the above “color feature extraction model” may be constructed according to the third text image and the actual color features of the third text image.
  • the "actual color feature of the third text image” is used to represent the actual color distribution state in the third text image; and the embodiment of the present application does not limit the acquisition method of the "actual color feature of the third text image", for example, It can be implemented by manual labeling.
  • the embodiment of the present application does not limit the construction process of the above "color feature extraction model", for example, any existing or future machine learning model construction method can be used for implementation.
  • the model construction method shown in the second method embodiment can be used for implementation.
  • this embodiment of the present application does not limit the relationship between the above-mentioned "third text image", the above-mentioned "second text image", and the above-mentioned "first text image"; the three may refer to the same text image data, or may refer to different text image data.
  • the embodiment of the present application also provides a possible implementation of the "color feature extraction model", which may specifically include: an image feature extraction layer and L3 second encoding networks.
  • L3 is a positive integer, and the embodiment of the present application does not limit L3; for example, as shown in FIG. 6, L3 may specifically be 2.
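  • Since the density feature extraction model (FIG. 5, with L2 = 4) and the color feature extraction model (FIG. 6, with L3 = 2) share the same overall structure, an image feature extraction layer followed by a stack of second encoding networks, one hedged PyTorch sketch can cover both; the dimensions and the pooling are assumptions:

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """Image feature extraction layer followed by a stack of second encoding
    networks, as in FIGs. 5 and 6; all dimensions are assumed."""

    def __init__(self, num_encoders: int, d_model: int = 512, seq_len: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, seq_len)),
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoders)

    def forward(self, text_image: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(text_image).squeeze(2).transpose(1, 2)  # (B, seq_len, d_model)
        return self.encoder(feats).mean(dim=1)                   # one feature vector per image

# L2 = 4 for the density feature extraction model, L3 = 2 for the color one.
density_model = VisualFeatureExtractor(num_encoders=4)
color_model = VisualFeatureExtractor(num_encoders=2)
```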
  • the image position feature of the nth text image to be used refers to the position distribution state, in the above-mentioned "image to be processed", of the character information in the nth text image to be used; and the embodiment of the present application does not limit the determination process of the "image position feature of the nth text image to be used"; for example, it may specifically include: inputting the position description information of the nth text image to be used into a pre-built position feature extraction model to obtain the image position feature of the nth text image to be used output by the position feature extraction model.
  • the "position description information of the nth text image to be used" is used to describe the position, in the above "image to be processed", of the character information in the nth text image to be used; and the embodiment of the present application does not limit the determination process of the position description information of the nth text image to be used; for example, when the above "nth text image to be used" is used to represent the image information carried by the nth text area in the image to be processed, the position description information of the nth text area in the image to be processed may be determined as the position description information of the nth text image to be used.
  • the position feature extraction model is used to perform image position feature extraction processing on the input data of the position feature extraction model; and the embodiment of the present application does not limit the "position feature extraction model"; for example, it can be implemented by any machine learning model (for example, a machine learning model based on fully connected layers, etc.).
  • the "location feature extraction model” may include two fully connected layers.
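  • A minimal sketch of such a two-fully-connected-layer "location feature extraction model", assuming the position description information is a normalized (x, y, width, height) vector and the output is a 512-d image position feature (both assumptions):

```python
import torch
import torch.nn as nn

class PositionFeatureExtractor(nn.Module):
    """Two fully connected layers mapping position description information
    to an image position feature."""

    def __init__(self, in_dim: int = 4, d_model: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, d_model)
        self.fc2 = nn.Linear(d_model, d_model)

    def forward(self, position_info: torch.Tensor) -> torch.Tensor:
        # position_info: (B, 4) normalized (x, y, w, h) boxes (assumed format).
        return self.fc2(torch.relu(self.fc1(position_info)))
```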
  • location feature extraction model may be constructed according to the location description information of the fourth text image and the actual location characteristics of the fourth text image.
  • the "position description information of the fourth text image" is used to indicate the position of the character information in the fourth text image in the sample image to be processed.
  • the "actual position feature of the fourth text image" is used to indicate the actual position distribution state, in the sample image to be processed, of the character information in the fourth text image; and the embodiment of the present application does not limit the method of obtaining the "actual position feature of the fourth text image"; for example, it can be implemented by manual labeling.
  • the embodiment of the present application does not limit the construction process of the above "location feature extraction model", for example, any existing or future machine learning model construction method can be used for implementation.
  • the model construction method shown in the second method embodiment can be used for implementation.
  • this embodiment of the present application does not limit the relationship between the above-mentioned "fourth text image", "third text image", "second text image", and "first text image"; the four may refer to the same text image data, or may refer to different text image data.
  • Based on the relevant content of the above S3, after the nth text image to be used is obtained, preset visual feature extraction processing (for example, character density feature extraction processing, color distribution feature extraction processing, and image position feature extraction processing, etc.) can be performed on the nth text image to be used to obtain the visual extraction feature of the nth text image to be used, so that the visual extraction feature can represent the image feature information carried by the nth text image to be used (for example, character density, color distribution, position distribution in the image to be processed, etc.).
  • S4: Determine the image extraction feature of the nth text image to be used according to the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used; wherein, n is a positive integer, n ≤ N.
  • the "image extraction feature of the nth text image to be used" is used to represent the image information carried by the nth text image to be used (for example, character language, character density, color distribution, position distribution in the image to be processed, and other information), so that the "image extraction feature of the nth text image to be used" can accurately represent the image information carried by the nth text area in the image to be processed.
  • For example, the implementation of S4 may specifically include: splicing the language extraction feature of the nth text image to be used, the character density feature of the nth text image to be used, the color distribution feature of the nth text image to be used, and the image position feature of the nth text image to be used to obtain the image extraction feature of the nth text image to be used.
  • this embodiment of the present application does not limit the implementation of the above "splicing".
  • For example, when the language extraction feature of the nth text image to be used, the character density feature of the nth text image to be used, the color distribution feature of the nth text image to be used, and the image position feature of the nth text image to be used are all 1 × 512 feature vectors, the image extraction feature of the nth text image to be used can be a 4 × 512 feature vector.
  • In this way, the image extraction feature of the nth text image to be used can be determined by referring to the above two kinds of extraction features, so that the image extraction feature can accurately represent the image information carried by the nth text area in the image to be processed (for example, character language, character density, color distribution, position distribution in the image to be processed, and other information).
  • S5: Determine the language recognition result of the image to be processed according to the image extraction features of the N text images to be used.
  • the "language recognition result of the image to be processed” is used to indicate the language to which the image to be processed belongs, so that the "language recognition result of the image to be processed” can accurately represent the language to which most of the character information in the image to be processed belongs ( For example, Vietnamese as shown in Figure 1).
  • S51: Concatenate the image extraction features of the N text images to be used to obtain the language representation data of the image to be processed.
  • the "language representation data of the image to be processed” is used to represent the distribution characteristics of at least one language in the image to be processed (eg, distribution range, distribution location, etc.).
  • this embodiment of the present application does not limit the implementation of the concatenation in S51.
  • For example, when the image extraction feature of each text image to be used is a 4 × 512 feature vector, the language representation data of the image to be processed can be an N × 4 × 512 feature vector, as illustrated in the sketch below.
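  • The two levels of splicing (S4 per text image, then S51 across the N text images) can be illustrated with the following sketch, assuming each individual feature is a 512-d vector as in the example above:

```python
import torch

def build_language_representation(per_image_features: list[dict[str, torch.Tensor]]) -> torch.Tensor:
    """Stack the four 512-d features of each text image into its 4 x 512
    image extraction feature (S4), then stack those over the N text images
    into the N x 4 x 512 language representation data (S51)."""
    image_extraction_features = []
    for feats in per_image_features:
        image_extraction_features.append(torch.stack([
            feats["language"],   # language extraction feature
            feats["density"],    # character density feature
            feats["color"],      # color distribution feature
            feats["position"],   # image position feature
        ]))                      # -> 4 x 512
    return torch.stack(image_extraction_features)  # -> N x 4 x 512
```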
  • S52: Input the language representation data of the image to be processed into the pre-built image language recognition model, and obtain the language recognition result of the image to be processed output by the image language recognition model.
  • the "image language recognition model" is used to perform language recognition processing on the input data of the image language recognition model; and the embodiment of the present application does not limit the "image language recognition model"; for example, it can be implemented by any machine learning model (for example, a deep learning model based on a self-attention neural network, etc.).
  • the embodiment of the present application also provides a possible implementation of the "image language recognition model", which may specifically include: L4 second encoding networks, a second linear processing layer, and a recognition layer.
  • L4 is a positive integer, and the embodiment of the present application does not limit L4; for example, as shown in FIG. 7, L4 may specifically be 6.
  • the second linear processing layer is used to perform linear processing on the input data of the second linear processing layer; and the embodiment of the present application does not limit the implementation of the "second linear processing layer"; for example, it can be implemented using any linear processing method (for example, the linear module in the transformer model).
  • the above "recognition layer" is used to perform language classification processing on the input data of the recognition layer; and the embodiment of the present application does not limit the implementation of the "recognition layer"; for example, it can be implemented using any classification method (for example, the softmax module in the transformer model).
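  • A hedged PyTorch sketch of such an "image language recognition model", following FIG. 7 with L4 = 6 second encoding networks; the feature dimension, pooling, and the size of the language set are assumptions:

```python
import torch
import torch.nn as nn

class ImageLanguageRecognizer(nn.Module):
    """L4 second encoding networks, a second linear processing layer, and a
    softmax recognition layer, as in FIG. 7."""

    def __init__(self, d_model: int = 512, num_languages: int = 10, l4: int = 6):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=l4)
        self.linear = nn.Linear(d_model, num_languages)

    def forward(self, language_representation: torch.Tensor) -> torch.Tensor:
        # language_representation: (B, N * 4, 512), i.e. the N x 4 x 512
        # language representation data flattened along the first two axes.
        encoded = self.encoder(language_representation)
        logits = self.linear(encoded.mean(dim=1))
        return torch.softmax(logits, dim=-1)  # per-language probabilities
```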
  • the embodiment of the present application does not limit the construction process of the above "image language recognition model".
  • the "image language recognition model” can be constructed according to the language representation data of the sample image to be used and the actual language of the sample image to be used.
  • the determination process of the "language representation data of the sample image to be used” is similar to the determination process of the above-mentioned "language representation data of the image to be processed".
  • the "actual language of the sample image to be used” is used to indicate the actual language of the sample image to be used.
  • it can be implemented by using the model building process shown in the second method embodiment.
  • the image extraction features of the N text images to be used can be spliced first to obtain the language representation data of the image to be processed; then language recognition processing is performed on the language representation data to obtain the language recognition result of the image to be processed, so that the language recognition result can represent the language to which most of the character information in the image to be processed belongs (for example, Vietnamese as shown in FIG. 1).
  • Based on the above, after the image to be processed is acquired, N text images to be used are first extracted from the image to be processed according to the text detection result of the image to be processed; then the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used are determined, where n is a positive integer, n ≤ N, and N is a positive integer; the image extraction feature of the nth text image to be used is then determined according to the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used; finally, the language recognition result of the image to be processed is determined according to the image extraction features of the N text images to be used, so that the language recognition result can accurately indicate the language to which the image to be processed belongs.
  • the embodiment of the present application also provides a model construction method, which may specifically include steps 11 to 16:
  • Step 11: Obtain the sample image to be used and the actual language of the sample image to be used.
  • sample image to be used refers to the image data required for the model building process; and the “sample image to be used” may include character information in at least one language.
  • the "actual language of the sample image to be used" is used to indicate the actual language of the sample image to be used; and the embodiment of the present application does not limit the acquisition method of the "actual language of the sample image to be used"; for example, it can be obtained by manual labeling.
  • Step 12: Determine at least one sample text image and the position description information of the at least one sample text image according to the text detection result of the sample image to be used.
  • the "text detection result of the sample image to be used" is used to indicate the location of at least one text region in the sample image to be used; and the determination process of the "text detection result of the sample image to be used" is similar to the determination process of the above "text detection result of the image to be processed".
  • the determination process of a "sample text image" is similar to the determination process of the above "text image to be used"; and the determination process of the "position description information of the sample text image" is similar to the determination process of the above "position description information of the text image to be used".
  • Step 13: Input at least one sample text image and the location description information of the at least one sample text image into the model to be trained, and obtain the language recognition result of the sample image to be used output by the model to be trained.
  • model to be trained is used to perform image language recognition processing on the input data of the model to be trained.
  • the embodiment of the present application does not limit the "model to be trained", for example, it may specifically include: language feature extraction network, density feature extraction network, color feature extraction network, location feature extraction network, feature splicing network, and image language recognition network.
  • the input data of the image language recognition network includes the output data of the feature splicing network;
  • the input data of the feature splicing network includes the output data of the language feature extraction network, the output data of the density feature extraction network, the output data of the color feature extraction network, and the output data of the position feature extraction network.
  • the language feature extraction network is used to perform language feature extraction processing for text image data (for example, each sample text image); and the embodiment of the present application does not limit the network structure of the "language feature extraction network"; for example, it can be implemented using the model structure of the above "language feature extraction model".
  • the above-mentioned "density feature extraction network" is used to perform character density feature extraction processing for text image data (for example, each sample text image); and the embodiment of the present application does not limit the network structure of the "density feature extraction network"; for example, it can be implemented using the model structure of the above "density feature extraction model".
  • the color feature extraction network is used to perform color distribution feature extraction processing for text image data (for example, each sample text image); and the embodiment of the present application does not limit the network structure of the "color feature extraction network"; for example, it can be implemented using the model structure of the above "color feature extraction model".
  • the position feature extraction network is used to perform image position feature extraction processing on the position description information of text image data (for example, each sample text image); and the embodiment of the present application does not limit the network structure of the "position feature extraction network"; for example, it can be implemented using the model structure of the above "location feature extraction model".
  • feature splicing network is used to concatenate the input data of the feature splicing network; and the embodiment of the present application does not limit the working principle of the "feature splicing network”.
  • the working principle of the "feature splicing network" may specifically include: first, the language extraction feature of the kth sample text image, the character density feature of the kth sample text image, the color distribution feature of the kth sample text image, and the image position feature of the kth sample text image are spliced to obtain the image extraction feature of the kth sample text image, where k is a positive integer, k ≤ K, and K is a positive integer; then the image extraction features of the first sample text image to the image extraction features of the Kth sample text image are spliced to obtain the language representation data of the sample image to be used (that is, the output of the above "feature splicing network").
  • the image language recognition network is used to perform language recognition processing on the input data of the image language recognition network; and the embodiment of the present application does not limit the network structure of the "image language recognition network"; for example, it can be implemented using the model structure of the above "image language recognition model".
  • In this way, the at least one sample text image and its position description information can be input into the model to be trained, so that the model to be trained refers to the at least one sample text image and its position description information to perform image language recognition processing, and obtains and outputs a language recognition result of the sample image to be used.
  • Step 14: Judge whether the preset stop condition is met; if yes, execute step 16; if not, execute step 15.
  • the "preset stop condition" can be preset; and the embodiment of the present application does not limit the "preset stop condition"; for example, it can specifically be that the loss value of the model to be trained is lower than a first threshold, that the rate of change of the loss value of the model to be trained is lower than a second threshold (that is, the image language recognition performance of the model to be trained reaches convergence), or that the number of updates of the model to be trained reaches a third threshold.
  • the above "loss value of the model to be trained" is used to represent the image language recognition performance of the model to be trained; and the embodiment of the present application does not limit the determination process of the "loss value of the model to be trained"; it can be implemented using any existing or future method for determining a model loss value.
  • Step 15: Update the model to be trained according to the language recognition result of the sample image to be used and the actual language of the sample image to be used, and return to step 13.
  • If the preset stop condition is not met, it means that the image language recognition performance of the model to be trained is still relatively poor, so the model to be trained can be updated based on the difference between the language recognition result of the sample image to be used and the actual language of the sample image to be used, so that the updated model to be trained has better image language recognition performance; then step 13 and its subsequent steps are performed again based on the updated model to be trained, to implement a new round of the training process for the model to be trained.
  • Step 16: Determine the image language recognition model according to the model to be trained.
  • If the preset stop condition is met, the image language recognition model can be determined according to the model to be trained (for example, the image language recognition network in the model to be trained can be directly determined as the image language recognition model).
  • step 16 may specifically include: determining the language feature extraction network in the model to be trained as the language feature extraction model; determining the density feature extraction network in the model to be trained as the density feature extraction model; determining the color feature extraction network in the model to be trained as the color feature extraction model; determining the position feature extraction network in the model to be trained as the position feature extraction model; and determining the image language recognition network in the model to be trained as the image language recognition model.
  • In this way, a language feature extraction model, a density feature extraction model, a color feature extraction model, a location feature extraction model, and an image language recognition model can be constructed by means of one model training process, so that the image language recognition method implemented based on these five models has a better image language recognition effect; a sketch of such a training loop follows.
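  • Steps 11 to 16 amount to a standard supervised training loop; the sketch below is one possible reading, in which the loss function (cross entropy), the thresholds, and the call signature of the model to be trained are assumptions rather than details fixed by the text:

```python
import torch
import torch.nn as nn

def train_model_to_be_trained(model: nn.Module,
                              samples: list,
                              optimizer: torch.optim.Optimizer,
                              loss_threshold: float = 1e-3,
                              max_updates: int = 100) -> nn.Module:
    """`samples` is assumed to yield (sample_text_images, position_info,
    actual_language_label) tuples of batched tensors."""
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_updates):  # stop condition: number of updates reaches a threshold
        total_loss = 0.0
        for text_images, position_info, label in samples:
            logits = model(text_images, position_info)   # step 13: recognition result
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()                               # step 15: update the model
            optimizer.step()
            total_loss += loss.item()
        if total_loss / len(samples) < loss_threshold:    # step 14: loss below threshold
            break
    return model  # step 16: the image language recognition model is determined from it
```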
  • the embodiment of the present application also provides another possible implementation of the model building method.
  • the model building method may further include steps 17 to 21:
  • Step 17: Use the first text image and the actual language features of the first text image to train the first model, so that the trained first model has a better language feature extraction effect.
  • Step 18: Use the second text image and the actual density features of the second text image to train the second model, so that the trained second model has a better character density feature extraction effect.
  • Step 19: Use the third text image and the actual color features of the third text image to train the third model, so that the trained third model has a better color distribution feature extraction effect.
  • Step 20: Use the position description information of the fourth text image and the actual position feature of the fourth text image to train the fourth model, so that the trained fourth model has a better image position feature extraction effect.
  • Step 21: Use the trained first model, the trained second model, the trained third model, and the trained fourth model to respectively initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained.
  • step 21 may specifically include: determining the trained first model as the initialization processing result of the language feature extraction network in the model to be trained; determining the trained second model as the initialization processing result of the density feature extraction network in the model to be trained; determining the trained third model as the initialization processing result of the color feature extraction network in the model to be trained; and determining the trained fourth model as the initialization processing result of the position feature extraction network in the model to be trained.
  • Based on this, the first model to the fourth model can be trained respectively; the trained first to fourth models are then used to initialize the language feature extraction network, density feature extraction network, color feature extraction network, and position feature extraction network in the model to be trained, to obtain an initialized model to be trained; then, the initialized model to be trained is trained using the above steps 11 to 15 to obtain a trained model to be trained; finally, a language feature extraction model, a density feature extraction model, a color feature extraction model, a location feature extraction model, and an image language recognition model are determined from the trained model to be trained, as in the sketch below.
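  • A minimal sketch of the initialization in step 21, assuming the model to be trained exposes its four sub-networks under the hypothetical attribute names used below:

```python
import torch.nn as nn

def initialize_model_to_be_trained(model_to_be_trained: nn.Module,
                                   first_model: nn.Module,
                                   second_model: nn.Module,
                                   third_model: nn.Module,
                                   fourth_model: nn.Module) -> nn.Module:
    """Copy the trained first to fourth models into the corresponding
    sub-networks of the model to be trained (attribute names are assumed)."""
    model_to_be_trained.language_feature_network.load_state_dict(first_model.state_dict())
    model_to_be_trained.density_feature_network.load_state_dict(second_model.state_dict())
    model_to_be_trained.color_feature_network.load_state_dict(third_model.state_dict())
    model_to_be_trained.position_feature_network.load_state_dict(fourth_model.state_dict())
    return model_to_be_trained
```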
  • the embodiment of the present application also provides an image language recognition device, which will be explained and described below with reference to the accompanying drawings.
  • FIG. 8 is a schematic structural diagram of an image language recognition device provided by an embodiment of the present application.
  • the image language recognition device 800 provided in the embodiment of the present application includes:
  • An image extraction unit 801 configured to extract N text images to be used from the image to be processed according to the text detection result of the image to be processed after acquiring the image to be processed; wherein, N is a positive integer;
  • a feature determining unit 802 configured to determine the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used; wherein, n is a positive integer, n ≤ N;
  • a feature processing unit 803, configured to determine the image extraction feature of the nth text image to be used according to the language extraction feature of the nth text image to be used and the visual extraction feature of the nth text image to be used; wherein, n is a positive integer, n ≤ N;
  • the language recognition unit 804 is configured to determine the language recognition result of the image to be processed according to the image extraction features of the N text images to be used.
  • the visual extraction features include at least one of character density features, color distribution features, and image position features.
  • the feature determining unit 802 includes:
  • the first determination subunit is configured to input the nth text image to be used into a pre-built density feature extraction model, and obtain the character density feature of the nth text image to be used output by the density feature extraction model;
  • the second determination subunit is configured to input the nth text image to be used into a pre-built color feature extraction model, and obtain the color distribution characteristics of the nth text image to be used output by the color feature extraction model;
  • the third determination subunit is configured to input the position description information of the nth text image to be used into a pre-built position feature extraction model, and obtain the image position feature of the nth text image to be used output by the position feature extraction model.
  • the feature determining unit 802 includes:
  • the fourth determining subunit is configured to input the nth text image to be used into a pre-built language feature extraction model, and obtain the language extraction features of the nth text image to be used output by the language feature extraction model.
  • the visual extraction features include character density features, color distribution features, and image position features
  • the feature processing unit 803 is specifically configured to: splice the language extraction feature of the nth text image to be used, the character density feature of the nth text image to be used, the color distribution feature of the nth text image to be used, and the image position feature of the nth text image to be used to obtain the image extraction feature of the nth text image to be used.
  • the language identification unit 804 is specifically configured to: concatenate the image extraction features of the N text images to be used to obtain language representation data of the image to be processed;
  • the language representation data is input into the pre-built image language recognition model, and the language recognition result of the image to be processed outputted by the image language recognition model is obtained.
  • the image language recognition device 800 further includes:
  • a model training unit configured to acquire the sample image to be used and the actual language of the sample image to be used; determine at least one sample text image and the position description information of the at least one sample text image according to the text detection result of the sample image to be used; input the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, and obtain the language recognition result of the sample image to be used output by the model to be trained; and update the model to be trained according to the language recognition result of the sample image to be used and the actual language of the sample image to be used, continuing the step of inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained.
  • the model to be trained includes a language feature extraction network, a density feature extraction network, a color feature extraction network, a location feature extraction network, a feature splicing network, and an image language recognition network; wherein, the image The input data of the language recognition network includes the output data of the feature splicing network; the input data of the feature splicing network includes the output data of the language feature extraction network, the output data of the density feature extraction network, the color feature extraction The output data of the network, and the output data of the location feature extraction network;
  • the process of determining the image language recognition model includes: determining the image language recognition network in the model to be trained as the image language recognition model.
  • the image language recognition device 800 further includes:
  • a model initialization unit configured to use the first text image and the actual language features of the first text image to train the first model; use the second text image and the actual density features of the second text image to train the second model; use the third text image and the actual color features of the third text image to train the third model; use the position description information of the fourth text image and the actual position feature of the fourth text image to train the fourth model; and use the trained first model, the trained second model, the trained third model, and the trained fourth model to respectively initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained.
  • the image language recognition device 800, after acquiring the image to be processed, first extracts N text images to be used from the image to be processed according to the text detection result of the image to be processed; it then determines the language extraction feature and the visual extraction feature of the nth text image to be used, determines the image extraction feature of the nth text image to be used from them, and finally determines the language recognition result of the image to be processed according to the image extraction features of the N text images to be used, so that the language recognition result can accurately represent the language to which the image to be processed belongs.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store computer programs
  • the processor is configured to execute any implementation of the image language recognition method provided in the embodiments of the present application according to the computer program.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any of the image language recognition methods provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer program product, which, when running on the terminal device, enables the terminal device to execute any implementation manner of the image language recognition method provided in the embodiment of the present application.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" is used to describe the association relationship of associated objects, indicating that three kinds of relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the associated objects are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • "At least one item (piece) of a, b, or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.


Abstract

An image language identification method and a related device thereof. The method comprises: after an image to be processed is obtained, extracting N text images to be used from the image according to a text detection result of the image; then determining a language extraction feature and a visual extraction feature of the n-th text image; then determining an image extraction feature of the n-th text image according to the language extraction feature and the visual extraction feature of the n-th text image, wherein n is a positive integer, n ≤ N, and N is a positive integer; and finally determining a language identification result of the image according to the image extraction features of the N text images, so that the language identification result can accurately represent a language to which the image belongs.

Description

Image language recognition method and related equipment
This application claims priority to the Chinese patent application No. 202111138638.8, entitled "Image language recognition method and related equipment" and filed with the China Patent Office on September 27, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of image processing, and in particular to an image language recognition method and related equipment.
Background
In some application scenarios, it is necessary to determine the language to which image data carrying character information belongs. For example, if image data carries a large number of Chinese characters, the language of that image data is Chinese; if image data carries a large number of English words, the language of that image data is English; and so on.
However, how to identify the language to which image data belongs is a technical problem that urgently needs to be solved.
Summary
In order to solve the above technical problem, the present application provides an image language recognition method and related equipment, which can accurately identify the language to which image data belongs.
In order to achieve the above object, the technical solutions provided in the embodiments of the present application are as follows:
An embodiment of the present application provides an image language recognition method, the method including:
after an image to be processed is acquired, extracting N text images to be used from the image to be processed according to a text detection result of the image to be processed, where N is a positive integer;
determining a language extraction feature of the n-th text image to be used and a visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N;
determining an image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N;
determining a language recognition result of the image to be processed according to the image extraction features of the N text images to be used.
In a possible implementation, the visual extraction feature includes at least one of a character density feature, a color distribution feature, and an image position feature.
In a possible implementation, the process of determining the character density feature of the n-th text image to be used includes:
inputting the n-th text image to be used into a pre-built density feature extraction model to obtain the character density feature of the n-th text image to be used output by the density feature extraction model;
the process of determining the color distribution feature of the n-th text image to be used includes:
inputting the n-th text image to be used into a pre-built color feature extraction model to obtain the color distribution feature of the n-th text image to be used output by the color feature extraction model;
the process of determining the image position feature of the n-th text image to be used includes:
inputting the position description information of the n-th text image to be used into a pre-built position feature extraction model to obtain the image position feature of the n-th text image to be used output by the position feature extraction model.
In a possible implementation, the process of determining the language extraction feature of the n-th text image to be used includes:
inputting the n-th text image to be used into a pre-built language feature extraction model to obtain the language extraction feature of the n-th text image to be used output by the language feature extraction model.
In a possible implementation, the visual extraction feature includes a character density feature, a color distribution feature, and an image position feature;
the determining the image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used includes:
splicing the language extraction feature of the n-th text image to be used, the character density feature of the n-th text image to be used, the color distribution feature of the n-th text image to be used, and the image position feature of the n-th text image to be used to obtain the image extraction feature of the n-th text image to be used.
In a possible implementation, the determining the language recognition result of the image to be processed according to the image extraction features of the N text images to be used includes:
splicing the image extraction features of the N text images to be used to obtain language representation data of the image to be processed;
inputting the language representation data into a pre-built image language recognition model to obtain the language recognition result of the image to be processed output by the image language recognition model.
In a possible implementation, the construction process of the image language recognition model includes:
acquiring a sample image to be used and the actual language of the sample image to be used;
determining at least one sample text image and position description information of the at least one sample text image according to the text detection result of the sample image to be used;
inputting the at least one sample text image and the position description information of the at least one sample text image into a model to be trained to obtain the language recognition result of the sample image to be used output by the model to be trained;
updating the model to be trained according to the language recognition result of the sample image to be used and the actual language of the sample image to be used, and continuing to execute the step of inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained until, after a preset stop condition is reached, the image language recognition model is determined according to the model to be trained.
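As an illustration of this construction process, the following is a minimal PyTorch sketch under stated assumptions, not the patent's prescribed implementation: the joint model `model_to_train`, the data loader interface, the Adam optimizer, and the cross-entropy criterion are all illustrative choices.

    import torch
    import torch.nn as nn

    def build_image_language_recognition_model(model_to_train, dataloader, max_steps=10000):
        # dataloader is assumed to yield, per sample image to be used:
        # (sample_text_images, position_info, actual_language_id).
        optimizer = torch.optim.Adam(model_to_train.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()  # stands in for the unspecified update criterion
        step = 0
        for text_images, position_info, actual_language in dataloader:
            logits = model_to_train(text_images, position_info)  # language recognition result
            loss = criterion(logits, actual_language)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:  # one possible preset stop condition
                break
        return model_to_train  # the image language recognition model is derived from this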
In a possible implementation, the model to be trained includes a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature splicing network, and an image language recognition network, where the input data of the image language recognition network includes the output data of the feature splicing network, and the input data of the feature splicing network includes the output data of the language feature extraction network, the output data of the density feature extraction network, the output data of the color feature extraction network, and the output data of the position feature extraction network;
the determining the image language recognition model according to the model to be trained includes:
determining the image language recognition network in the model to be trained as the image language recognition model.
In a possible implementation, before the inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, the construction process of the image language recognition model further includes:
training a first model using a first text image and actual language features of the first text image;
training a second model using a second text image and actual density features of the second text image;
training a third model using a third text image and actual color features of the third text image;
training a fourth model using position description information of a fourth text image and actual position features of the fourth text image;
using the trained first model, the trained second model, the trained third model, and the trained fourth model to respectively initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained.
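A minimal sketch of this initialization scheme, assuming the model to be trained exposes its four sub-networks as attributes (the attribute names below are hypothetical; the patent does not prescribe an API) and that each sub-network shares its architecture with the corresponding pre-trained model:

    def init_model_to_train(model_to_train, lang_model, density_model, color_model, position_model):
        # Copy the weights of the four separately trained models into the
        # corresponding feature extraction networks of the joint model.
        model_to_train.language_net.load_state_dict(lang_model.state_dict())
        model_to_train.density_net.load_state_dict(density_model.state_dict())
        model_to_train.color_net.load_state_dict(color_model.state_dict())
        model_to_train.position_net.load_state_dict(position_model.state_dict())
        return model_to_train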
An embodiment of the present application further provides an image language recognition device, including:
an image extraction unit, configured to extract, after an image to be processed is acquired, N text images to be used from the image to be processed according to the text detection result of the image to be processed, where N is a positive integer;
a feature determination unit, configured to determine the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N;
a feature processing unit, configured to determine the image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N;
a language recognition unit, configured to determine the language recognition result of the image to be processed according to the image extraction features of the N text images to be used.
An embodiment of the present application further provides a device, the device including a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, any implementation of the image language recognition method provided in the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium, the computer-readable storage medium being configured to store a computer program, and the computer program being used to execute any implementation of the image language recognition method provided in the embodiments of the present application.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, enables the terminal device to execute any implementation of the image language recognition method provided in the embodiments of the present application.
Compared with the prior art, the embodiments of the present application have at least the following advantages:
In the technical solution provided by the embodiments of the present application, after the image to be processed is acquired, N text images to be used are first extracted from the image to be processed according to the text detection result of the image to be processed; the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used are then determined, where n is a positive integer, n ≤ N, and N is a positive integer; next, the image extraction feature of the n-th text image to be used is determined according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used; finally, the language recognition result of the image to be processed is determined according to the image extraction features of the N text images to be used, so that the language recognition result can accurately represent the language to which the image to be processed belongs, thereby accurately identifying the language to which image data belongs.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings to be used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
Fig. 1 is a schematic diagram of image data provided by an embodiment of the present application;
Fig. 2 is a flow chart of an image language recognition method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of text regions provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a language feature extraction model provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a density feature extraction model provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a color feature extraction model provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an image language recognition model provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an image language recognition device provided by an embodiment of the present application.
Detailed Description
In the research on the above "language to which image data belongs", the inventors found that, for a piece of image data (such as the image data shown in Fig. 1), the language to which the image data belongs can be determined according to the language of the large amount of character information carried by the image data. For ease of understanding, the following description is given with an example.
As an example, although the image data shown in Fig. 1 carries both character information belonging to Vietnamese and character information belonging to English, since the amount of character information belonging to Vietnamese is far greater than the amount of character information belonging to English, it can be determined that the language to which the image data belongs is Vietnamese.
Based on the above findings, in order to solve the technical problem in the background section, an embodiment of the present application provides an image language recognition method. The method includes: after an image to be processed is acquired, first extracting N text images to be used from the image to be processed according to the text detection result of the image to be processed; then determining the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, where n is a positive integer, n ≤ N, and N is a positive integer; then determining the image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used; and finally determining the language recognition result of the image to be processed according to the image extraction features of the N text images to be used, so that the language recognition result can accurately represent the language to which the image to be processed belongs, thereby accurately identifying the language to which image data belongs.
In addition, the embodiments of the present application do not limit the execution subject of the image language recognition method. For example, the image language recognition method provided in the embodiments of the present application can be applied to a data processing device such as a terminal device or a server. The terminal device may be a smart phone, a computer, a personal digital assistant (PDA), a tablet computer, or the like. The server may be an independent server, a cluster server, or a cloud server.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.
Method Embodiment 1
Referring to Fig. 2, this figure is a flow chart of an image language recognition method provided by an embodiment of the present application.
The image language recognition method provided by the embodiment of the present application includes S1-S5:
S1: after the image to be processed is acquired, extract N text images to be used from the image to be processed according to the text detection result of the image to be processed, where N is a positive integer.
The "image to be processed" refers to image data on which image language recognition processing needs to be performed (for example, the image data shown in Fig. 1), and the "image to be processed" includes character information in at least one language.
The "text detection result of the image to be processed" is used to indicate the position of at least one text region in the image to be processed. For example, as shown in Fig. 3, when the image to be processed is the image data shown in Fig. 1, the text detection result of the image to be processed may include the position description data of the first text region, the position description data of the second text region, ..., and the position description data of the fifth text region, where the "position description data of the first text region" indicates the position of the first text region in the image data shown in Fig. 1, the "position description data of the second text region" indicates the position of the second text region in the image data shown in Fig. 1, and so on, with the "position description data of the fifth text region" indicating the position of the fifth text region in the image data shown in Fig. 1.
It should be noted that the embodiments of the present application do not limit the representation of the above "position description data"; for example, it can be represented by the coordinates of the four vertices of a text region.
In addition, the embodiments of the present application do not limit the determination process of the above "text detection result of the image to be processed". For example, it may specifically be: inputting the image to be processed into a pre-built text detection model to obtain the text detection result of the image to be processed output by the text detection model.
The "text detection model" is used to perform text position detection processing on the input data of the text detection model. The embodiments of the present application do not limit the "text detection model", which can be implemented by any machine learning model (for example, a deep learning model based on a convolutional neural network).
The above "text detection model" can be constructed according to a first sample image and the actual text positions of the first sample image, where the "actual text positions of the first sample image" indicate the positions at which all text regions in the first sample image are actually located in the first sample image. The embodiments of the present application do not limit the manner of acquiring the "actual text positions of the first sample image"; for example, it can be obtained by manual annotation.
The n-th text image to be used represents the image information carried by the n-th text region in the image to be processed. The embodiments of the present application do not limit the determination process of the "n-th text image to be used". For example, when the above "text detection result of the image to be processed" includes the position description data of the n-th text region, image cropping processing can be performed on the image to be processed according to the position description data of the n-th text region to obtain the n-th text image to be used, so that the n-th text image to be used includes the n-th text region, where n is a positive integer and n ≤ N.
Based on the above content related to S1, after the image to be processed is acquired, text detection processing can first be performed on the image to be processed to obtain the text detection result of the image to be processed, so that the text detection result can indicate the position of at least one text region in the image to be processed; then at least one text image to be used is extracted from the image to be processed according to the text detection result, so that each text image to be used includes a respective text region, and thus each text image to be used can represent the image information carried by its text region.
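As a hedged illustration of S1, the following sketch crops the N text images to be used from the image to be processed, assuming the text detection result is given as four vertex coordinates per text region (the `text_detector` callable is a hypothetical stand-in for the pre-built text detection model):

    from PIL import Image

    def extract_text_images(image_path, text_detector):
        image = Image.open(image_path)
        # Each detection is assumed to be the four vertex coordinates of a
        # text region: [(x1, y1), (x2, y2), (x3, y3), (x4, y4)].
        regions = text_detector(image)
        text_images = []
        for vertices in regions:
            xs = [x for x, _ in vertices]
            ys = [y for _, y in vertices]
            # Crop the axis-aligned bounding rectangle of the text region.
            text_images.append(image.crop((min(xs), min(ys), max(xs), max(ys))))
        return text_images  # the N text images to be used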
S2: determine the language extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N.
The "language extraction feature of the n-th text image to be used" represents the language information carried by the n-th text region in the image to be processed.
In addition, the embodiments of the present application do not limit the implementation of S2. For example, it may specifically include: inputting the n-th text image to be used into a pre-built language feature extraction model to obtain the language extraction feature of the n-th text image to be used output by the language feature extraction model.
The above "language feature extraction model" is used to perform language feature extraction processing on the input data of the language feature extraction model. The embodiments of the present application do not limit the "language feature extraction model"; for example, any machine learning model (for example, a deep learning model based on a self-attention neural network) can be used for implementation.
In addition, the above "language feature extraction model" can be constructed according to a first text image and the actual language features of the first text image, where the "actual language features of the first text image" represent the language information actually carried by the first text image. The embodiments of the present application do not limit the manner of acquiring the "actual language features of the first text image"; for example, it can be obtained by manual annotation.
It should be noted that the embodiments of the present application do not limit the construction process of the above "language feature extraction model". For example, it can be implemented by any existing or future machine learning model construction method; as another example, it can be implemented by the model construction method shown in Method Embodiment 2.
In addition, in order to improve the extraction effect of language features, an embodiment of the present application further provides a possible implementation of the "language feature extraction model", which may specifically include: an image feature extraction layer, a position encoding layer, a first feature fusion layer, a first feature encoding layer, and a first linear processing layer, where the input data of the first linear processing layer includes the output data of the first feature encoding layer, the input data of the first feature encoding layer includes the output data of the first feature fusion layer, and the input data of the first feature fusion layer includes the output data of the image feature extraction layer and the output data of the position encoding layer.
The above "image feature extraction layer" is used to perform image feature extraction processing on a text image (for example, the n-th text image to be used). The embodiments of the present application do not limit the implementation of the "image feature extraction layer"; for example, it can be implemented by the convolutional neural network (Convolutional Neural Network, CNN) shown in Fig. 4.
The above "position encoding layer" is used to perform position encoding processing on a text image. The embodiments of the present application do not limit the implementation of the "position encoding layer"; for example, any position encoding method (for example, the Positional Encoding module in the transformer model) can be used.
The above "first feature fusion layer" is used to perform feature fusion processing on the input data of the first feature fusion layer (for example, the addition processing shown in Fig. 4). The embodiments of the present application do not limit the implementation of the "first feature fusion layer"; for example, any feature fusion processing method (for example, the feature fusion processing involved in the transformer model) can be used.
The above "first feature encoding layer" is used to perform encoding processing on the input data of the first feature encoding layer. The embodiments of the present application do not limit the "first feature encoding layer"; for example, it can be implemented by the L1 first encoding networks shown in Fig. 4 (e.g., the Encoder module in the transformer model), where L1 is a positive integer.
The above "first linear processing layer" is used to perform linear processing on the input data of the first linear processing layer. The embodiments of the present application do not limit the implementation of the "first linear processing layer"; for example, any linear processing method (for example, the linear module in the transformer model) can be used.
It should be noted that, for the language feature extraction model shown in Fig. 4, "CNN" represents the above "image feature extraction layer"; "Positional Encoding" represents the above "position encoding layer"; "+" represents the above "first feature fusion layer"; "Multi-head attention" refers to a multi-head self-attention network; "Add & norm" refers to feature addition processing and feature normalization processing; "Feed forward" refers to a feed-forward neural network; and "L1" indicates the number of first encoding networks.
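The following is a minimal PyTorch sketch of the Fig. 4 structure, under illustrative assumptions: a small CNN stands in for the image feature extraction layer, a learned positional embedding stands in for the Positional Encoding module, L1 = 2 encoder blocks are chosen arbitrarily, and mean pooling plus a linear layer produce one 1×512 language extraction feature per text image; none of these sizes are prescribed by the patent.

    import torch
    import torch.nn as nn

    class LanguageFeatureModel(nn.Module):
        def __init__(self, d_model=512, num_layers=2, num_heads=8, max_positions=1024):
            super().__init__()
            # Image feature extraction layer (illustrative CNN backbone).
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Position encoding layer (learned embedding over flattened positions).
            self.pos_embed = nn.Parameter(torch.zeros(1, max_positions, d_model))
            # First feature encoding layer: L1 stacked encoder blocks.
            layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            # First linear processing layer.
            self.linear = nn.Linear(d_model, d_model)

        def forward(self, x):                                  # x: (B, 3, H, W)
            feats = self.cnn(x)                                # (B, d_model, H', W')
            feats = feats.flatten(2).transpose(1, 2)           # (B, H'*W', d_model)
            feats = feats + self.pos_embed[:, :feats.size(1)]  # first feature fusion (addition)
            feats = self.encoder(feats)
            return self.linear(feats.mean(dim=1))              # one 1x512 language extraction feature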
Based on the above content related to S2, after the n-th text image to be used is acquired, language feature extraction processing can be performed on the n-th text image to be used to obtain the language extraction feature of the n-th text image to be used, so that the language extraction feature can represent the language information carried by the n-th text region in the image to be processed, where n is a positive integer and n ≤ N.
S3: determine the visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N.
The "visual extraction feature of the n-th text image to be used" represents the image feature information carried by the n-th text region in the image to be processed (for example, character density, color distribution, and position distribution in the image to be processed).
In addition, the embodiments of the present application do not limit the above "visual extraction feature of the n-th text image to be used". For example, it may specifically include at least one of the character density feature of the n-th text image to be used, the color distribution feature of the n-th text image to be used, and the image position feature of the n-th text image to be used.
The "character density feature of the n-th text image to be used" represents the character distribution density in the n-th text image to be used. The embodiments of the present application do not limit the determination process of the "character density feature of the n-th text image to be used". For example, it may specifically include: inputting the n-th text image to be used into a pre-built density feature extraction model to obtain the character density feature of the n-th text image to be used output by the density feature extraction model.
The above "density feature extraction model" is used to perform character density feature extraction processing on the input data of the density feature extraction model. The embodiments of the present application do not limit the "density feature extraction model"; for example, any machine learning model (for example, a deep learning model based on a self-attention neural network) can be used.
In addition, the above "density feature extraction model" can be constructed according to a second text image and the actual density features of the second text image, where the "actual density features of the second text image" represent the actual character distribution density in the second text image. The embodiments of the present application do not limit the manner of acquiring the "actual density features of the second text image"; for example, it can be obtained by manual annotation.
It should be noted that the embodiments of the present application do not limit the construction process of the above "density feature extraction model". For example, it can be implemented by any existing or future machine learning model construction method; as another example, it can be implemented by the model construction method shown in Method Embodiment 2. In addition, the embodiments of the present application do not limit the relationship between the above "second text image" and the above "first text image"; for example, the two may refer to the same text image data or different text image data.
In addition, in order to improve the extraction effect of character density features, an embodiment of the present application further provides a possible implementation of the "density feature extraction model", which may specifically include: an image feature extraction layer and L2 second encoding networks, where L2 is a positive integer. The embodiments of the present application do not limit L2; for example, as shown in Fig. 5, L2 may specifically be 4.
It should be noted that, for the relevant content of the "image feature extraction layer", please refer to the relevant content of the "image feature extraction layer" above. In addition, the embodiments of the present application do not limit the above "second encoding network", and any existing or future encoding network (for example, the Encoder module in the transformer model or the Encoder module in the conformer model) can be used for implementation.
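As a hedged sketch of this structure (an image feature extraction layer followed by a stack of second encoding networks), the following PyTorch module parameterizes the number of encoder blocks so it can also serve for Fig. 6 below; the feature sizes and the mean pooling to a 1×512 vector are illustrative assumptions:

    import torch.nn as nn

    class EncoderFeatureModel(nn.Module):
        """Image feature extraction layer + a stack of 'second encoding networks'."""
        def __init__(self, num_layers, d_model=512, num_heads=8):
            super().__init__()
            # Image feature extraction layer (illustrative CNN backbone).
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            )
            # num_layers stacked encoder blocks (transformer Encoder modules).
            layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)

        def forward(self, x):                               # x: (B, 3, H, W)
            feats = self.cnn(x).flatten(2).transpose(1, 2)  # (B, H'*W', d_model)
            return self.encoder(feats).mean(dim=1)          # one 1x512 feature per image

    density_model = EncoderFeatureModel(num_layers=4)  # L2 = 4, as in Fig. 5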
The above "color distribution feature of the n-th text image to be used" represents the color distribution state in the n-th text image to be used (in particular, the difference between the character color and the background color). The embodiments of the present application do not limit the determination process of the "color distribution feature of the n-th text image to be used". For example, it may specifically include: inputting the n-th text image to be used into a pre-built color feature extraction model to obtain the color distribution feature of the n-th text image to be used output by the color feature extraction model.
The above "color feature extraction model" is used to perform color distribution feature extraction processing on the input data of the color feature extraction model. The embodiments of the present application do not limit the "color feature extraction model"; for example, any machine learning model (for example, a deep learning model based on a convolutional neural network) can be used.
In addition, the above "color feature extraction model" can be constructed according to a third text image and the actual color features of the third text image, where the "actual color features of the third text image" represent the actual color distribution state in the third text image. The embodiments of the present application do not limit the manner of acquiring the "actual color features of the third text image"; for example, it can be obtained by manual annotation.
It should be noted that the embodiments of the present application do not limit the construction process of the above "color feature extraction model". For example, it can be implemented by any existing or future machine learning model construction method; as another example, it can be implemented by the model construction method shown in Method Embodiment 2. In addition, the embodiments of the present application do not limit the relationship among the above "third text image", "second text image", and "first text image"; for example, the three may refer to the same text image data or different text image data.
In addition, in order to improve the extraction effect of color distribution features, an embodiment of the present application further provides a possible implementation of the "color feature extraction model", which may specifically include: an image feature extraction layer and L3 second encoding networks, where L3 is a positive integer. The embodiments of the present application do not limit L3; for example, as shown in Fig. 6, L3 may specifically be 2.
It should be noted that, for the relevant content of the "image feature extraction layer" and the "second encoding network", please refer to the relevant content above.
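Under the same assumptions, the Fig. 6 structure differs from Fig. 5 only in the number of encoder blocks, so the `EncoderFeatureModel` sketch given above can be reused:

    color_model = EncoderFeatureModel(num_layers=2)  # L3 = 2, as in Fig. 6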
The above "image position feature of the n-th text image to be used" refers to the position distribution state of the character information in the n-th text image to be used within the above "image to be processed". The embodiments of the present application do not limit the determination process of the "image position feature of the n-th text image to be used". For example, it may specifically include: inputting the position description information of the n-th text image to be used into a pre-built position feature extraction model to obtain the image position feature of the n-th text image to be used output by the position feature extraction model.
The "position description information of the n-th text image to be used" describes the position of the character information in the n-th text image to be used within the above "image to be processed". The embodiments of the present application do not limit the determination process of the "position description information of the n-th text image to be used". For example, when the above "n-th text image to be used" represents the image information carried by the n-th text region in the image to be processed, the position description information of the n-th text region in the image to be processed can be determined as the position description information of the n-th text image to be used.
The above "position feature extraction model" is used to perform image position feature extraction processing on the input data of the position feature extraction model. The embodiments of the present application do not limit the "position feature extraction model"; for example, any machine learning model (for example, a machine learning model based on fully connected layers) can be used. As another example, the "position feature extraction model" may include two fully connected layers.
In addition, the above "position feature extraction model" can be constructed according to the position description information of a fourth text image and the actual position features of the fourth text image.
The "position description information of the fourth text image" indicates the position of the character information in the fourth text image within the sample image to be processed, and the fourth text image is obtained by performing image cropping processing on the sample image to be processed.
The "actual position features of the fourth text image" represent the actual position distribution state of the character information in the fourth text image within the sample image to be processed. The embodiments of the present application do not limit the manner of acquiring the "actual position features of the fourth text image"; for example, it can be obtained by manual annotation.
It should be noted that the embodiments of the present application do not limit the construction process of the above "position feature extraction model". For example, it can be implemented by any existing or future machine learning model construction method; as another example, it can be implemented by the model construction method shown in Method Embodiment 2. In addition, the embodiments of the present application do not limit the relationship among the above "fourth text image", "third text image", "second text image", and "first text image"; for example, the four may refer to the same text image data or different text image data.
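A minimal sketch of a position feature extraction model with two fully connected layers, assuming the position description information is the four vertex coordinates (eight numbers) of a text region and that the output is a 1×512 image position feature:

    import torch.nn as nn

    class PositionFeatureModel(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            # Two fully connected layers mapping 8 coordinates to a 1x512 feature.
            self.fc = nn.Sequential(
                nn.Linear(8, d_model), nn.ReLU(),
                nn.Linear(d_model, d_model),
            )

        def forward(self, position_info):  # position_info: (B, 8), four (x, y) vertices
            return self.fc(position_info)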
Based on the above content related to S3, after the n-th text image to be used is acquired, preset visual feature extraction processing (for example, character density feature extraction processing, color distribution feature extraction processing, and image position feature extraction processing) can be performed on the n-th text image to be used to obtain the visual extraction feature of the n-th text image to be used, so that the visual extraction feature can represent the carried image feature information (for example, character density, color distribution, and position distribution in the image to be processed), where n is a positive integer and n ≤ N.
S4: determine the image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N.
The "image extraction feature of the n-th text image to be used" represents the image information carried by the n-th text image to be used (for example, character language, character density, color distribution, and position distribution in the image to be processed), so that the "image extraction feature of the n-th text image to be used" can accurately represent the image information carried by the n-th text region in the image to be processed.
In addition, the embodiments of the present application do not limit the implementation of S4. For example, when the above "visual extraction feature" includes a character density feature, a color distribution feature, and an image position feature, S4 may specifically include: splicing the language extraction feature of the n-th text image to be used, the character density feature of the n-th text image to be used, the color distribution feature of the n-th text image to be used, and the image position feature of the n-th text image to be used to obtain the image extraction feature of the n-th text image to be used.
It should be noted that the embodiments of the present application do not limit the implementation of the above "splicing". For example, when the language extraction feature, the character density feature, the color distribution feature, and the image position feature of the n-th text image to be used are all 1×512 feature vectors, the image extraction feature of the n-th text image to be used may be a 4×512 feature vector.
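Under the 1×512 assumption in the preceding paragraph, the splicing can be illustrated as follows (the random tensors merely stand in for real features):

    import torch

    lang_f = torch.randn(512)     # language extraction feature, 1x512
    density_f = torch.randn(512)  # character density feature, 1x512
    color_f = torch.randn(512)    # color distribution feature, 1x512
    pos_f = torch.randn(512)      # image position feature, 1x512

    image_feature = torch.stack([lang_f, density_f, color_f, pos_f])
    print(image_feature.shape)    # torch.Size([4, 512])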
Based on the above content related to S4, after the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used are acquired, the image extraction feature of the n-th text image to be used can be determined with reference to these two extracted features, so that the image extraction feature can accurately represent the image information carried by the n-th text region in the image to be processed (for example, character language, character density, color distribution, and position distribution in the image to be processed).
S5: determine the language recognition result of the image to be processed according to the image extraction features of the N text images to be used.
The "language recognition result of the image to be processed" indicates the language to which the image to be processed belongs, so that the "language recognition result of the image to be processed" can accurately represent the language to which most of the character information in the image to be processed belongs (for example, Vietnamese as shown in Fig. 1).
In addition, the embodiments of the present application do not limit the implementation of S5. For example, it may specifically include S51-S52:
S51: splice the image extraction features of the N text images to be used to obtain the language representation data of the image to be processed.
The "language representation data of the image to be processed" represents the distribution characteristics of at least one language in the image to be processed (for example, distribution range and distribution position).
In addition, the embodiments of the present application do not limit the implementation of "splicing" in S51. For example, when the image extraction feature of the n-th text image to be used is a 4×512 feature vector, the language representation data of the image to be processed may be an N×4×512 feature vector.
It should be noted that the above "1×512", "4×512", and "N×4×512" all refer to the data dimensions of a feature vector.
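Continuing the dimensional example, the language representation data can be obtained by stacking the N per-image features (N = 5 is illustrative):

    import torch

    N = 5
    per_image_features = [torch.randn(4, 512) for _ in range(N)]  # one 4x512 feature per text image
    language_repr = torch.stack(per_image_features)
    print(language_repr.shape)  # torch.Size([5, 4, 512])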
S52: input the language representation data of the image to be processed into a pre-built image language recognition model to obtain the language identification result of the image to be processed output by the image language recognition model.

Here, the "image language recognition model" performs language identification on its input data. The embodiments of the present application do not limit this model; for example, it may be implemented with any machine learning model (for example, a deep learning model based on a self-attention neural network).

In addition, to improve the effect of image language identification, the embodiments of the present application further provide a possible implementation of the "image language recognition model", which may include: L4 second encoding networks, a second linear processing layer, and a recognition layer, where L4 is a positive integer. The embodiments of the present application do not limit L4; for example, as shown in FIG. 7, L4 may be 6.

The "second linear processing layer" performs linear processing on its input data. The embodiments of the present application do not limit its implementation; for example, any linear processing method may be used (for example, the linear module in a Transformer model).

The "recognition layer" performs language classification on its input data. The embodiments of the present application do not limit its implementation; for example, any classification method may be used (for example, the softmax module in a Transformer model).

It should be noted that, for the "second encoding network", reference may be made to the description of the "second encoding network" above.
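A minimal PyTorch sketch of such a model, under the configuration of six Transformer-style encoder layers followed by a linear layer and a softmax classifier, is shown below. The class name, the mean-pooling step, and the number of candidate languages are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class ImageLanguageRecognitionModel(nn.Module):
    """L4 = 6 encoder layers -> linear projection -> softmax over candidate languages."""

    def __init__(self, dim=512, num_layers=6, num_languages=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # "second encoding networks"
        self.linear = nn.Linear(dim, num_languages)  # "second linear processing layer"
        self.softmax = nn.Softmax(dim=-1)            # "recognition layer"

    def forward(self, representation):
        # representation: (N, 4, 512) language representation data, flattened into one token sequence.
        tokens = representation.reshape(1, -1, representation.shape[-1])  # (1, N*4, 512)
        pooled = self.encoder(tokens).mean(dim=1)  # aggregate over all tokens
        return self.softmax(self.linear(pooled))   # (1, num_languages)

model = ImageLanguageRecognitionModel()
print(model(torch.randn(3, 4, 512)).shape)  # torch.Size([1, 10])
```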
In addition, the embodiments of the present application do not limit the construction process of the above "image language recognition model". For example, the model may be built according to the language representation data of a sample image to be used and the actual language of that sample image, where the "language representation data of the sample image to be used" is determined similarly to the "language representation data of the image to be processed" above, and the "actual language of the sample image to be used" indicates the language to which the sample image actually belongs. As another example, the model may be built using the model construction process shown in Method Embodiment 2.

Based on the content of S51-S52 above, after the image extraction features of the first through N-th text images to be used are obtained, these N image extraction features may first be concatenated to obtain the language representation data of the image to be processed; language identification is then performed on the language representation data to obtain the language identification result of the image to be processed, so that the result represents the language of most of the character information in the image to be processed (for example, the Vietnamese shown in FIG. 1).

Based on the content of S1-S5 above, in the image language identification method provided by the embodiments of the present application, after the image to be processed is acquired, N text images to be used are first extracted from the image to be processed according to the text detection result of the image to be processed; the language extraction feature and the visual extraction feature of the n-th text image to be used are then determined, where n is a positive integer, n ≤ N, and N is a positive integer; next, the image extraction feature of the n-th text image to be used is determined according to these two extraction features; finally, the language identification result of the image to be processed is determined according to the image extraction features of the N text images to be used, so that the result accurately represents the language to which the image to be processed belongs. In this way, the language of a piece of image data can be identified accurately.
Method Embodiment 2
To improve the effect of image language identification, the embodiments of the present application further provide a model construction method, which may include steps 11-16:

Step 11: acquire a sample image to be used and the actual language of the sample image to be used.

Here, the "sample image to be used" refers to the image data needed for the model construction process, and it may include character information in at least one language.

The "actual language of the sample image to be used" indicates the language to which the sample image actually belongs. The embodiments of the present application do not limit how it is obtained; for example, it may be obtained by manual annotation.

Step 12: determine at least one sample text image and position description information of the at least one sample text image according to a text detection result of the sample image to be used.

Here, the "text detection result of the sample image to be used" indicates the position of at least one text region in the sample image to be used; it is determined similarly to the "text detection result of the image to be processed" above.

In addition, the "sample text image" is determined similarly to the "text image to be used" above, and the "position description information of the sample text image" is determined similarly to the "position description information of the text image to be used" above.
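As a concrete illustration, the sample text images and their position description can be derived from detection boxes roughly as follows; the box format (x, y, width, height) and the normalization scheme are assumptions made for the sketch, not requirements of the application.

```python
from PIL import Image

def crop_text_regions(sample_image, boxes):
    """Crop each detected text region and record its normalized position."""
    width, height = sample_image.size
    crops, positions = [], []
    for x, y, w, h in boxes:  # assumed (x, y, width, height) pixel boxes
        crops.append(sample_image.crop((x, y, x + w, y + h)))
        positions.append((x / width, y / height, w / width, h / height))
    return crops, positions

image = Image.new("RGB", (640, 480))  # stand-in for a real sample image
crops, positions = crop_text_regions(image, [(10, 20, 200, 40), (15, 80, 180, 35)])
print(len(crops), positions[0])
```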
Step 13: input the at least one sample text image and the position description information of the at least one sample text image into a model to be trained, to obtain a language identification result of the sample image to be used output by the model to be trained.

Here, the "model to be trained" performs image language identification on its input data.

In addition, the embodiments of the present application do not limit the "model to be trained". For example, it may include: a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature concatenation network, and an image language recognition network, where the input data of the image language recognition network includes the output data of the feature concatenation network, and the input data of the feature concatenation network includes the output data of the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network.

The "language feature extraction network" performs language feature extraction on text image data (for example, each sample text image). The embodiments of the present application do not limit its network structure; for example, it may follow the model structure of the "language feature extraction model" above.

The "density feature extraction network" performs character density feature extraction on text image data (for example, each sample text image). The embodiments of the present application do not limit its network structure; for example, it may follow the model structure of the "density feature extraction model" above.

The "color feature extraction network" performs color distribution feature extraction on text image data (for example, each sample text image). The embodiments of the present application do not limit its network structure; for example, it may follow the model structure of the "color feature extraction model" above.

The "position feature extraction network" performs image position feature extraction on the position description information of text image data (for example, each sample text image). The embodiments of the present application do not limit its network structure; for example, it may follow the model structure of the "position feature extraction model" above.

The "feature concatenation network" concatenates its input data. The embodiments of the present application do not limit its working principle; for ease of understanding, it is described below with an example.

As an example, when the above "at least one sample text image" includes K sample text images, the feature concatenation network may work as follows: first, the language extraction feature of the k-th sample text image, the character density feature of the k-th sample text image, the color distribution feature of the k-th sample text image, and the image position feature of the k-th sample text image are concatenated to obtain the image extraction feature of the k-th sample text image, where k is a positive integer, k ≤ K, and K is a positive integer; then, the image extraction features of the first through K-th sample text images are concatenated to obtain the language representation data of the sample image to be used (that is, the output of the "feature concatenation network").

The "image language recognition network" performs language identification on its input data. The embodiments of the present application do not limit its network structure; for example, it may follow the model structure of the "image language recognition model" above.
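To make the wiring of these six networks concrete, here is a compact sketch of one possible composite model. Every sub-network is deliberately a stand-in (simple MLPs over flattened crops, a mean pool over text images instead of the full encoder stack), and all names and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim=512):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

class ModelToBeTrained(nn.Module):
    """Four feature extractors -> feature concatenation -> image language recognition."""

    def __init__(self, img_dim=1024, pos_dim=4, num_languages=10):
        super().__init__()
        self.language_net = mlp(img_dim)   # language feature extraction network
        self.density_net = mlp(img_dim)    # density feature extraction network
        self.color_net = mlp(img_dim)      # color feature extraction network
        self.position_net = mlp(pos_dim)   # position feature extraction network
        self.recognition_net = nn.Linear(4 * 512, num_languages)  # image language recognition network

    def forward(self, text_images, positions):
        # text_images: (K, img_dim) flattened crops; positions: (K, pos_dim) box descriptions.
        per_image = torch.cat([self.language_net(text_images),
                               self.density_net(text_images),
                               self.color_net(text_images),
                               self.position_net(positions)], dim=-1)  # feature concatenation network
        return self.recognition_net(per_image.mean(dim=0, keepdim=True))  # logits, shape (1, num_languages)

model = ModelToBeTrained()
print(model(torch.randn(5, 1024), torch.randn(5, 4)).shape)  # torch.Size([1, 10])
```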
Based on the content of step 13 above, after at least one sample text image and its position description information are obtained, the at least one sample text image and its position description information may be input into the model to be trained, so that the model performs image language identification with reference to them and outputs the language identification result of the sample image to be used.

Step 14: judge whether a preset stop condition is met; if so, execute step 16; if not, execute step 15.

Here, the "preset stop condition" may be set in advance, and the embodiments of the present application do not limit it. For example, it may be that the loss value of the model to be trained is below a first threshold; it may also be that the rate of change of the loss value of the model to be trained is below a second threshold (that is, the image language identification performance of the model to be trained has converged); it may also be that the number of updates of the model to be trained has reached a third threshold.

The "loss value of the model to be trained" represents the image language identification performance of the model to be trained. The embodiments of the present application do not limit how the loss value is determined; any existing or future method for determining a model loss value may be used.
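The three example stop conditions can be checked with a few lines of code; the thresholds below are arbitrary placeholders, not values from the application.

```python
def should_stop(losses, updates, loss_thresh=0.01, delta_thresh=1e-4, max_updates=10000):
    """Return True once any of the three example stop conditions holds."""
    if losses and losses[-1] < loss_thresh:                               # loss below the first threshold
        return True
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) < delta_thresh:  # change rate below the second threshold
        return True
    return updates >= max_updates                                         # update count reached the third threshold
```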
Step 15: update the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and return to step 13.

In the embodiments of the present application, after it is determined that the current round of the model to be trained does not meet the preset stop condition, it can be concluded that the image language identification performance of the model is still poor. The model to be trained may therefore be updated according to the difference between the language identification result of the sample image to be used and the actual language of the sample image to be used, so that the updated model has better image language identification performance; step 13 and its subsequent steps are then executed again based on the updated model, implementing a new round of training of the model to be trained.

It should be noted that the embodiments of the present application do not limit the update process of the model to be trained; any existing or future model update method may be used.
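Putting steps 13-15 together, one possible realization is a standard supervised loop; cross-entropy loss and the Adam optimizer are ordinary choices rather than requirements of the application, and `ModelToBeTrained` and `should_stop` refer to the hypothetical helpers sketched above.

```python
import torch
import torch.nn as nn

def train(model, batches, stop_fn, lr=1e-4):
    """batches yields (text_images, positions, label); label is a LongTensor of shape (1,).
    stop_fn is a step-14 style predicate, e.g. the should_stop helper sketched above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    losses, updates = [], 0
    for text_images, positions, label in batches:
        if stop_fn(losses, updates):               # step 14: check the stop condition
            break
        logits = model(text_images, positions)     # step 13: forward pass
        loss = criterion(logits, label)            # difference from the actual language
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                           # step 15: update the model to be trained
        losses.append(loss.item())
        updates += 1
    return model
```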
Step 16: determine the image language recognition model according to the model to be trained.

In the embodiments of the present application, after it is determined that the current round of the model to be trained has met the preset stop condition, it can be concluded that the model has good image language identification performance; the image language recognition model may therefore be determined according to the model to be trained (for example, the image language recognition network in the model to be trained may be directly determined as the image language recognition model).

In addition, in some application scenarios, the trained model may also be used to determine other models (for example, the language feature extraction model, the density feature extraction model, the color feature extraction model, and the position feature extraction model). On this basis, step 16 may specifically include: determining the language feature extraction network in the model to be trained as the language feature extraction model; determining the density feature extraction network in the model to be trained as the density feature extraction model; determining the color feature extraction network in the model to be trained as the color feature extraction model; determining the position feature extraction network in the model to be trained as the position feature extraction model; and determining the image language recognition network in the model to be trained as the image language recognition model.

Based on the content of steps 11-16 above, in some cases the language feature extraction model, the density feature extraction model, the color feature extraction model, the position feature extraction model, and the image language recognition model can all be constructed through the training process of a single model, so that an image language identification method implemented on the basis of these five models achieves a better image language identification effect.
In addition, to further improve the model construction effect, the embodiments of the present application further provide another possible implementation of the model construction method, in which, besides steps 11-16 above, the method may further include steps 17-21:

Step 17: train a first model using a first text image and the actual language feature of the first text image, so that the trained first model has a good language feature extraction effect.

Step 18: train a second model using a second text image and the actual density feature of the second text image, so that the trained second model has a good character density feature extraction effect.

Step 19: train a third model using a third text image and the actual color feature of the third text image, so that the trained third model has a good color distribution feature extraction effect.

Step 20: train a fourth model using the position description information of a fourth text image and the actual position feature of the fourth text image, so that the trained fourth model has a good image position feature extraction effect.

Step 21: use the trained first model, the trained second model, the trained third model, and the trained fourth model to initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained, respectively.

It should be noted that the embodiments of the present application do not limit the specific implementation of step 21. For example, it may include: determining the trained first model as the initialization result of the language feature extraction network in the model to be trained; determining the trained second model as the initialization result of the density feature extraction network in the model to be trained; determining the trained third model as the initialization result of the color feature extraction network in the model to be trained; and determining the trained fourth model as the initialization result of the position feature extraction network in the model to be trained.
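In PyTorch terms, this initialization amounts to copying the pretrained weights into the corresponding sub-networks, assuming matching architectures; `ModelToBeTrained` is the hypothetical composite model from the earlier sketch.

```python
def initialize_from_pretrained(model, first, second, third, fourth):
    """Copy pretrained weights into the four feature extraction networks (architectures must match)."""
    model.language_net.load_state_dict(first.state_dict())   # step 17's model
    model.density_net.load_state_dict(second.state_dict())   # step 18's model
    model.color_net.load_state_dict(third.state_dict())      # step 19's model
    model.position_net.load_state_dict(fourth.state_dict())  # step 20's model
    return model
```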
Based on the content of steps 17-21 above, in some cases the first through fourth models may first be trained separately; the trained first through fourth models are then used to initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained, to obtain an initialized model to be trained; next, the initialized model to be trained is trained using steps 11-15 above, to obtain a trained model; finally, the language feature extraction model, the density feature extraction model, the color feature extraction model, the position feature extraction model, and the image language recognition model are determined from the trained model.
Based on the image language identification method provided by the above method embodiments, the embodiments of the present application further provide an image language identification apparatus, which is explained and described below with reference to the accompanying drawings.

Device Embodiment

For the technical details of the image language identification apparatus provided by the device embodiment, please refer to the above method embodiments.

Refer to FIG. 8, which is a schematic structural diagram of an image language identification apparatus provided by an embodiment of the present application.

The image language identification apparatus 800 provided by the embodiment of the present application includes:

an image extraction unit 801, configured to, after an image to be processed is acquired, extract N text images to be used from the image to be processed according to a text detection result of the image to be processed, where N is a positive integer;

a feature determination unit 802, configured to determine a language extraction feature of an n-th text image to be used and a visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N;

a feature processing unit 803, configured to determine an image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, where n is a positive integer and n ≤ N;

a language identification unit 804, configured to determine a language identification result of the image to be processed according to the image extraction features of the N text images to be used.
In a possible implementation, the visual extraction feature includes at least one of a character density feature, a color distribution feature, and an image position feature.

In a possible implementation, the feature determination unit 802 includes:

a first determination subunit, configured to input the n-th text image to be used into a pre-built density feature extraction model to obtain the character density feature of the n-th text image to be used output by the density feature extraction model;

a second determination subunit, configured to input the n-th text image to be used into a pre-built color feature extraction model to obtain the color distribution feature of the n-th text image to be used output by the color feature extraction model;

a third determination subunit, configured to input the position description information of the n-th text image to be used into a pre-built position feature extraction model to obtain the image position feature of the n-th text image to be used output by the position feature extraction model.

In a possible implementation, the feature determination unit 802 includes:

a fourth determination subunit, configured to input the n-th text image to be used into a pre-built language feature extraction model to obtain the language extraction feature of the n-th text image to be used output by the language feature extraction model.

In a possible implementation, the visual extraction feature includes a character density feature, a color distribution feature, and an image position feature;

the feature processing unit 803 is specifically configured to concatenate the language extraction feature of the n-th text image to be used, the character density feature of the n-th text image to be used, the color distribution feature of the n-th text image to be used, and the image position feature of the n-th text image to be used, to obtain the image extraction feature of the n-th text image to be used.

In a possible implementation, the language identification unit 804 is specifically configured to: concatenate the image extraction features of the N text images to be used to obtain the language representation data of the image to be processed; and input the language representation data into a pre-built image language recognition model to obtain the language identification result of the image to be processed output by the image language recognition model.

In a possible implementation, the image language identification apparatus 800 further includes:

a model training unit, configured to: acquire a sample image to be used and the actual language of the sample image to be used; determine at least one sample text image and position description information of the at least one sample text image according to a text detection result of the sample image to be used; input the at least one sample text image and the position description information of the at least one sample text image into a model to be trained, to obtain a language identification result of the sample image to be used output by the model to be trained; and update the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and continue to execute the step of inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, until, after a preset stop condition is met, the image language recognition model is determined according to the model to be trained.

In a possible implementation, the model to be trained includes a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature concatenation network, and an image language recognition network, where the input data of the image language recognition network includes the output data of the feature concatenation network, and the input data of the feature concatenation network includes the output data of the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network;

the determination process of the image language recognition model includes: determining the image language recognition network in the model to be trained as the image language recognition model.

In a possible implementation, the image language identification apparatus 800 further includes:

a model initialization unit, configured to: train a first model using a first text image and the actual language feature of the first text image; train a second model using a second text image and the actual density feature of the second text image; train a third model using a third text image and the actual color feature of the third text image; train a fourth model using the position description information of a fourth text image and the actual position feature of the fourth text image; and use the trained first model, the trained second model, the trained third model, and the trained fourth model to initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained, respectively.
Based on the above description of the image language identification apparatus 800, after the image to be processed is acquired, the apparatus first extracts N text images to be used from the image to be processed according to the text detection result of the image to be processed; it then determines the language extraction feature and the visual extraction feature of the n-th text image to be used, where n is a positive integer, n ≤ N, and N is a positive integer; next, it determines the image extraction feature of the n-th text image to be used according to these two extraction features; finally, it determines the language identification result of the image to be processed according to the image extraction features of the N text images to be used, so that the result accurately represents the language to which the image to be processed belongs. In this way, the language of a piece of image data can be identified accurately.
Further, an embodiment of the present application also provides a device, which includes a processor and a memory:

the memory is configured to store a computer program;

the processor is configured to execute, according to the computer program, any implementation of the image language identification method provided by the embodiments of the present application.

Further, an embodiment of the present application also provides a computer-readable storage medium configured to store a computer program, where the computer program is used to execute any implementation of the image language identification method provided by the embodiments of the present application.

Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the image language identification method provided by the embodiments of the present application.

It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.

The above are merely preferred embodiments of the present invention and do not limit the present invention in any way. Although the present invention has been disclosed above in terms of preferred embodiments, these embodiments are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present invention, or modify it into equivalent embodiments. Therefore, any simple modifications, equivalent changes, and modifications made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still fall within the protection scope of the technical solution of the present invention.

Claims (13)

  1. An image language identification method, comprising:
    after acquiring an image to be processed, extracting N text images to be used from the image to be processed according to a text detection result of the image to be processed, wherein N is a positive integer;
    determining a language extraction feature of an n-th text image to be used and a visual extraction feature of the n-th text image to be used, wherein n is a positive integer and n ≤ N;
    determining an image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, wherein n is a positive integer and n ≤ N;
    determining a language identification result of the image to be processed according to the image extraction features of the N text images to be used.
  2. The method according to claim 1, wherein the visual extraction feature comprises at least one of a character density feature, a color distribution feature, and an image position feature.
  3. The method according to claim 2, wherein the determination process of the character density feature of the n-th text image to be used comprises:
    inputting the n-th text image to be used into a pre-built density feature extraction model to obtain the character density feature of the n-th text image to be used output by the density feature extraction model;
    the determination process of the color distribution feature of the n-th text image to be used comprises:
    inputting the n-th text image to be used into a pre-built color feature extraction model to obtain the color distribution feature of the n-th text image to be used output by the color feature extraction model;
    the determination process of the image position feature of the n-th text image to be used comprises:
    inputting position description information of the n-th text image to be used into a pre-built position feature extraction model to obtain the image position feature of the n-th text image to be used output by the position feature extraction model.
  4. The method according to claim 1, wherein the determination process of the language extraction feature of the n-th text image to be used comprises:
    inputting the n-th text image to be used into a pre-built language feature extraction model to obtain the language extraction feature of the n-th text image to be used output by the language feature extraction model.
  5. The method according to claim 1, wherein the visual extraction feature comprises a character density feature, a color distribution feature, and an image position feature;
    the determining an image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used comprises:
    concatenating the language extraction feature of the n-th text image to be used, the character density feature of the n-th text image to be used, the color distribution feature of the n-th text image to be used, and the image position feature of the n-th text image to be used, to obtain the image extraction feature of the n-th text image to be used.
  6. The method according to claim 1, wherein the determining a language identification result of the image to be processed according to the image extraction features of the N text images to be used comprises:
    concatenating the image extraction features of the N text images to be used to obtain language representation data of the image to be processed;
    inputting the language representation data into a pre-built image language recognition model to obtain the language identification result of the image to be processed output by the image language recognition model.
  7. The method according to claim 6, wherein the construction process of the image language recognition model comprises:
    acquiring a sample image to be used and an actual language of the sample image to be used;
    determining at least one sample text image and position description information of the at least one sample text image according to a text detection result of the sample image to be used;
    inputting the at least one sample text image and the position description information of the at least one sample text image into a model to be trained, to obtain a language identification result of the sample image to be used output by the model to be trained;
    updating the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and continuing to execute the step of inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, until, after a preset stop condition is met, the image language recognition model is determined according to the model to be trained.
  8. The method according to claim 7, wherein the model to be trained comprises a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature concatenation network, and an image language recognition network, wherein input data of the image language recognition network comprises output data of the feature concatenation network, and input data of the feature concatenation network comprises output data of the language feature extraction network, output data of the density feature extraction network, output data of the color feature extraction network, and output data of the position feature extraction network;
    the determining the image language recognition model according to the model to be trained comprises:
    determining the image language recognition network in the model to be trained as the image language recognition model.
  9. The method according to claim 8, wherein, before the inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, the construction process of the image language recognition model further comprises:
    training a first model using a first text image and an actual language feature of the first text image;
    training a second model using a second text image and an actual density feature of the second text image;
    training a third model using a third text image and an actual color feature of the third text image;
    training a fourth model using position description information of a fourth text image and an actual position feature of the fourth text image;
    using the trained first model, the trained second model, the trained third model, and the trained fourth model to initialize the language feature extraction network, the density feature extraction network, the color feature extraction network, and the position feature extraction network in the model to be trained, respectively.
  10. An image language identification apparatus, comprising:
    an image extraction unit, configured to, after an image to be processed is acquired, extract N text images to be used from the image to be processed according to a text detection result of the image to be processed, wherein N is a positive integer;
    a feature determination unit, configured to determine a language extraction feature of an n-th text image to be used and a visual extraction feature of the n-th text image to be used, wherein n is a positive integer and n ≤ N;
    a feature processing unit, configured to determine an image extraction feature of the n-th text image to be used according to the language extraction feature of the n-th text image to be used and the visual extraction feature of the n-th text image to be used, wherein n is a positive integer and n ≤ N;
    a language identification unit, configured to determine a language identification result of the image to be processed according to the image extraction features of the N text images to be used.
  11. A device, comprising a processor and a memory:
    the memory is configured to store a computer program;
    the processor is configured to execute, according to the computer program, the method according to any one of claims 1-9.
  12. A computer-readable storage medium, configured to store a computer program, wherein the computer program is used to execute the method according to any one of claims 1-9.
  13. A computer program product which, when run on a terminal device, causes the terminal device to execute the method according to any one of claims 1-9.