CN113822275A - Image language identification method and related equipment thereof - Google Patents

Image language identification method and related equipment thereof

Info

Publication number
CN113822275A
Authority
CN
China
Prior art keywords
image
model
language
text
nth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111138638.8A
Other languages
Chinese (zh)
Inventor
毛晓飞
黄灿
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202111138638.8A priority Critical patent/CN113822275A/en
Publication of CN113822275A publication Critical patent/CN113822275A/en
Priority to PCT/CN2022/116011 priority patent/WO2023045721A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/263Language identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The application discloses an image language identification method and related equipment thereof. The method comprises the following steps: after an image to be processed is obtained, extracting N text images to be used from the image to be processed according to a text detection result of the image to be processed, wherein N is a positive integer; then determining the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image, wherein n is a positive integer and n is not more than N; then determining the image extraction feature of the nth to-be-used text image according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; and finally, determining the language identification result of the image to be processed according to the image extraction features of the N text images to be used, so that the language identification result can accurately represent the language to which the image to be processed belongs.

Description

Image language identification method and related equipment thereof
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image language identification method and a related device.
Background
In some application scenarios, it is necessary to determine the language to which a piece of image data carrying character information belongs. For example, if a piece of image data carries a large number of Chinese characters, the language to which it belongs is Chinese; if a piece of image data carries a large number of English words, the language to which it belongs is English; and so on.
However, how to identify the language to which a piece of image data belongs remains a technical problem to be solved.
Disclosure of Invention
In order to solve the above technical problem, the present application provides an image language identification method and related device, which can accurately identify a language to which image data belongs.
In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
the embodiment of the application provides an image language identification method, which comprises the following steps:
after an image to be processed is obtained, extracting N text images to be used from the image to be processed according to a text detection result of the image to be processed; wherein N is a positive integer;
determining language extraction features of the nth to-be-used text image and visual extraction features of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
determining the image extraction characteristics of the nth to-be-used text image according to the language extraction characteristics of the nth to-be-used text image and the visual extraction characteristics of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
and determining the language identification result of the image to be processed according to the image extraction characteristics of the N text images to be used.
In one possible implementation, the visually extracted features include at least one of character density features, color distribution features, and image location features.
In a possible implementation manner, the process of determining the character density feature of the nth text image to be used includes:
inputting the nth to-be-used text image into a pre-constructed density feature extraction model to obtain the character density feature of the nth to-be-used text image output by the density feature extraction model;
the process for determining the color distribution characteristics of the nth text image to be used comprises the following steps:
inputting the nth to-be-used text image into a pre-constructed color feature extraction model to obtain the color distribution feature of the nth to-be-used text image output by the color feature extraction model;
the process for determining the image position feature of the nth text image to be used comprises the following steps:
and inputting the position description information of the nth to-be-used text image into a pre-constructed position feature extraction model to obtain the image position feature of the nth to-be-used text image output by the position feature extraction model.
In a possible implementation manner, the process of determining the language extraction feature of the nth to-be-used text image includes:
and inputting the nth to-be-used text image into a pre-constructed language feature extraction model to obtain the language extraction features of the nth to-be-used text image output by the language feature extraction model.
In one possible implementation, the visual extraction features include character density features, color distribution features, and image location features;
the determining the image extraction features of the nth to-be-used text image according to the language extraction features of the nth to-be-used text image and the visual extraction features of the nth to-be-used text image includes:
and splicing the language extraction features of the nth to-be-used text image, the character density features of the nth to-be-used text image, the color distribution features of the nth to-be-used text image and the image position features of the nth to-be-used text image to obtain the image extraction features of the nth to-be-used text image.
In a possible implementation manner, the determining, according to the image extraction features of the N text images to be used, a language identification result of the image to be processed includes:
splicing the image extraction features of the N text images to be used to obtain language representation data of the images to be processed;
and inputting the language representation data into a pre-constructed image language identification model to obtain a language identification result of the image to be processed, which is output by the image language identification model.
In a possible implementation manner, the construction process of the image language identification model includes:
acquiring a sample image to be used and the actual language of the sample image to be used;
determining at least one sample text image and position description information of the at least one sample text image according to the text detection result of the sample image to be used;
inputting the at least one sample text image and the position description information of the at least one sample text image into a model to be trained to obtain a language identification result of the sample image to be used, which is output by the model to be trained;
updating the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and continuing to execute the step of inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, until a preset stop condition is reached and the image language identification model is determined according to the model to be trained.
In one possible implementation, the model to be trained includes a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature splicing network, and an image language identification network; wherein, the input data of the image language identification network comprises the output data of the feature splicing network; the input data of the feature splicing network comprises output data of the language feature extraction network, output data of the density feature extraction network, output data of the color feature extraction network and output data of the position feature extraction network;
the determining the image language identification model according to the model to be trained comprises:
and determining the image language identification network in the model to be trained as the image language identification model.
In a possible implementation manner, before the inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, the construction process of the image language identification model further includes:
training a first model by using a first text image and the actual language features of the first text image;
training a second model by using a second text image and the actual density characteristic of the second text image;
training a third model by using a third text image and the actual color features of the third text image;
training a fourth model by using the position description information of a fourth text image and the actual position characteristic of the fourth text image;
respectively initializing the language feature extraction network, the density feature extraction network, the color feature extraction network and the position feature extraction network in the model to be trained by using the trained first model, the trained second model, the trained third model and the trained fourth model.
The embodiment of the present application further provides an image language identification device, including:
the image extraction unit is used for extracting N text images to be used from the image to be processed according to the text detection result of the image to be processed after the image to be processed is obtained; wherein N is a positive integer;
the feature determining unit is used for determining the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
the feature processing unit is used for determining the image extraction feature of the nth to-be-used text image according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
and the language identification unit is used for determining a language identification result of the image to be processed according to the image extraction characteristics of the N text images to be used.
An embodiment of the present application further provides an apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to execute any implementation manner of the image language identification method provided by the embodiment of the application according to the computer program.
The embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation manner of the image language identification method provided in the embodiment of the present application.
The embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal device, the terminal device is enabled to execute any implementation manner of the image language identification method provided in the embodiment of the present application.
Compared with the prior art, the embodiment of the application has at least the following advantages:
according to the technical scheme provided by the embodiment of the application, after the image to be processed is obtained, N text images to be used are extracted from the image to be processed according to the text detection result of the image to be processed, wherein N is a positive integer; the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image are then determined, wherein n is a positive integer and n is not more than N; the image extraction feature of the nth to-be-used text image is then determined according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; finally, the language identification result of the image to be processed is determined according to the image extraction features of the N text images to be used, so that the language identification result can accurately represent the language to which the image to be processed belongs, and thus the language of a piece of image data can be accurately identified.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative efforts.
Fig. 1 is a schematic diagram of image data provided in an embodiment of the present application;
fig. 2 is a flowchart of an image language identification method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a text area according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a language feature extraction model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a density feature extraction model according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a color feature extraction model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image language identification model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image language identification device according to an embodiment of the present application.
Detailed Description
In the course of research on the above-mentioned "language to which image data belongs", the inventors found that, for a piece of image data (such as the image data shown in fig. 1), the language to which the image data belongs can be determined based on the language to which the majority of the character information carried by the image data belongs. For ease of understanding, the following description is made with reference to an example.
As an example, although the image data shown in fig. 1 carries both character information belonging to Vietnamese and character information belonging to English, since the amount of character information belonging to Vietnamese is much larger than the amount of character information belonging to English, it can be determined that the language to which the image data belongs is Vietnamese.
Based on the above findings, in order to solve the technical problems in the background art section, an embodiment of the present application provides an image language identification method, including: after an image to be processed is obtained, extracting N text images to be used from the image to be processed according to a text detection result of the image to be processed, wherein N is a positive integer; then determining the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image, wherein n is a positive integer and n is not more than N; then determining the image extraction feature of the nth to-be-used text image according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; and finally, determining the language identification result of the image to be processed according to the image extraction features of the N text images to be used, so that the language identification result can accurately represent the language to which the image to be processed belongs, and the language of a piece of image data can be accurately identified.
In addition, the embodiment of the present application does not limit the execution subject of the image language identification method, and for example, the image language identification method provided by the embodiment of the present application may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like. The server may be a stand-alone server, a cluster server, or a cloud server.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Method embodiment one
Referring to fig. 2, the figure is a flowchart of an image language identification method according to an embodiment of the present application.
The image language identification method provided by the embodiment of the application comprises the following steps of S1-S5:
S1: after the image to be processed is obtained, extracting N text images to be used from the image to be processed according to the text detection result of the image to be processed. Wherein N is a positive integer.
The "image to be processed" refers to image data (for example, the image data shown in fig. 1) that needs to be subjected to image language identification processing; and the "image to be processed" includes character information in at least one language.
The text detection result of the image to be processed is used for indicating the position of at least one text region in the image to be processed. For example, as shown in fig. 3, when the image to be processed is the image data shown in fig. 1, the text detection result of the image to be processed may include the position description data of the first text region, the position description data of the second text region, …, and the position description data of the fifth text region. Wherein "position description data of the first text region" is used to indicate the position of the first text region in the image data shown in fig. 1; "position description data of the second text region" is used to indicate the position of the second text region in the image data shown in fig. 1; and so on; "position description data of the fifth text region" is used to indicate the position of the fifth text region in the image data shown in fig. 1.
It should be noted that the present embodiment is not limited to the above-described manner of representing the "position description data", and may be represented by, for example, four vertex coordinates of one text region.
In addition, the embodiment of the present application does not limit the determination process of the "text detection result of the image to be processed", for example, it may specifically be: and inputting the image to be processed into a pre-constructed text detection model to obtain a text detection result of the image to be processed output by the text detection model.
The text detection model is used for carrying out text position detection processing on input data of the text detection model; the embodiment of the present application is not limited to the "text detection model", and may be implemented by using any machine learning model (for example, a deep learning model based on a convolutional neural network).
The "text detection model" described above may be constructed from the first sample image and the actual text position of the first sample image. Wherein, the "actual text position of the first sample image" is used to indicate the actual positions of all text regions in the first sample image; in addition, the embodiment of the present application does not limit the manner of acquiring the "actual text position of the first sample image", and for example, the method may be implemented by a manual labeling method.
The nth text image to be used is used for representing image information carried by the nth text region in the image to be processed. For example, when the "text detection result of the image to be processed" includes the position description data of the nth text region, the image to be processed may be subjected to image cropping processing according to the position description data of the nth text region to obtain the nth to-be-used text image, so that the nth to-be-used text image includes the nth text region. Wherein n is a positive integer and n is less than or equal to N.
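For ease of understanding, a minimal illustrative sketch of this cropping step is given below. It assumes each text region is described by an axis-aligned box in (left, top, right, bottom) pixel coordinates; the Pillow library and names such as `crop_text_images` are assumptions of this illustration and are not specified by the embodiment.

```python
# Illustrative sketch only: crop N to-be-used text images from a to-be-processed image,
# assuming each detected text region is given as an axis-aligned (left, top, right, bottom) box.
from PIL import Image

def crop_text_images(image_path, boxes):
    """Return one cropped to-be-used text image per detected text region."""
    image = Image.open(image_path).convert("RGB")
    text_images = []
    for left, top, right, bottom in boxes:
        # The nth to-be-used text image contains exactly the nth text region.
        text_images.append(image.crop((left, top, right, bottom)))
    return text_images

# Usage example (coordinates are made up, e.g. five regions as in fig. 3):
# crops = crop_text_images("to_be_processed.jpg", [(10, 20, 300, 60), (10, 70, 280, 110)])
```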
Based on the related content of S1, after the to-be-processed image is acquired, text detection processing may be performed on the to-be-processed image to obtain a text detection result of the to-be-processed image, so that the text detection result can indicate a position of at least one text region in the to-be-processed image; and extracting at least one text image to be used from the image to be processed according to the text detection result, so that each text image to be used comprises each text region, and each text image to be used can represent the image information carried by each text region.
S2: and determining the language extraction characteristics of the nth text image to be used. Wherein n is a positive integer and n is less than or equal to N.
The "language extraction feature of the nth text image to be used" is used to represent language information carried by the nth text region in the image to be processed.
In addition, the embodiment of S2 is not limited in this application, and for example, it may specifically include: and inputting the nth to-be-used text image into a pre-constructed language feature extraction model to obtain the language extraction features of the nth to-be-used text image output by the language feature extraction model.
The language feature extraction model is used for performing language feature extraction processing on input data of the language feature extraction model; the language feature extraction model is not limited in the embodiments of the present application, and may be implemented by using any machine learning model (for example, a deep learning model based on a neural network for self-attention learning).
In addition, the language feature extraction model may be constructed according to the first text image and the actual language feature of the first text image. The "actual language feature of the first text image" is used to represent language information actually carried by the first text image; in addition, the embodiment of the present application does not limit the manner of acquiring the "actual language feature of the first text image", and for example, the method may be implemented by a manual labeling method.
It should be noted that the embodiment of the present application does not limit the above-mentioned construction process of the language feature extraction model; for example, the construction process may be implemented by using any existing or future machine learning model construction method. As another example, it may be implemented by the model construction method shown in Method embodiment two.
In addition, in order to improve the extraction effect of the language features, the embodiment of the present application further provides a possible implementation manner of the "language feature extraction model", which may specifically include: the device comprises an image feature extraction layer, a position coding layer, a first feature fusion layer, a first feature coding layer and a first linear processing layer. Wherein the input data of the first linear processing layer comprises the output data of the first characteristic coding layer; the input data of the first feature encoding layer comprises the output data of the first feature fusion layer; the input data of the first feature fusion layer includes output data of the image feature extraction layer and output data of the position encoding layer.
The "image feature extraction layer" is configured to perform image feature extraction processing on one text image data (for example, the nth to-be-used text image); the embodiment of the present application is not limited to the implementation of the "image feature extraction layer", and for example, the embodiment may be implemented by using a Convolutional Neural Network (CNN) shown in fig. 4.
The "position coding layer" is used for performing position coding processing on one text image data; the embodiment of the present application is not limited to the implementation of the "position coding layer," and may be implemented by any position coding processing method (for example, a Positional Encoding module in a transform model).
The "first feature fusion layer" is used to perform feature fusion processing (for example, addition processing shown in fig. 4) on the input data of the first feature fusion layer; the present embodiment is not limited to the embodiment of the "first feature fusion layer", and may be implemented by any feature fusion processing method (for example, a feature fusion processing method according to a transform model).
The "first feature encoding layer" is configured to perform encoding processing on input data of the first feature encoding layer; furthermore, the embodiment of the present application is not limited to the "first feature encoding layer", and for example, L shown in fig. 4 may be used1A first coding network (e.g., an Encoder module in the transform model) is implemented. Wherein L is1Is a positive integer.
The first linear processing layer is used for carrying out linear processing on the input data of the first linear processing layer; the present embodiment is not limited to the implementation of the "first linear processing layer", and may be implemented by any linear processing method (for example, a linear module in a transform model).
It should be noted that, in the language feature extraction model shown in fig. 4, "CNN" represents the above-mentioned "image feature extraction layer"; "Positional Encoding" denotes the above-mentioned "position coding layer"; "+" denotes the "first feature fusion layer" described above; "Multi-head attention" refers to a multi-head self-attention network; "Add & Norm" refers to feature addition processing and feature normalization processing; "Feed forward" refers to a feed-forward neural network; and "L1" indicates the number of first coding networks.
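For ease of understanding, a minimal PyTorch sketch of the fig. 4 structure (CNN, position coding, feature fusion by addition, L1 encoder blocks, and a linear layer) is given below. The library choice, the layer sizes, d_model = 512, the sinusoidal position coding, and names such as `LanguageFeatureExtractor` are assumptions of this illustration and are not values fixed by the embodiment.

```python
# Illustrative sketch of a language feature extraction model in the style of fig. 4.
import math
import torch
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    def __init__(self, d_model=512, n_heads=8, num_layers=2):
        super().__init__()
        # Image feature extraction layer ("CNN" in fig. 4); channel sizes are assumptions.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # First feature coding layer: L1 encoder blocks
        # (multi-head self-attention + Add & Norm + feed-forward).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # First linear processing layer: project to one 1 x 512 language extraction feature.
        self.linear = nn.Linear(d_model, d_model)

    def positional_encoding(self, length, d_model, device):
        # Position coding layer: standard sinusoidal encoding (an assumption of this sketch).
        pos = torch.arange(length, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, device=device)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, text_image):                       # (B, 3, H, W)
        feats = self.cnn(text_image)                     # (B, 512, H', W')
        seq = feats.flatten(2).transpose(1, 2)           # (B, H'*W', 512)
        # First feature fusion layer: "+" of image features and position codes.
        seq = seq + self.positional_encoding(seq.size(1), seq.size(2), seq.device)
        encoded = self.encoder(seq)                      # first feature coding layer
        return self.linear(encoded.mean(dim=1))          # (B, 512) language extraction feature
```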
Based on the above related content of S2, after the nth to-be-used text image is acquired, language feature extraction processing may be performed on the nth to-be-used text image to obtain the language extraction feature of the nth to-be-used text image, so that the language extraction feature can indicate the language information carried by the nth text region in the to-be-processed image. Wherein n is a positive integer and n is less than or equal to N.
S3: and determining the visual extraction characteristics of the nth text image to be used. Wherein n is a positive integer and n is less than or equal to N.
Here, the "visual extraction feature of the nth to-be-used text image" is used to represent image feature information (e.g., character density, color distribution, position distribution in the to-be-processed image, etc.) carried by the nth text region in the to-be-processed image.
In addition, the embodiment of the present application does not limit the above-mentioned "visual extraction feature of the nth to-be-used text image", for example, it may specifically include at least one of a character density feature of the nth to-be-used text image, a color distribution feature of the nth to-be-used text image, and an image position feature of the nth to-be-used text image.
The character density characteristic of the nth text image to be used is used for expressing the character distribution density in the nth text image to be used; the determination process of the "character density feature of the nth to-be-used text image" is not limited in the embodiment of the present application, and for example, the determination process may specifically include: inputting the nth text image to be used into a pre-constructed density feature extraction model to obtain the character density feature of the nth text image to be used output by the density feature extraction model.
The density feature extraction model is used for performing character density feature extraction processing on input data of the density feature extraction model; the embodiment of the present application is not limited to the "density feature extraction model", and may be implemented using any machine learning model (for example, a deep learning model based on a neural network for self-attention learning).
In addition, the "density feature extraction model" may be constructed based on the second text image and the actual density feature of the second text image. The "actual density feature of the second text image" is used for representing the actual character distribution density in the second text image; in addition, the embodiment of the present application does not limit the manner of acquiring the "actual density feature of the second text image", and for example, the method may be implemented by a manual labeling method.
It should be noted that the embodiment of the present application does not limit the construction process of the above-mentioned "density feature extraction model"; for example, the construction process may be implemented by using any existing or future machine learning model construction method. As another example, it may be implemented by the model construction method shown in Method embodiment two. In addition, the embodiment of the present application also does not limit the association relationship between the "second text image" and the "first text image"; for example, the two may refer to the same text image data or to different text image data.
In addition, in order to improve the extraction effect of the character density feature, the embodiment of the present application further provides a possible implementation manner of the "density feature extraction model", which may specifically include: an image feature extraction layer and L2 second coding networks. Here, L2 is a positive integer, and the value of L2 is not limited in the embodiments of the present application; for example, as shown in fig. 5, L2 may be 4.
The content of the "image feature extraction layer" refers to the content of the "image feature extraction layer" above. In addition, the embodiment of the present application is not limited to the "second coding network", and may be implemented by using any existing or future coding network (for example, an Encoder module in a transform model, an Encoder module in a former model, or the like).
The "color distribution characteristic of the nth text image to be used" is used to indicate a color distribution state (in particular, a difference between a character color and a background color) in the nth text image to be used; the embodiment of the present application does not limit the determination process of the "color distribution characteristic of the nth to-be-used text image", and for example, the determination process may specifically include: and inputting the nth to-be-used text image into a pre-constructed color feature extraction model to obtain the color distribution feature of the nth to-be-used text image output by the color feature extraction model.
The color feature extraction model is used for performing color distribution feature extraction processing on input data of the color feature extraction model; the present embodiment is not limited to the "color feature extraction model", and may be implemented by using any machine learning model (for example, a deep learning model based on a convolutional neural network).
In addition, the "color feature extraction model" may be constructed from the third text image and the actual color features of the third text image. Wherein, the "actual color feature of the third text image" is used to represent the actual color distribution state in the third text image; in addition, the embodiment of the present application does not limit the manner of obtaining the "actual color feature of the third text image", and for example, the method may be implemented by a manual labeling method.
It should be noted that the embodiment of the present application does not limit the construction process of the above-mentioned "color feature extraction model"; for example, the construction process may be implemented by using any existing or future machine learning model construction method. As another example, it may be implemented by the model construction method shown in Method embodiment two. In the embodiment of the present application, the association relationship among the "third text image", the "second text image", and the "first text image" is not limited; for example, the three may refer to the same text image data or to different text image data.
In addition, in order to improve the extraction effect of the color distribution features, the embodiment of the present application further provides a possible implementation manner of the "color feature extraction model", which may specifically include: an image feature extraction layer and L3 second coding networks. Here, L3 is a positive integer, and the value of L3 is not limited in the embodiments of the present application; for example, as shown in fig. 6, L3 may be 2.
It should be noted that, for the "image feature extraction layer" and the "second coding network", refer to the relevant content above.
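Because the density feature extraction model (fig. 5) and the color feature extraction model (fig. 6) share the same skeleton of an image feature extraction layer followed by a stack of second coding networks, a single illustrative sketch can cover both; only the stack depth (4 versus 2 in the figures) differs. The pooling to a single 1 x 512 vector, d_model = 512, and the class name are assumptions of this illustration.

```python
# Illustrative sketch of the shared "image feature extraction layer + L second coding networks"
# skeleton of figs. 5 and 6; only the number of encoder blocks is taken from the figures.
import torch
import torch.nn as nn

class StackedEncoderExtractor(nn.Module):
    def __init__(self, num_layers, d_model=512, n_heads=8):
        super().__init__()
        self.cnn = nn.Sequential(                       # image feature extraction layer
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)   # L second coding networks

    def forward(self, text_image):                      # (B, 3, H, W)
        seq = self.cnn(text_image).flatten(2).transpose(1, 2)
        return self.encoder(seq).mean(dim=1)            # (B, 512) feature vector

# Density feature extraction model: L2 = 4 second coding networks (fig. 5).
density_model = StackedEncoderExtractor(num_layers=4)
# Color feature extraction model: L3 = 2 second coding networks (fig. 6).
color_model = StackedEncoderExtractor(num_layers=2)
```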
The "image position feature of the nth to-be-used text image" refers to a position distribution state of character information in the nth to-be-used text image in the "to-be-processed image"; moreover, the embodiment of the present application does not limit the determination process of the "image position feature of the nth to-be-used text image", and for example, the determination process may specifically include: and inputting the position description information of the nth to-be-used text image into a pre-constructed position feature extraction model to obtain the image position feature of the nth to-be-used text image output by the position feature extraction model.
The "position description information of the nth to-be-used text image" is used for describing the position of the character information in the nth to-be-used text image in the "to-be-processed image"; moreover, the embodiment of the present application does not limit the determination process of the "position description information of the nth to-be-used text image", for example, when the above-mentioned "nth to-be-used text image" is used to indicate the image information carried by the nth text region in the to-be-processed image, the position description information of the nth text region in the to-be-processed image may be determined as the position description information of the nth to-be-used text image.
The "position feature extraction model" is used for performing image position feature extraction processing on input data of the position feature extraction model; the embodiment of the present application is not limited to the "position feature extraction model", and may be implemented using any machine learning model (for example, a machine learning model based on a full connection layer). As another example, the "location feature extraction model" may include 2 fully connected layers.
In addition, the "position feature extraction model" may be constructed based on the position description information of the fourth text image and the actual position feature of the fourth text image.
The "position description information of the fourth text image" is used for indicating the position of the character information in the fourth text image in the sample image to be processed; and the fourth text image is obtained by performing image interception processing on the sample image to be processed.
The "actual position feature of the fourth text image" is used for representing an actual position distribution state of character information in the fourth text image in the sample image to be processed; in addition, the embodiment of the present application does not limit the manner of acquiring the "actual position feature of the fourth text image", and for example, the method may be implemented by a manual labeling method.
It should be noted that the embodiment of the present application does not limit the construction process of the above "position feature extraction model"; for example, it can be implemented by using any existing or future machine learning model construction method. As another example, it may be implemented by the model construction method shown in Method embodiment two. In the embodiment of the present application, the association relationship among the "fourth text image", the "third text image", the "second text image", and the "first text image" is not limited; for example, the four may refer to the same text image data or to different text image data.
Based on the above-mentioned related content of S3, after the nth text image to be used is acquired, preset visual feature extraction processing (e.g., character density feature extraction processing, color distribution feature extraction processing, image position feature extraction processing, and the like) may be performed on the nth text image to be used to obtain the visual extraction feature of the nth text image to be used, so that the visual extraction feature can represent the image feature information (e.g., character density, color distribution, position distribution in the image to be processed, and the like) carried by the nth text region in the image to be processed. Wherein n is a positive integer and n is less than or equal to N.
S4: and determining the image extraction characteristics of the nth to-be-used text image according to the language extraction characteristics of the nth to-be-used text image and the visual extraction characteristics of the nth to-be-used text image. Wherein n is a positive integer and n is less than or equal to N.
The "image extraction feature of the nth to-be-used text image" is used to indicate image information (e.g., information about a character language, a character density, a color distribution, a position distribution in the to-be-processed image, etc.) carried by the nth to-be-used text image, so that the "image extraction feature of the nth to-be-used text image" can accurately indicate image information carried by the nth text region in the to-be-processed image.
In addition, the present embodiment is not limited to the implementation of S4, for example, when the above "visual extraction feature" includes a character density feature, a color distribution feature, and an image position feature, S4 may specifically include: and splicing the language extraction feature of the nth to-be-used text image, the character density feature of the nth to-be-used text image, the color distribution feature of the nth to-be-used text image and the image position feature of the nth to-be-used text image to obtain the image extraction feature of the nth to-be-used text image.
It should be noted that the embodiment of the present application does not limit the implementation of the "splicing"; for example, when the language extraction feature of the nth to-be-used text image, the character density feature of the nth to-be-used text image, the color distribution feature of the nth to-be-used text image, and the image position feature of the nth to-be-used text image are all 1 × 512 feature vectors, the image extraction feature of the nth to-be-used text image may be a 4 × 512 feature vector.
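For ease of understanding, a minimal sketch of this splicing step is given below; the shapes follow the 1 × 512 / 4 × 512 example above, and concatenating with torch.cat is only one possible way of splicing assumed for this illustration (the random tensors stand in for the four extraction features).

```python
# Illustrative sketch of S4: splice the four 1 x 512 features of the nth to-be-used
# text image into one 4 x 512 image extraction feature.
import torch

language_feat = torch.randn(1, 512)   # language extraction feature
density_feat  = torch.randn(1, 512)   # character density feature
color_feat    = torch.randn(1, 512)   # color distribution feature
position_feat = torch.randn(1, 512)   # image position feature

image_extraction_feature = torch.cat(
    [language_feat, density_feat, color_feat, position_feat], dim=0)
print(image_extraction_feature.shape)  # torch.Size([4, 512])
```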
Based on the above-mentioned related content of S4, after the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image are obtained, the image extraction feature of the nth to-be-used text image may be determined by referring to the above-mentioned two extraction features, so that the image extraction feature can accurately represent the image information (e.g., information such as the language of characters, the density of characters, the color distribution, the position distribution in the to-be-processed image, etc.) carried by the nth text region in the to-be-processed image.
S5: and determining the language identification result of the image to be processed according to the image extraction characteristics of the N text images to be used.
The language identification result of the image to be processed is used to indicate the language to which the image to be processed belongs, so that the language identification result of the image to be processed can accurately indicate the language to which most of the character information in the image to be processed belongs (for example, vietnamese shown in fig. 1).
In addition, the examples of the present application do not limit the implementation manner of S5, and for example, it may specifically include S51-S52:
S51: and splicing the image extraction features of the N text images to be used to obtain language representation data of the image to be processed.
The language representation data of the image to be processed is used to represent the distribution characteristics (e.g., distribution range, distribution position, etc.) of at least one language in the image to be processed.
In addition, the embodiment of the present application does not limit the implementation of the "splicing" in S51; for example, when the image extraction feature of the nth to-be-used text image is a 4 × 512 feature vector, the language representation data of the to-be-processed image may be an N × 4 × 512 feature vector.
The above-mentioned "1 × 512", "4 × 512", and "N × 4 × 512" all refer to the data dimension of one feature vector.
S52: and inputting language representation data of the image to be processed into a pre-constructed image language identification model to obtain a language identification result of the image to be processed, which is output by the image language identification model.
The image language identification model is used for carrying out language identification processing on input data of the image language identification model; the embodiment of the present application does not limit the "image language identification model", which may be implemented by using any machine learning model (for example, a deep learning model based on a neural network for self-attention learning).
In addition, in order to improve the recognition effect of the image language, the embodiment of the present application further provides a possible implementation manner of the "image language identification model", which may specifically include: L4 second coding networks, a second linear processing layer, and a recognition layer. Here, L4 is a positive integer, and the value of L4 is not limited in the embodiments of the present application; for example, as shown in fig. 7, L4 may be 6.
The "second linear processing layer" is configured to perform linear processing on input data of the second linear processing layer; the present embodiment is not limited to the implementation of the "second linear processing layer", and may be implemented by any linear processing method (for example, a linear module in a transform model).
The recognition layer is used for carrying out language classification processing on the input data of the recognition layer; the embodiment of the present application is not limited to the implementation of the "recognition layer," and may be implemented by any classification method (for example, softmax module in a transform model).
It should be noted that the relevant content of the "second coding network" refers to the relevant content of the "second coding network" above.
In addition, the embodiment of the present application does not limit the construction process of the "image language identification model"; for example, the "image language identification model" may be constructed according to the language representation data of the sample image to be used and the actual language of the sample image to be used. The determination process of the "language representation data of the sample image to be used" is similar to the determination process of the "language representation data of the image to be processed" described above. The "actual language of the sample image to be used" is used to indicate the language to which the sample image to be used actually belongs. As another example, the construction may be performed by the model construction process shown in Method embodiment two.
Based on the related contents of S51 to S52, after the image extraction features of the 1st to-be-used text image, the image extraction features of the 2nd to-be-used text image, …, and the image extraction features of the Nth to-be-used text image are obtained, the image extraction features of the N to-be-used text images may first be spliced to obtain the language representation data of the to-be-processed image; and language identification processing is then performed on the language representation data to obtain the language identification result of the image to be processed, so that the language identification result can indicate the language to which most of the character information in the image to be processed belongs (for example, Vietnamese as shown in fig. 1).
Based on the above-mentioned related contents of S1 to S5, for the image language identification method provided in the embodiment of the present application, after the image to be processed is acquired, N text images to be used are extracted from the image to be processed according to the text detection result of the image to be processed, wherein N is a positive integer; the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image are then determined, wherein n is a positive integer and n is not more than N; the image extraction feature of the nth to-be-used text image is then determined according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; finally, the language identification result of the image to be processed is determined according to the image extraction features of the N text images to be used, so that the language identification result can accurately represent the language to which the image to be processed belongs, and thus the language of a piece of image data can be accurately identified.
Method embodiment two
In order to improve the recognition effect of the language of the image, an embodiment of the present application further provides a model construction method, which may specifically include steps 11 to 16:
step 11: and acquiring a sample image to be used and the actual language of the sample image to be used.
The 'to-be-used sample image' refers to image data required to be used in the model construction process; and the "sample image to be used" may include character information in at least one language.
The 'actual language of the sample image to be used' is used for indicating the language to which the sample image to be used actually belongs; in addition, the embodiment of the present application does not limit the obtaining manner of the "actual language of the sample image to be used", and for example, the obtaining may be performed in a manner of manual labeling.
Step 12: and determining at least one sample text image and the position description information of the at least one sample text image according to the text detection result of the sample image to be used.
The text detection result of the sample image to be used is used for indicating the position of at least one text region in the sample image to be used; and the determination process of the "text detection result of the sample image to be used" is similar to the above determination process of the "text detection result of the image to be processed".
In addition, the above determination process of the "sample text image" is similar to the above determination process of the "text image to use"; also, the determination process of "position specification information of sample text image" is similar to the above determination process of "position specification information of text image to be used".
Step 13: and inputting at least one sample text image and the position description information of the at least one sample text image into the model to be trained to obtain the language identification result of the sample image to be used, which is output by the model to be trained.
The model to be trained is used for carrying out image language identification processing on input data of the model to be trained.
In addition, the embodiment of the present application does not limit the "model to be trained," and for example, the method may specifically include: language feature extraction network, density feature extraction network, color feature extraction network, position feature extraction network, feature concatenation network, and image language identification network. The input data of the image language identification network comprises output data of a feature splicing network; the input data of the feature splicing network comprises output data of a language feature extraction network, output data of a density feature extraction network, output data of a color feature extraction network and output data of a position feature extraction network.
The above-mentioned "language feature extraction network" is used to perform language feature extraction processing with respect to text image data (for example, each sample text image); the network structure of the "language feature extraction network" is not limited in the embodiments of the present application, and for example, the network structure may be implemented by using the model structure of the above "language feature extraction model".
The above-mentioned "density feature extraction network" is used to perform character density feature extraction processing for text image data (for example, each sample text image); the network structure of the "density feature extraction network" is not limited in the embodiments of the present application, and may be implemented by using the model structure of the "density feature extraction model" described above, for example.
The above-mentioned "color feature extraction network" is used to perform color distribution feature extraction processing for text image data (for example, each sample text image); the network structure of the "color feature extraction network" is not limited in the embodiments of the present application, and may be implemented using the model structure of the above "color feature extraction model", for example.
The above-mentioned "position feature extraction network" is used to perform image position feature extraction processing with respect to position description information of text image data (for example, each sample text image); the network structure of the "location feature extraction network" is not limited in the embodiments of the present application, and may be implemented using, for example, the model structure of the above "location feature extraction model".
The feature splicing network is used for splicing input data of the feature splicing network; in addition, the working principle of the "feature splicing network" is not limited in the embodiments of the present application, and for convenience of understanding, the following description is made with reference to an example.
As an example, when the "at least one sample text image" includes K sample text images, the working principle of the "feature splicing network" may specifically include: firstly, splicing the language extraction features of the kth sample text image, the character density features of the kth sample text image, the color distribution features of the kth sample text image, and the image position features of the kth sample text image to obtain the image extraction features of the kth sample text image, wherein k is a positive integer, k is less than or equal to K, and K is a positive integer; and then splicing the image extraction features of the 1st sample text image to the image extraction features of the Kth sample text image to obtain the language representation data of the sample image to be used (that is, the output result of the feature splicing network).
The image language identification network is used for carrying out language identification processing on input data of the image language identification network; the network structure of the "image language identification network" is not limited in the embodiments of the present application, and for example, it may be implemented by using the model structure of the above "image language identification model".
Based on the related content of step 13, after obtaining at least one sample text image and the position description information of the at least one sample text image, the at least one sample text image and the position description information thereof may be input into the model to be trained, so that the model to be trained performs image language identification processing with reference to the at least one sample text image and the position description information thereof, and obtains and outputs a language identification result of the sample image to be used.
Step 14: judging whether a preset stop condition is reached; if so, executing step 16; if not, executing step 15.
Wherein, the "preset stop condition" may be preset; the embodiment of the present application does not limit the "preset stop condition", for example, it may specifically be that the loss value of the model to be trained is lower than a first threshold; the change rate of the loss value of the model to be trained may also be lower than a second threshold (that is, the image language recognition performance of the model to be trained reaches convergence), and the number of times of updating the model to be trained may also reach a third threshold.
The loss value of the model to be trained is used for representing the image language identification performance of the model to be trained; in addition, the determination process of the loss value of the model to be trained is not limited in the embodiment of the application, and the determination process can be implemented by adopting any existing or future model loss value determination method.
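Purely for illustration, the three alternative stop conditions mentioned above could be checked as follows; the threshold values and function names are assumptions made for this sketch, not values prescribed by this application.

```python
def reached_stop_condition(loss_history, update_count,
                           first_threshold=1e-3,
                           second_threshold=1e-5,
                           third_threshold=100000):
    """Return True if any of the preset stop conditions is satisfied."""
    if loss_history and loss_history[-1] < first_threshold:
        return True  # loss value of the model to be trained is below the first threshold
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < second_threshold:
        return True  # change rate of the loss value is below the second threshold (convergence)
    return update_count >= third_threshold  # number of updates reaches the third threshold
```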
Step 15: updating the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and returning to continue to execute step 13.
In the embodiment of the application, after it is determined that the model to be trained in the current round has not reached the preset stop condition, it can be determined that the image language identification performance of the model to be trained is still poor. The model to be trained can therefore be updated according to the difference between the language identification result of the sample image to be used and the actual language of the sample image to be used, so that the updated model to be trained has better image language identification performance; step 13 and the subsequent steps are then executed again based on the updated model to be trained, so as to carry out a new training round for the model to be trained.
It should be noted that, the embodiment of the present application does not limit the updating process of the model to be trained, and may be implemented by using any existing or future model updating method.
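As one common possibility (not the only one), a single update round could be implemented with a cross-entropy loss over the candidate languages and a gradient-based optimizer; everything in this sketch, including the function signature, is an assumption made for illustration.

```python
import torch.nn.functional as F

def update_model(model, optimizer, sample_text_images, position_info, actual_language_id):
    """One training round: forward pass, loss against the actual language, parameter update."""
    optimizer.zero_grad()
    # Language identification result output by the model to be trained.
    language_logits = model(sample_text_images, position_info)
    # Difference between the identification result and the actual language.
    loss = F.cross_entropy(language_logits, actual_language_id)
    loss.backward()
    optimizer.step()
    return loss.item()
```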
Step 16: determining an image language identification model according to the model to be trained.
In the embodiment of the application, after it is determined that the model to be trained of the current round has reached the preset stop condition, it may be determined that the model to be trained has better image language recognition performance, so that the image language recognition model may be determined according to the model to be trained (for example, an image language recognition network in the model to be trained may be directly determined as the image language recognition model).
In addition, in some application scenarios, other models (for example, a language feature extraction model, a density feature extraction model, a color feature extraction model, a position feature extraction model, and the like) may also be determined by using the trained model to be trained. Based on this, step 16 may specifically include: determining a language feature extraction network in a model to be trained as a language feature extraction model; determining a density feature extraction network in a model to be trained as a density feature extraction model; determining a color feature extraction network in a model to be trained as a color feature extraction model; determining a position feature extraction network in a model to be trained as a position feature extraction model; and determining an image language identification network in the model to be trained as an image language identification model.
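If the model to be trained is implemented as a composite module that holds the six sub-networks, the determinations listed above amount to reading out the corresponding sub-modules after training; the attribute names below are hypothetical and only serve to illustrate the idea.

```python
def extract_models(trained_model):
    """Read out the standalone models from the trained composite model."""
    return {
        "language_feature_extraction_model": trained_model.language_net,
        "density_feature_extraction_model": trained_model.density_net,
        "color_feature_extraction_model": trained_model.color_net,
        "position_feature_extraction_model": trained_model.position_net,
        "image_language_identification_model": trained_model.recognition_net,
    }
```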
Based on the related contents of the above steps 11 to 16, in some cases, the language feature extraction model, the density feature extraction model, the color feature extraction model, the position feature extraction model, and the image language identification model may be constructed by means of a training process of one model, so that the image language identification method implemented based on these five models has a better image language identification effect.
In addition, in order to further improve the model building effect, the embodiment of the present application provides another possible implementation manner of the model building method; in this implementation manner, the model building method may further include, in addition to the above steps 11 to 16, steps 17 to 21:
Step 17: training a first model by using a first text image and the actual language features of the first text image, so that the trained first model has a better language feature extraction effect.
Step 18: training a second model by using a second text image and the actual density features of the second text image, so that the trained second model has a better character density feature extraction effect.
Step 19: training a third model by using a third text image and the actual color features of the third text image, so that the trained third model has a better color distribution feature extraction effect.
Step 20: training a fourth model by using the position description information of a fourth text image and the actual position features of the fourth text image, so that the trained fourth model has a better image position feature extraction effect.
Step 21: respectively initializing the language feature extraction network, the density feature extraction network, the color feature extraction network and the position feature extraction network in the model to be trained by using the trained first model, the trained second model, the trained third model and the trained fourth model.
It should be noted that the embodiment of the present application does not limit the specific implementation of step 21. For example, it may specifically include: determining the trained first model as the initialization processing result of the language feature extraction network in the model to be trained; determining the trained second model as the initialization processing result of the density feature extraction network in the model to be trained; determining the trained third model as the initialization processing result of the color feature extraction network in the model to be trained; and determining the trained fourth model as the initialization processing result of the position feature extraction network in the model to be trained.
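Assuming each pretrained model and the corresponding network in the model to be trained share the same architecture (so that their parameter dictionaries line up), the initialization of step 21 could be sketched as follows; this is only one possible realization, and the attribute names reuse the illustrative names assumed above.

```python
def initialize_model_to_train(model_to_train, first_model, second_model,
                              third_model, fourth_model):
    """Copy the pretrained parameters into the corresponding networks of the model to be trained."""
    model_to_train.language_net.load_state_dict(first_model.state_dict())
    model_to_train.density_net.load_state_dict(second_model.state_dict())
    model_to_train.color_net.load_state_dict(third_model.state_dict())
    model_to_train.position_net.load_state_dict(fourth_model.state_dict())
    return model_to_train
```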
Based on the related contents of the above steps 17 to 21, in some cases, the first model to the fourth model may be trained first; the language feature extraction network, the density feature extraction network, the color feature extraction network and the position feature extraction network in the model to be trained are then initialized by using the trained first model to the trained fourth model, so as to obtain an initialized model to be trained; then, the initialized model to be trained is trained by using the above steps 11 to 15 to obtain a trained model to be trained; and finally, the language feature extraction model, the density feature extraction model, the color feature extraction model, the position feature extraction model and the image language identification model are determined from the trained model to be trained.
Based on the image language identification method provided by the above method embodiment, the embodiment of the present application further provides an image language identification apparatus, which is explained and described below with reference to the accompanying drawings.
Device embodiment
Please refer to the above method embodiment for the technical details of the image language identification apparatus provided in the apparatus embodiment.
Referring to fig. 8, the figure is a schematic structural diagram of an image language identification apparatus according to an embodiment of the present application.
The image language identification apparatus 800 provided in the embodiment of the present application includes:
the image extraction unit 801 is configured to, after an image to be processed is acquired, extract N text images to be used from the image to be processed according to a text detection result of the image to be processed; wherein N is a positive integer;
a feature determining unit 802, configured to determine a language extraction feature of the nth to-be-used text image and a visual extraction feature of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
a feature processing unit 803, configured to determine an image extraction feature of the nth to-be-used text image according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
and a language identification unit 804, configured to determine a language identification result of the image to be processed according to the image extraction features of the N text images to be used.
In one possible implementation, the visual extraction features include at least one of a character density feature, a color distribution feature, and an image position feature.
In a possible implementation, the feature determining unit 802 includes:
the first determining subunit is configured to input the nth to-be-used text image into a pre-constructed density feature extraction model, so as to obtain a character density feature of the nth to-be-used text image output by the density feature extraction model;
the second determining subunit is configured to input the nth to-be-used text image into a pre-constructed color feature extraction model, so as to obtain a color distribution feature of the nth to-be-used text image output by the color feature extraction model;
and the third determining subunit is configured to input the position description information of the nth to-be-used text image into a pre-constructed position feature extraction model, so as to obtain the image position feature of the nth to-be-used text image output by the position feature extraction model.
In a possible implementation, the feature determining unit 802 includes:
and the fourth determining subunit is configured to input the nth to-be-used text image into a pre-constructed language feature extraction model, so as to obtain a language extraction feature of the nth to-be-used text image output by the language feature extraction model.
In one possible implementation, the visual extraction features include character density features, color distribution features, and image location features;
the feature processing unit 803 is specifically configured to: and splicing the language extraction features of the nth to-be-used text image, the character density features of the nth to-be-used text image, the color distribution features of the nth to-be-used text image and the image position features of the nth to-be-used text image to obtain the image extraction features of the nth to-be-used text image.
In a possible implementation manner, the language identification unit 804 is specifically configured to: splicing the image extraction features of the N text images to be used to obtain language representation data of the images to be processed; and inputting the language representation data into a pre-constructed image language identification model to obtain a language identification result of the image to be processed, which is output by the image language identification model.
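Put together, the behaviour of the language identification unit 804 at inference time can be sketched as follows, assuming the feature processing unit has already produced one image extraction feature per text image to be used; the names here are, again, illustrative assumptions only.

```python
import torch

def identify_language(image_extraction_features, image_language_model):
    """Splice the N image extraction features and classify the language of the image to be processed."""
    language_representation = torch.cat(image_extraction_features, dim=0)
    logits = image_language_model(language_representation)
    return int(torch.argmax(logits))  # index of the identified language
```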
In a possible implementation manner, the image language identification apparatus 800 further includes:
the model training unit is used for acquiring a sample image to be used and the actual language of the sample image to be used; determining at least one sample text image and position description information of the at least one sample text image according to the text detection result of the sample image to be used; inputting the at least one sample text image and the position description information of the at least one sample text image into a model to be trained to obtain a language identification result of the sample image to be used, which is output by the model to be trained; updating the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and continuing to execute the step of inputting the position description information of the at least one sample text image and the at least one sample text image into the model to be trained until the image language identification model is determined according to the model to be trained after a preset stop condition is reached.
In one possible implementation, the model to be trained includes a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature splicing network, and an image language identification network; wherein, the input data of the image language identification network comprises the output data of the feature splicing network; the input data of the feature splicing network comprises output data of the language feature extraction network, output data of the density feature extraction network, output data of the color feature extraction network and output data of the position feature extraction network;
the determining process of the image language identification model comprises the following steps: and determining the image language identification network in the model to be trained as the image language identification model.
In a possible implementation manner, the image language identification apparatus 800 further includes:
the model initialization unit is used for training a first model by utilizing a first text image and the actual language features of the first text image; training a second model by using a second text image and the actual density characteristic of the second text image; training a third model by using a third text image and the actual color features of the third text image; training a fourth model by using the position description information of a fourth text image and the actual position characteristic of the fourth text image; respectively initializing the language feature extraction network, the density feature extraction network, the color feature extraction network and the position feature extraction network in the model to be trained by using the trained first model, the trained second model, the trained third model and the trained fourth model.
Based on the related content of the image language identification apparatus 800, for the image language identification apparatus 800, after the image to be processed is obtained, N text images to be used are extracted from the image to be processed according to the text detection result of the image to be processed; then the language extraction features of the nth to-be-used text image and the visual extraction features of the nth to-be-used text image are determined, wherein n is a positive integer, n is not more than N, and N is a positive integer; then the image extraction features of the nth to-be-used text image are determined according to the language extraction features of the nth to-be-used text image and the visual extraction features of the nth to-be-used text image; and finally, the language identification result of the image to be processed is determined according to the image extraction features of the N text images to be used, so that the language identification result can accurately represent the language of the image to be processed, and the language of a piece of image data can be accurately identified.
Further, an embodiment of the present application further provides an apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to execute any implementation manner of the image language identification method provided by the embodiment of the application according to the computer program.
Further, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation manner of the image language identification method provided in the embodiment of the present application.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation of the image language identification method provided in the embodiment of the present application.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above by way of preferred embodiments, these embodiments are not intended to limit it. Using the methods and technical contents disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present invention, or amend it into equivalent embodiments of equivalent variations, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, which does not depart from the contents of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (13)

1. An image language identification method is characterized by comprising the following steps:
after an image to be processed is obtained, extracting N text images to be used from the image to be processed according to a text detection result of the image to be processed; wherein N is a positive integer;
determining language extraction features of an nth to-be-used text image and visual extraction features of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
determining the image extraction features of the nth to-be-used text image according to the language extraction features of the nth to-be-used text image and the visual extraction features of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
and determining the language identification result of the image to be processed according to the image extraction characteristics of the N text images to be used.
2. The method of claim 1, wherein the visual extraction features comprise at least one of a character density feature, a color distribution feature, and an image position feature.
3. The method according to claim 2, wherein the process of determining the character density feature of the nth text image to be used comprises:
inputting the nth to-be-used text image into a pre-constructed density feature extraction model to obtain the character density feature of the nth to-be-used text image output by the density feature extraction model;
the process for determining the color distribution characteristics of the nth text image to be used comprises the following steps:
inputting the nth to-be-used text image into a pre-constructed color feature extraction model to obtain the color distribution feature of the nth to-be-used text image output by the color feature extraction model;
the process for determining the image position feature of the nth text image to be used comprises the following steps:
and inputting the position description information of the nth to-be-used text image into a pre-constructed position feature extraction model to obtain the image position feature of the nth to-be-used text image output by the position feature extraction model.
4. The method according to claim 1, wherein the determination process of the language extraction features of the nth text image to be used comprises:
and inputting the nth to-be-used text image into a pre-constructed language feature extraction model to obtain the language extraction features of the nth to-be-used text image output by the language feature extraction model.
5. The method of claim 1, wherein the visual extraction features include character density features, color distribution features, and image position features;
the determining the image extraction features of the nth to-be-used text image according to the language extraction features of the nth to-be-used text image and the visual extraction features of the nth to-be-used text image includes:
and splicing the language extraction features of the nth to-be-used text image, the character density features of the nth to-be-used text image, the color distribution features of the nth to-be-used text image and the image position features of the nth to-be-used text image to obtain the image extraction features of the nth to-be-used text image.
6. The method according to claim 1, wherein the determining the language identification result of the image to be processed according to the image extraction features of the N text images to be used comprises:
splicing the image extraction features of the N text images to be used to obtain language representation data of the images to be processed;
and inputting the language representation data into a pre-constructed image language identification model to obtain a language identification result of the image to be processed, which is output by the image language identification model.
7. The method according to claim 6, wherein the construction process of the image language identification model comprises:
acquiring a sample image to be used and the actual language of the sample image to be used;
determining at least one sample text image and position description information of the at least one sample text image according to the text detection result of the sample image to be used;
inputting the at least one sample text image and the position description information of the at least one sample text image into a model to be trained to obtain a language identification result of the sample image to be used, which is output by the model to be trained;
updating the model to be trained according to the language identification result of the sample image to be used and the actual language of the sample image to be used, and continuing to execute the step of inputting the position description information of the at least one sample text image and the at least one sample text image into the model to be trained until the image language identification model is determined according to the model to be trained after a preset stop condition is reached.
8. The method according to claim 7, wherein the model to be trained comprises a language feature extraction network, a density feature extraction network, a color feature extraction network, a position feature extraction network, a feature concatenation network, and an image language identification network; wherein, the input data of the image language identification network comprises the output data of the feature splicing network; the input data of the feature splicing network comprises output data of the language feature extraction network, output data of the density feature extraction network, output data of the color feature extraction network and output data of the position feature extraction network;
the determining the image language identification model according to the model to be trained comprises:
and determining the image language identification network in the model to be trained as the image language identification model.
9. The method according to claim 8, wherein before the inputting the at least one sample text image and the position description information of the at least one sample text image into the model to be trained, the constructing process of the image language identification model further comprises:
training a first model by using a first text image and the actual language features of the first text image;
training a second model by using a second text image and the actual density characteristic of the second text image;
training a third model by using a third text image and the actual color features of the third text image;
training a fourth model by using the position description information of a fourth text image and the actual position characteristic of the fourth text image;
respectively initializing the language feature extraction network, the density feature extraction network, the color feature extraction network and the position feature extraction network in the model to be trained by using the trained first model, the trained second model, the trained third model and the trained fourth model.
10. An image language recognition apparatus, comprising:
the image extraction unit is used for extracting N text images to be used from the images to be processed according to the text detection result of the images to be processed after the images to be processed are obtained; wherein N is a positive integer;
the characteristic determining unit is used for determining language extraction characteristics of the nth to-be-used text image and visual extraction characteristics of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
the feature processing unit is used for determining the image extraction feature of the nth to-be-used text image according to the language extraction feature of the nth to-be-used text image and the visual extraction feature of the nth to-be-used text image; wherein n is a positive integer, and n is not more than N;
and the language identification unit is used for determining a language identification result of the image to be processed according to the image extraction characteristics of the N text images to be used.
11. An apparatus, comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the method of any one of claims 1-9 in accordance with the computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-9.
13. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-9.
CN202111138638.8A 2021-09-27 2021-09-27 Image language identification method and related equipment thereof Pending CN113822275A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111138638.8A CN113822275A (en) 2021-09-27 2021-09-27 Image language identification method and related equipment thereof
PCT/CN2022/116011 WO2023045721A1 (en) 2021-09-27 2022-08-31 Image language identification method and related device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111138638.8A CN113822275A (en) 2021-09-27 2021-09-27 Image language identification method and related equipment thereof

Publications (1)

Publication Number Publication Date
CN113822275A true CN113822275A (en) 2021-12-21

Family

ID=78921418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138638.8A Pending CN113822275A (en) 2021-09-27 2021-09-27 Image language identification method and related equipment thereof

Country Status (2)

Country Link
CN (1) CN113822275A (en)
WO (1) WO2023045721A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615252B (en) * 2008-06-25 2012-07-04 中国科学院自动化研究所 Method for extracting text information from adaptive images
US8744196B2 (en) * 2010-11-26 2014-06-03 Hewlett-Packard Development Company, L.P. Automatic recognition of images
CN105760901B (en) * 2016-01-27 2019-01-04 南开大学 A kind of automatic language method of discrimination of multilingual inclination file and picture
CN111027528B (en) * 2019-11-22 2023-10-03 华为技术有限公司 Language identification method, device, terminal equipment and computer readable storage medium
CN113822275A (en) * 2021-09-27 2021-12-21 北京有竹居网络技术有限公司 Image language identification method and related equipment thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339787A (en) * 2018-12-17 2020-06-26 北京嘀嘀无限科技发展有限公司 Language identification method and device, electronic equipment and storage medium
CN111507344A (en) * 2019-01-30 2020-08-07 北京奇虎科技有限公司 Method and device for recognizing characters from image
CN111488826A (en) * 2020-04-10 2020-08-04 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and storage medium
CN112101354A (en) * 2020-09-23 2020-12-18 广州虎牙科技有限公司 Text recognition model training method, text positioning method and related device
CN112613502A (en) * 2020-12-28 2021-04-06 深圳壹账通智能科技有限公司 Character recognition method and device, storage medium and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HOU, Yueyun, et al., "Text Image Language Identification Technology", Computer Technology (《计算机技术》), vol. 26, pages 29-31 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023045721A1 (en) * 2021-09-27 2023-03-30 北京有竹居网络技术有限公司 Image language identification method and related device thereof

Also Published As

Publication number Publication date
WO2023045721A1 (en) 2023-03-30

Similar Documents

Publication Publication Date Title
US11227147B2 (en) Face image processing methods and apparatuses, and electronic devices
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
CN111858843B (en) Text classification method and device
CN113343982B (en) Entity relation extraction method, device and equipment for multi-modal feature fusion
CN113656582B (en) Training method of neural network model, image retrieval method, device and medium
EP3961584A2 (en) Character recognition method, model training method, related apparatus and electronic device
CN113657395B (en) Text recognition method, training method and device for visual feature extraction model
CN112288831A (en) Scene image generation method and device based on generation countermeasure network
CN113591566A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN113204615A (en) Entity extraction method, device, equipment and storage medium
CN114021646A (en) Image description text determination method and related equipment thereof
CN110399547B (en) Method, apparatus, device and storage medium for updating model parameters
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN113205041A (en) Structured information extraction method, device, equipment and storage medium
CN113177449A (en) Face recognition method and device, computer equipment and storage medium
CN113360700A (en) Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN117197904A (en) Training method of human face living body detection model, human face living body detection method and human face living body detection device
CN116152833A (en) Training method of form restoration model based on image and form restoration method
CN108428234B (en) Interactive segmentation performance optimization method based on image segmentation result evaluation
CN114548274A (en) Multi-modal interaction-based rumor detection method and system
CN114445826A (en) Visual question answering method and device, electronic equipment and storage medium
CN113822275A (en) Image language identification method and related equipment thereof
CN114495101A (en) Text detection method, and training method and device of text detection network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination