CN112613572A

CN112613572A - Sample data obtaining method and device, electronic equipment and storage medium

Info

Publication number: CN112613572A
Application number: CN202011608431.8A
Authority: CN
Inventors: 尹天舒
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-04-06
Anticipated expiration: 2040-12-30
Also published as: CN112613572B

Abstract

The embodiment of the invention provides a sample data obtaining method, a sample data obtaining device, electronic equipment and a storage medium, and relates to the technical field of computers, wherein the method comprises the following steps: obtaining a plurality of groups of continuous characters from a text corpus obtained in advance, and taking each group of continuous characters as a sample title, wherein first distribution information corresponding to the obtained sample title is consistent with first reference information, and the first distribution information represents: the title number distribution of sample titles with different character numbers, the first reference information characterizes: the distribution condition of the number of titles of real titles with different numbers of characters; adjusting the character style of the obtained sample title by referring to the character style of the real title; and acquiring an image of which the content contains the adjusted sample title as sample data for each adjusted sample title. By applying the sample data obtaining scheme provided by the embodiment of the invention, the sample data obtaining efficiency can be improved.

Description

Sample data obtaining method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for obtaining sample data, an electronic device, and a storage medium.

Background

When information such as news, articles, and advertisements is checked, it is generally necessary to identify the title of the information.

Conventionally, information in a text format is scanned to obtain an image containing the title of the information, and character recognition is then performed on the scanned image using an OCR (Optical Character Recognition) model to obtain the title. Training an OCR model requires sample data. Currently, images containing titles of information are commonly collected from various sources, for example by taking screenshots of web pages containing such titles or by photographing newspapers and magazines containing such titles, and the collected images are used as sample data.

Although sample data can be obtained in this way, a large amount of manpower and material resources is needed to locate content containing titles of information, which results in low efficiency in obtaining sample data.

Disclosure of Invention

The embodiment of the invention aims to provide a sample data obtaining method, a sample data obtaining device, electronic equipment and a storage medium, so as to improve the sample data obtaining efficiency. The specific technical scheme is as follows:

in a first aspect of the present invention, there is provided a method for obtaining sample data, the method including:

obtaining a plurality of groups of continuous characters from a text corpus obtained in advance, and taking each group of continuous characters as a sample title, wherein first distribution information corresponding to the obtained sample title is consistent with first reference information, and the first distribution information represents: the title number distribution of sample titles with different character numbers, the first reference information characterizes: the distribution condition of the number of titles of real titles with different numbers of characters;

adjusting the character style of the obtained sample title by referring to the character style of the real title;

and acquiring an image of which the content contains the adjusted sample title as sample data for each adjusted sample title.

In an embodiment of the present invention, in a case that the character style includes a character size, the adjusting the obtained character style of the sample title with reference to the character style of the real title includes:

adjusting the size of the obtained characters of the sample title to make second distribution information corresponding to the adjusted sample title consistent with second reference information, wherein the second distribution information represents: the number distribution of titles of sample titles with different character sizes, and the second reference information characterizes: the distribution of the number of titles of real titles of different character sizes.

In an embodiment of the present invention, in a case that the character style includes a character font, the adjusting the obtained character style of the sample title with reference to the character style of the real title includes:

and adjusting the character fonts of the obtained sample titles to reduce the difference between the proportion occupied by the sample titles of the different adjusted character fonts and the proportion occupied by the real titles of the different character fonts in the real titles.

In an embodiment of the present invention, in a case that the character style includes a character color, the adjusting the obtained character style of the sample title with reference to the character style of the real title includes:

selecting at least one character color from the character colors existing in the real title as a reference color;

and setting the character color of each sample title as at least one color in the reference colors.

In an embodiment of the present invention, the obtaining, for each adjusted sample title, an image whose content includes the adjusted sample title as sample data includes:

and for each adjusted sample title, overlapping the adjusted sample title on a preset background image to obtain a composite image, and capturing an image area containing the adjusted sample title in the composite image as sample data.

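In an embodiment of the present invention, the obtaining, for each adjusted sample title, an image whose content includes the adjusted sample title as sample data includes:

for each adjusted sample title, obtaining an image whose content contains the adjusted sample title, preprocessing the image, and taking the preprocessed image as sample data, wherein the preprocessing makes the image more closely resemble a real image containing a real title.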

In a second aspect of the present invention, there is also provided a sample data obtaining apparatus, including:

a sample title obtaining module, configured to obtain multiple groups of continuous characters from a text corpus obtained in advance, and use each group of continuous characters as a sample title, where first distribution information corresponding to the obtained sample title is consistent with first reference information, and the first distribution information represents: the title number distribution of sample titles with different character numbers, the first reference information characterizes: the distribution condition of the number of titles of real titles with different numbers of characters;

the character style adjusting module is used for adjusting the character style of the obtained sample title by referring to the character style of the real title;

and the sample data obtaining module is used for obtaining an image of which the content contains the adjusted sample title as sample data for each adjusted sample title.

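In an embodiment of the present invention, when the character style includes a character size, the character style adjusting module is specifically configured to:

adjust the character size of the obtained sample titles so that second distribution information corresponding to the adjusted sample titles is consistent with second reference information, wherein the second distribution information represents the distribution of the number of sample titles having different character sizes, and the second reference information represents the distribution of the number of real titles having different character sizes.

In an embodiment of the present invention, when the character style includes a character font, the character style adjusting module is specifically configured to:

adjust the character fonts of the obtained sample titles so as to reduce the difference between the proportions of adjusted sample titles having different character fonts and the proportions of real titles having different character fonts.

In an embodiment of the present invention, when the character style includes a character color, the character style adjusting module is specifically configured to:

select at least one character color from the character colors existing in the real titles as a reference color, and set the character color of each sample title to at least one of the reference colors.

In an embodiment of the present invention, the sample data obtaining module is specifically configured to:

for each adjusted sample title, superimpose the adjusted sample title on a preset background image to obtain a composite image, and capture an image area containing the adjusted sample title in the composite image as sample data.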

In a third aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of the first aspect when executing a program stored in the memory.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when run on a computer, cause the computer to execute any one of the sample data obtaining methods described above.

In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the sample data obtaining methods described above.

When the scheme provided by the embodiment of the invention is applied to obtain sample data, a plurality of groups of continuous characters are obtained from a text corpus obtained in advance, and each group of continuous characters is used as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information represents the distribution of the number of sample titles having different numbers of characters, and the first reference information represents the distribution of the number of real titles having different numbers of characters. The character style of the obtained sample titles is then adjusted with reference to the character style of the real titles, and, for each adjusted sample title, an image whose content contains the adjusted sample title is obtained as sample data. Because the groups of continuous characters are selected directly from the text corpus and the sample data is generated from the selected characters, little manpower and material resources are consumed. Therefore, by applying the sample data obtaining scheme provided by the embodiment of the invention, the sample data obtaining efficiency can be improved.

In addition, according to the distribution situation of the number of titles of the real titles with different numbers of characters, sample titles with different numbers of characters, which accord with the distribution situation of the number of titles, are obtained, and according to the character style of the real titles, the character style of the sample titles is adjusted, so that the obtained sample titles are close to the real titles, and further, sample data obtained according to the sample titles are close to the real data. Therefore, by applying the sample data obtaining scheme provided by the embodiment of the invention, the accuracy of the obtained sample data can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

FIG. 1 is a schematic flowchart of a sample data obtaining method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a real title distribution according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of sample data according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of another sample data obtaining method according to an embodiment of the present invention;

FIG. 5 is a schematic flowchart of a model training method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a sample data obtaining apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

Because the prior art has the problem of low sample data obtaining efficiency, embodiments of the present invention provide a sample data obtaining method, a sample data obtaining device, an electronic device, and a storage medium to solve the technical problem.

In an embodiment of the present invention, a sample data obtaining method is provided, including:

obtaining a plurality of groups of continuous characters from a text corpus obtained in advance, and taking each group of continuous characters as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information represents the distribution of the number of sample titles having different numbers of characters, and the first reference information represents the distribution of the number of real titles having different numbers of characters;

adjusting the character style of the obtained sample titles with reference to the character style of the real titles;

and, for each adjusted sample title, obtaining an image whose content contains the adjusted sample title as sample data.

When the scheme provided by the embodiment is applied to obtain the sample data, a plurality of groups of continuous characters are directly selected from the text corpus, and the sample data can be obtained according to the selected characters, so that the consumption of manpower and material resources is low. Therefore, by applying the sample data obtaining scheme provided by the embodiment of the invention, the sample data obtaining efficiency can be improved.

In addition, according to the distribution situation of the number of titles of the real titles with different numbers of characters, sample titles with different numbers of characters, which accord with the distribution situation of the number of titles, are obtained, and according to the character style of the real titles, the character style of the sample titles is adjusted, so that the obtained sample titles are close to the real titles, and further, sample data obtained according to the sample titles are close to the real data. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the accuracy of the obtained sample data can be improved.

In an application scenario of the embodiment of the present invention, the sample data obtained by applying the sample data obtaining method may be used to train an OCR model, and the trained OCR model may be used to perform character recognition on the title content in the image.

The following describes in detail a sample data obtaining method, an apparatus, an electronic device, and a storage medium according to embodiments of the present invention.

Referring to fig. 1, fig. 1 is a schematic flow chart of a sample data obtaining method according to an embodiment of the present invention, where the method may be applied to an electronic device such as a desktop computer, a notebook computer, a tablet computer, and the like, and is not limited in particular.

The method comprises the following steps 101 to 103.

Step 101, obtaining a plurality of groups of continuous characters from a text corpus obtained in advance, and using each group of continuous characters as a sample title.

The text corpus may be text contained in novels, magazines, newspapers, and the like. The continuous characters may be a continuous character string in the text corpus, and may be English characters, numeric characters, Chinese characters, and the like, such as "forigners encoded to invent in more industries", "blowing-on non-cold salix mongolica", and the like.

The first distribution information corresponding to the obtained sample titles is consistent with the first reference information.

The first distribution information represents the distribution of the number of sample titles having different numbers of characters. It may be expressed as the ratio of the number of sample titles with each number of characters to the total number of sample titles. For example, the first distribution information may be: sample titles with 20 characters account for 25% of the sample titles, sample titles with 25 characters account for 30%, sample titles with 30 characters account for 30%, and sample titles with 35 characters account for 15%.

In addition, the first distribution information may be represented by a normal distribution; specifically, it may be obtained from the mean and variance of the numbers of characters of the sample titles. When the number of sample titles is large, a normal distribution represents their distribution more clearly.

The first reference information represents the distribution of the number of real titles having different numbers of characters. It may be expressed as the ratio of the number of real titles with each number of characters to the total number of real titles. For example, the first reference information may be: real titles with 20 characters account for 15% of the real titles, real titles with 25 characters account for 30%, real titles with 30 characters account for 35%, and real titles with 35 characters account for 20%.
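The ratio form of the distribution information described above can be sketched as follows. This is a minimal illustration only: the function names `length_distribution` and `is_consistent`, and the tolerance used to decide that two distributions are "consistent", are assumptions and are not specified by the embodiment.

```python
from collections import Counter


def length_distribution(titles):
    """Return {number_of_characters: share_of_titles} for a list of title strings."""
    if not titles:
        return {}
    counts = Counter(len(title) for title in titles)
    total = len(titles)
    return {n: c / total for n, c in sorted(counts.items())}


def is_consistent(sample_dist, reference_dist, tolerance=0.01):
    """Hypothetical consistency test: every share differs by at most `tolerance`."""
    lengths = set(sample_dist) | set(reference_dist)
    return all(
        abs(sample_dist.get(n, 0.0) - reference_dist.get(n, 0.0)) <= tolerance
        for n in lengths
    )
```

The same function can be applied to real titles to obtain the first reference information and to sample titles to obtain the first distribution information, after which the two results can be compared.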

The first reference information may also be represented by a statistical normal distribution, and specifically, the mean and the variance of the number of characters of each real title may be counted, and then the normal distribution of the number of titles of real titles with different numbers of characters may be determined according to the mean and the variance, and used as the first reference information.

For example, assuming that 500 real titles are counted, the number of titles of real titles with different number of characters is shown in the following table 1:

TABLE 1

As can be seen from table 1, in 500 real titles, the number of titles of a real title with 18 characters is 15, the number of titles of a real title with 20 characters is 35, the number of titles of a real title with 22 characters is 50, and so on, the number of titles of a real title with 34 characters is 10.

Referring to fig. 2, fig. 2 is a schematic diagram of a real title distribution according to an embodiment of the present invention. By plotting the number of titles of the real titles with different numbers of characters in table 1, the bar chart shown in fig. 2 can be obtained as a schematic diagram of the distribution of the real titles. In the bar chart shown in fig. 2, the abscissa indicates the number of characters, and the ordinate indicates the number of titles of real titles of the respective numbers of characters. As can be seen from fig. 2, the number of characters of the 500 real titles is mostly distributed near 26, and the number of the real titles far below 26 characters or far more than 26 characters is small, so that the obtained 500 real titles with different numbers of characters conform to a normal distribution.

The mean and variance of the number of characters of the 500 real titles may be calculated and substituted into the normal distribution formula:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

where x represents the number of characters, μ represents the mean, and σ² represents the variance (σ being the standard deviation). Substituting the mean and the variance into the normal distribution formula gives the distribution of the number of real titles having different numbers of characters, from which the first reference information is obtained.
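As a rough sketch of this calculation, the following code fits the mean and variance to a list of character counts and evaluates the normal density at a set of candidate title lengths. The function names are hypothetical, and normalising the density over the discrete candidate lengths so that the shares sum to 1 is an assumption made here for convenience, not something the embodiment prescribes.

```python
import math
from statistics import fmean, pvariance


def fit_normal(char_counts):
    """Return (mean, variance) of the character counts of the real titles."""
    mu = fmean(char_counts)
    var = pvariance(char_counts, mu)
    return mu, var


def normal_pdf(x, mu, var):
    """Normal density with mean `mu` and variance `var` evaluated at `x`."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def reference_shares(lengths, mu, var):
    """Share of titles per candidate length, normalised to sum to 1 (an assumption)."""
    weights = {n: normal_pdf(n, mu, var) for n in lengths}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}
```

For instance, `reference_shares(range(18, 36, 2), *fit_normal(counts))` would give one share per even title length from 18 to 34, where `counts` is the list of character counts of the collected real titles.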

In addition to this, the first distribution information or the first reference information may be represented by a poisson distribution, a binomial distribution, or the like in statistics. The embodiments of the present invention are not limited thereto.

Specifically, the distribution of the number of titles of actual titles with different numbers of characters may be obtained in advance as the first reference information, then the number of titles of sample titles with different numbers of characters to be obtained is determined according to the first reference information and the number of preset total sample titles, and according to the determined number of titles, consecutive characters with different numbers of characters are obtained from the text corpus, so as to obtain a sample title whose corresponding first distribution information is consistent with the first reference information.

For example, assume that the first reference information is: the title number ratio of the real title with the number of characters of 18 is 45%, the title number ratio of the real title with the number of characters of 12 is 55%, and the number of the preset total sample titles is 20000, then it can be determined that the title number of the sample title with the number of characters of 18 is 9000, and the title number of the sample title with the number of characters of 12 is 11000 in the sample title to be obtained, so that 9000 groups of 18-character-number continuous characters can be obtained from the text corpus, and 11000 groups of 12-character-number continuous characters can be obtained, thereby obtaining the sample title of which the corresponding first distribution information is consistent with the first reference information.
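The arithmetic in this example can be sketched in code. The function name `allocate_counts` is hypothetical, and the largest-remainder rounding used when the shares do not divide the total evenly is an assumption; the embodiment only states that the counts are determined from the first reference information and the preset total.

```python
def allocate_counts(reference_shares, total_samples):
    """Split `total_samples` across title lengths in proportion to `reference_shares`.

    Uses largest-remainder rounding so that the counts always sum to
    `total_samples` (the rounding rule is an assumption, not from the source).
    """
    exact = {n: share * total_samples for n, share in reference_shares.items()}
    counts = {n: int(v) for n, v in exact.items()}
    leftover = max(total_samples - sum(counts.values()), 0)
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


# The example from the text: 45% of titles have 18 characters, 55% have 12.
print(allocate_counts({18: 0.45, 12: 0.55}, 20000))
```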

The first reference information can represent the distribution condition of the number of titles of real titles with different numbers of characters, and the obtained first distribution information corresponding to the sample title is consistent with the first reference information, that is, the number of characters of each group of characters is close to the number of characters in the real title, so that the number of characters of the obtained sample title is close to the number of characters in the real title. Therefore, the sample data obtained according to the sample title is closer to the real data, and the accuracy of the obtained sample data is improved.

Step 102, adjusting the character style of the obtained sample titles with reference to the character style of the real titles.

The character style may include at least one of character font, character size, character color, character interval, character typesetting mode, and the like.

In the case where the characters are Chinese characters, the character font may be KaiTi (regular script), SongTi, LiShu (clerical script), or the like; in the case where the characters are English characters, the character font may be a rounded typeface, an italic typeface, a roman typeface, or the like.

The character size may be expressed in units of pixels, such as 32 pixels by 52 pixels, 40 pixels by 60 pixels, and the like, and may also be expressed in terms of width values or height values in units of pixels, such as 32 pixels in height values, 40 pixels in width values, and the like. The character size may also be expressed in font No. 4, font No. 2, font No. 17, and the like.

The character color may be black, blue, red, etc., and may also be a gradient color, a spot color, a stripe color, etc.

The character spacing may be a fixed value, such as 20 points, 15 points, or 30 points, or may be single spacing, 1.5 times spacing, 2.0 times spacing, or the like.

The character typesetting mode may be horizontal or vertical.

Specifically, the character style of the real title may be obtained in advance, for example, the character font, the character size, and the like of the real title may be obtained in advance, and the obtained character style of the sample title, such as the character font, the character size, and the like, is adjusted with reference to the character style of the real title, so that the character style of the sample title approaches the character style of the real title, and further, when sample data is obtained according to the sample title, the accuracy of the obtained sample data is improved.
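One possible way to make the styles of the sample titles approach those of the real titles is to draw each title's font and size at random, weighted by the shares observed in the real titles, and to pick its color from the reference colors. The sketch below assumes this weighted-sampling approach; the function name `assign_styles`, the dictionary layout, and the fixed seed are all hypothetical.

```python
import random


def assign_styles(titles, font_shares, size_shares, reference_colors, seed=0):
    """Attach a font, size and color to each sample title.

    `font_shares` and `size_shares` map a style value to its share among the
    real titles; weighted random draws make the sample shares approach them.
    """
    rng = random.Random(seed)
    fonts, font_weights = zip(*font_shares.items())
    sizes, size_weights = zip(*size_shares.items())
    styled = []
    for title in titles:
        styled.append({
            "text": title,
            "font": rng.choices(fonts, weights=font_weights, k=1)[0],
            "size": rng.choices(sizes, weights=size_weights, k=1)[0],
            "color": rng.choice(reference_colors),
        })
    return styled
```

With random draws the sample shares only approximate the reference shares; if an exact match is required, the counts per style could instead be fixed in advance and the styles dealt out deterministically.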

Step 103, for each adjusted sample title, obtaining an image whose content includes the adjusted sample title as sample data.

Specifically, for each adjusted sample title, image scanning, screenshot, image acquisition, and the like may be performed on the adjusted sample title, so as to obtain an image whose content includes the adjusted sample title as sample data.

In an embodiment of the present invention, when sample data is obtained, for each adjusted sample title, the adjusted sample title may be superimposed on a preset background image to obtain a composite image, and an image area including the adjusted sample title in the composite image is captured as the sample data.

The background image may be an image filled with a preset filling color, and the preset filling color may be black, red, blue, or the like. The background image may also be an image with a preset filling effect, and the preset filling effect may be a gradual change effect, a texture effect, or the like. Therefore, sample data obtained by superimposing the sample title on the background image is closer to a real image containing the real title, and the accuracy of the model obtained by training the sample data is higher.

In an embodiment of the present invention, the background image may be obtained according to the background image on which a real title is located. Specifically, a commonly used title background image may be obtained directly from a shared database as the background image on which the sample title is superimposed. Alternatively, a real image containing a real title may be collected, the real title content removed from the real image, the removed region filled in according to the background pattern of the real image, and the filled image used as the background image on which the sample title is superimposed.

The image area may be a rectangular area, a circular area, an elliptical area, or the like. Taking a rectangular area as an example, the image area may be the minimum circumscribed rectangle of the sample title in the composite image, or a rectangular area extending a preset width beyond the sample title.
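For the rectangular case, the area to capture can be computed from the bounding box of the rendered title. The sketch below assumes boxes are given as `(left, top, right, bottom)` pixel coordinates and that the padded rectangle is clamped to the image bounds; the function name `crop_box` and the clamping behaviour are assumptions.

```python
def crop_box(text_box, image_size, margin=0):
    """Return the rectangle to capture around a title.

    `text_box` is the minimum circumscribed rectangle of the title as
    (left, top, right, bottom); `margin` is the preset width added on each
    side; the result is clamped so it never leaves the composite image.
    """
    left, top, right, bottom = text_box
    width, height = image_size
    return (
        max(left - margin, 0),
        max(top - margin, 0),
        min(right + margin, width),
        min(bottom + margin, height),
    )
```

With `margin=0` the function returns the minimum circumscribed rectangle itself; a positive margin gives the rectangle spaced from the title by the preset width.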

Specifically, the sample title is superimposed on a preset background image, so that the obtained composite image is close to the real image where the real title is located. Then, an image area containing a sample title is cut out from the synthesized image as sample data. Therefore, the content except the sample title contained in the sample data can be reduced, and the redundancy is reduced.

In an embodiment of the present invention, in order to facilitate the subsequent training of the OCR model by using sample data, when an image area including a sample title is cut out from a composite image, an image area where characters of each line in the sample title are located may be respectively cut out, and each cut-out image area is used as sample data. Each sample data obtained in this way only contains one line of characters, and when the OCR model is trained, the OCR model is convenient to recognize the characters in the sample data.

For example, referring to fig. 3, fig. 3 is a schematic diagram of sample data according to an embodiment of the present invention. Wherein, the image filled with the texture effect is a background image, and "Xi vows tough foundation texture polar to guest environmental advance" is a sample title. It can be seen that the above sample titles are distributed in the form of two lines of characters. Since the composite image is obtained by superimposing the sample title on the background image, that is, all the characters in the composite image belong to the sample title, the image region where the characters of each line are located can be respectively intercepted, the region surrounded by each dashed frame in fig. 3 is the image region containing one line of characters, and each dashed frame region is intercepted, so that sample data can be obtained.
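One way to locate the per-line regions is to scan the rows of a binary text mask and group consecutive rows that contain title pixels. This is only a sketch under the assumption that such a mask (1 for a title pixel, 0 for background) is available from the step that superimposes the title; the function name `line_bands` is hypothetical, and a renderer that already knows each line's position would not need this scan.

```python
def line_bands(mask):
    """Return [(top, bottom)] row ranges (bottom exclusive), one per text line.

    `mask` is a list of rows, each a list of 0/1 values where 1 marks a pixel
    belonging to the superimposed title.
    """
    bands = []
    start = None
    for y, row in enumerate(mask):
        has_text = any(row)
        if has_text and start is None:
            start = y                      # first row of a new text line
        elif not has_text and start is not None:
            bands.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:                  # text reaches the last row
        bands.append((start, len(mask)))
    return bands
```

Each returned band corresponds to one dashed-frame region; cropping the composite image to each band yields one single-line sample.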

In an embodiment of the present invention, when the sample title is superimposed on the preset background image, the sample title may be directly superimposed on the background image in a single-line distribution manner, so that when an image area of a single-line character is cut from the composite image, all characters included in the sample title may be cut.

In an embodiment of the present invention, when obtaining sample data, for each adjusted sample title, an image whose content contains the adjusted sample title may be obtained, the image may be preprocessed, and the preprocessed image may be used as sample data, where the preprocessing makes the image more closely resemble a real image containing a real title.

The preprocessing may include a blurring process such as a gaussian blurring process, a motion blurring process, or the like, and may further include a perspective transformation process, a noise addition process, a texture addition process, or the like.
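Two of these preprocessing steps can be sketched on a grayscale image stored as a list of rows of 0-255 integers. Note the substitutions: a 3x3 box blur stands in for the Gaussian or motion blur named above, and the noise is additive Gaussian noise with a fixed seed; the function names, the kernel size, and the default `sigma` are assumptions.

```python
import random


def box_blur(img):
    """3x3 box blur (a simple stand-in for Gaussian blur); edges use fewer pixels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[yy][xx]
                for yy in range(max(y - 1, 0), min(y + 2, h))
                for xx in range(max(x - 1, 0), min(x + 2, w))
            ]
            out[y][x] = round(sum(vals) / len(vals))
    return out


def add_noise(img, sigma=8.0, seed=0):
    """Add Gaussian noise to every pixel and clip the result to 0..255."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in img
    ]
```

The two functions can be chained, for example `add_noise(box_blur(image))`, to produce a degraded variant of each rendered title image.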

Therefore, the image of the sample data is closer to the image containing the real title, and when the OCR model is trained by using the sample data subsequently, the recognition effect of the trained OCR model on the real image containing the title can be improved, and the robustness of the OCR model is improved.

In an embodiment of the application, for each sample title, the sample title may be first superimposed on a preset background image to obtain a composite image, an image area including the sample title in the composite image is captured, the image area is preprocessed, and the preprocessed image area is used as sample data. The sample data thus obtained more closely resembles a real image containing a real title.

When the scheme provided by the above embodiment is applied to obtain sample data, a plurality of groups of continuous characters are obtained from a text corpus obtained in advance, and each group of continuous characters is used as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information represents the distribution of the number of sample titles having different numbers of characters, and the first reference information represents the distribution of the number of real titles having different numbers of characters. The character style of the obtained sample titles is then adjusted with reference to the character style of the real titles, and, for each sample title, an image whose content contains the sample title is obtained as sample data. Because the groups of continuous characters are selected directly from the text corpus and the sample data is generated from the selected characters, little manpower and material resources are consumed. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the sample data obtaining efficiency can be improved.

In an embodiment of the present invention, in a case where the sample data obtained by applying the above scheme is used for training an OCR model, the obtained sample data may be used as input and the sample title used for obtaining each piece of sample data may be used as its label to train an initial OCR model, finally obtaining an OCR model capable of recognizing the title content in an image. Because the sample data is obtained from the sample titles, the sample titles can be used directly as labels during model training, and the title content in the sample data does not need to be identified manually, which saves labeling cost.

When the trained OCR model is to be used to recognize Chinese titles, the text corpus in step 101 may be text from Chinese novels, magazines, newspapers, or the like; when the trained OCR model is to be used to recognize English titles, the text corpus may be text from English novels, magazines, newspapers, or the like. Similarly, the text corpus may be a Japanese text corpus, a Russian text corpus, or the like.

In an embodiment of the present invention, for step 101, when obtaining the sample title, a group of consecutive characters may be randomly selected from the text corpus obtained in advance as a sample title. For example, in the case where the text corpus is a novel, a group of continuous characters may be selected at any position in the content of the novel as a sample title. The number of the selected continuous characters may be the number of characters included in the real title, or the number of characters determined based on the distribution of the number of titles of real titles with different numbers of characters in the first reference information.
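A minimal sketch of random selection follows. It assumes the corpus is held in memory as a single string; the function name `pick_random_title` and the example corpus are illustrative only.

```python
import random


def pick_random_title(corpus: str, num_chars: int, rng: random.Random) -> str:
    """Return `num_chars` consecutive characters starting at a random position."""
    if not 0 < num_chars <= len(corpus):
        raise ValueError("num_chars must be between 1 and len(corpus)")
    start = rng.randrange(len(corpus) - num_chars + 1)
    return corpus[start:start + num_chars]


corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank"
rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(pick_random_title(corpus, 12, rng))
```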

In an embodiment of the present invention, the text corpus may also be segmented at intervals of a certain character length, and each group of continuous characters obtained by segmentation is used as a sample title. For example, in the case that the text corpus is a magazine, the magazine content may be segmented at intervals of a certain character length to obtain a plurality of groups of characters. The character length may be the number of characters contained in a real title, or a number of characters determined from the distribution, given by the first reference information, of the number of real titles over different character counts.
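Fixed-interval segmentation can be sketched as below. Dropping a trailing fragment shorter than the requested length is an assumption of this sketch, not something the embodiment prescribes.

```python
def segment_corpus(corpus: str, length: int):
    """Split the corpus into consecutive groups of exactly `length` characters.

    A trailing fragment shorter than `length` is discarded (an assumption).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [corpus[i:i + length]
            for i in range(0, len(corpus) - length + 1, length)]


print(segment_corpus("abcdefghij", 3))
```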

In an embodiment of the present invention, when the first reference information is expressed as the proportion of real titles having each character count, groups of characters with different character counts can be obtained from the pre-obtained text corpus according to the proportion corresponding to each character count. For example, assuming that real titles with 45 characters account for 30% of all real titles and that a total of 1000 pieces of sample data, and therefore 1000 groups of characters, are needed, 300 groups of continuous characters with 45 characters each can be obtained from the pre-obtained text corpus.
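The arithmetic in this example can be sketched as a quota computation. Only the 45-character, 30% figure comes from the text; the other character counts and percentages are made up for illustration, and the largest-remainder rounding is an assumption used so that the quotas always sum to the requested total.

```python
def quotas_from_percentages(percent_by_length, total):
    """Turn {character_count: integer_percent} into {character_count: group_count}."""
    if sum(percent_by_length.values()) != 100:
        raise ValueError("percentages must sum to 100")
    quotas = {n: total * p // 100 for n, p in percent_by_length.items()}
    remainders = {n: total * p % 100 for n, p in percent_by_length.items()}
    leftover = total - sum(quotas.values())
    # Hand out the groups lost to rounding, largest remainder first.
    for n in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[n] += 1
    return quotas


# 45 characters -> 30% is from the text; 20 and 10 characters are hypothetical.
print(quotas_from_percentages({45: 30, 20: 50, 10: 20}, 1000))
```

With 1000 groups in total, the 45-character entry receives 300 groups, matching the example above.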

In an embodiment of the present invention, when the first reference information is represented by a normal distribution, a random number conforming to the normal distribution may be generated as the number of characters, and then characters satisfying the number of characters may be obtained from a text corpus obtained in advance as a sample title. Each sample title corresponds to a randomly generated number of characters until the number of the obtained sample titles meets the requirement.
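A minimal sketch of this procedure is shown below. The mean and standard deviation are placeholder values, and rejecting draws that fall outside the valid range (rather than clamping them) is an assumption of the sketch.

```python
import random


def sample_char_count(mean, std, rng, min_chars, max_chars):
    """Draw a character count from a normal distribution, redrawing invalid values."""
    while True:
        n = round(rng.gauss(mean, std))
        if min_chars <= n <= max_chars:
            return n


def sample_titles(corpus, count, mean, std, seed=0):
    """Collect `count` sample titles whose lengths follow the given normal distribution."""
    rng = random.Random(seed)
    titles = []
    while len(titles) < count:
        n = sample_char_count(mean, std, rng, 1, len(corpus))
        start = rng.randrange(len(corpus) - n + 1)
        titles.append(corpus[start:start + n])
    return titles


corpus = "abcdefghijklmnopqrstuvwxyz" * 4
titles = sample_titles(corpus, count=5, mean=10, std=3)  # hypothetical parameters
print([len(t) for t in titles])
```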

In an embodiment of the present invention, multiple groups of continuous characters can be selected directly from the full text of the text corpus to obtain multiple sample titles. Alternatively, multiple groups of continuous characters can be selected from the text corpus according to a preset title selection rule to obtain multiple sample titles. The title selection rule may include at least one of the following rules, and is not specifically limited thereto:

Rule 1: the continuous characters do not contain preset punctuation marks. The preset punctuation marks may be ",", ";", ".", and the like. Continuous characters selected in this way can be characters belonging to the same sentence, so the characters of the resulting sample title are strongly related to one another. When sample data obtained from such sample titles is later used to train an OCR model, the trained OCR model can recognize titles well by exploiting this contextual relationship.

Rule 2: the continuous characters contain a preset title identifier. For example, in the case where the text corpus is a Chinese text corpus, the title identifier may be "yes", "occurrence", or the like, and in the case where the text corpus is an English text corpus, the title identifier may be "is", "will", or the like. The title identifiers may be obtained by counting identifiers that are common in real titles. A sample title obtained in this way is closer to a real title.

Rule 3: the groups of continuous characters are selected at a preset starting position of the text corpus. The preset starting position may be the start of a paragraph, the start of a chapter, or the like. In the text of a novel, magazine, newspaper, etc., a title usually appears at such a starting position. Therefore, when groups of continuous characters are selected at the preset starting positions of the text corpus and each group is used as a sample title, the obtained sample title is likely to be a real title, which improves the accuracy of the sample titles.
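The three rules above can be sketched as simple filters. The default punctuation set, the whitespace-based word matching (which suits English but not Chinese text), and the use of line breaks as paragraph boundaries are all assumptions of this sketch.

```python
def passes_rules(candidate: str, banned_punctuation: str = ",;.",
                 title_identifiers=()) -> bool:
    """Rule 1: no preset punctuation. Rule 2: contains a title identifier, if any are given."""
    if any(ch in banned_punctuation for ch in candidate):
        return False
    if title_identifiers and not any(word in candidate.split()
                                     for word in title_identifiers):
        return False
    return True


def titles_at_paragraph_starts(corpus: str, num_chars: int):
    """Rule 3: take the first `num_chars` characters of each sufficiently long paragraph."""
    paragraphs = [p.strip() for p in corpus.split("\n")]
    return [p[:num_chars] for p in paragraphs if len(p) >= num_chars]


print(passes_rules("the storm is coming", title_identifiers=("is", "will")))
print(passes_rules("wait, what is it"))
print(titles_at_paragraph_starts("Chapter one begins here.\nShort\nAnother paragraph follows.", 7))
```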

In an embodiment of the present invention, for the step 102, when adjusting the character style of the sample title, at least one of the character size, the character font, and the character color of the sample title may be adjusted, which will be described in detail below.

In an embodiment of the present invention, when the character style includes a character size, the obtained character size of the sample title may be adjusted, so that the second distribution information corresponding to the adjusted sample title is consistent with the second reference information.

The second distribution information represents the distribution of the number of sample titles over different character sizes. The second distribution information may be expressed as the ratio of the number of groups with each character size to the total number of groups. For example, in the case where the character size is represented by a height value in units of pixels, the second distribution information may be: groups of continuous characters with a character size of 32 pixels account for 25%, groups with a character size of 48 pixels account for 30%, groups with a character size of 56 pixels account for 30%, and groups with a character size of 64 pixels account for 15%. The second distribution information may also be represented by a normal distribution in statistics, which may be determined from the mean and variance of the character sizes of the groups of continuous characters.

The second reference information characterizes the distribution of the number of real titles over different character sizes. The second reference information may be expressed as the ratio of the number of real titles with each character size to the total number of real titles. For example, in the case where the character size is represented by a height value in units of pixels, the second reference information may be: real titles with a character size of 20 pixels account for 15%, real titles with a character size of 30 pixels account for 30%, real titles with a character size of 40 pixels account for 35%, and real titles with a character size of 50 pixels account for 20%.

The second reference information may also be represented by a normal distribution in statistics, and specifically, the mean and the variance of the character size of each real title may be counted, and then the normal distribution of the number of titles of real titles with different character sizes may be determined according to the mean and the variance, and used as the second reference information. The method for obtaining the second reference information is similar to the method for obtaining the first reference information in step 101, and is not described herein again.

In addition to this, the second distribution information or the second reference information may be represented by a poisson distribution, a binomial distribution, or the like in statistics. The embodiments of the present invention are not limited thereto.

Specifically, the character sizes common in real titles may be counted in advance to obtain second reference information reflecting the distribution of the number of real titles over different character sizes, and the character size of the sample titles may then be adjusted according to the second reference information. In this way, the character sizes in the sample titles are closer to the character sizes in real titles, the obtained sample data is closer to real data, and the accuracy of the obtained sample data is improved.

In an embodiment of the present invention, when the second reference information is expressed as the proportion of real titles having each character size, the character size of the sample titles may be adjusted according to the proportion corresponding to each character size. For example, in the case where the character size is represented by a height value in units of pixels, assuming that real titles with a character size of 50 pixels account for 30% and that there are 5000 sample titles in total, 1500 sample titles may be selected from the 5000 sample titles, and the character size of the 1500 selected sample titles may be adjusted to 50 pixels.

In an embodiment of the present invention, when the second reference information is represented by a normal distribution, a random number conforming to the normal distribution may be generated as a character size, a sample title may then be selected from the sample titles, and the character size of that sample title may be adjusted to the generated character size. Each sample title corresponds to one randomly generated character size, and the process continues until the character sizes of all sample titles have been adjusted.
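As a rough sketch of the normal-distribution variant, the mean and spread can be estimated from measured real-title sizes and then sampled once per sample title. The measured sizes below are invented for illustration, and the lower bound `min_size` is an assumption added to avoid unusably small characters.

```python
import random
from statistics import NormalDist


def fit_size_distribution(real_sizes):
    """Estimate a normal distribution from character sizes (in pixels) of real titles."""
    return NormalDist.from_samples(real_sizes)


def sample_sizes(dist, count, min_size=8, seed=0):
    """Draw one character size per sample title, redrawing values below `min_size`."""
    rng = random.Random(seed)
    sizes = []
    while len(sizes) < count:
        size = round(rng.gauss(dist.mean, dist.stdev))
        if size >= min_size:
            sizes.append(size)
    return sizes


dist = fit_size_distribution([20, 30, 40, 50])  # hypothetical measurements
print(dist.mean, round(dist.stdev, 2))
print(sample_sizes(dist, count=5))
```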

In an embodiment of the present invention, in a case where the character style includes a character font, the obtained character font of the sample title may be adjusted, so that a difference between a proportion occupied by the sample title of the different character fonts after the adjustment and a proportion occupied by the real title of the different character fonts in the real title is reduced.

Specifically, the proportion of the real titles of the different character fonts in the real titles can be counted in advance, and the character fonts of the sample titles are adjusted according to the proportion, so that the proportion of the sample titles of the different character fonts after adjustment is close to the proportion of the real titles of the different character fonts in the real titles. Therefore, the character font in the sample title is closer to the character font in the real title, and the obtained sample data is closer to the real data, so that the accuracy of the obtained sample data is improved.

For example, if 100 real titles are counted to obtain 30 real titles of a round font, 50 real titles of an italian font and 20 real titles of a roman font, it can be known that the real titles of the round font account for 30%, the real titles of the italian font account for 50%, and the real titles of the roman font account for 20%. Therefore, 30% of sample titles can be selected from the sample titles, and the character fonts of the sample titles are adjusted to be round; selecting 50% of sample titles from the sample titles, and adjusting the character fonts of the sample titles into Italian fonts; the character font of the remaining 20% of the sample titles is adjusted to roman.
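The counting step in this example, followed by a font assignment, might look like the sketch below. Weighted random assignment is one possible way to keep the sample proportions close to the real ones; the embodiment does not prescribe it, and the function names are illustrative.

```python
import random
from collections import Counter


def font_proportions(real_title_fonts):
    """Count how often each font occurs among real titles and convert to proportions."""
    counts = Counter(real_title_fonts)
    total = sum(counts.values())
    return {font: count / total for font, count in counts.items()}


def assign_fonts(titles, proportions, rng):
    """Give each sample title a font drawn with probability equal to its proportion."""
    fonts = list(proportions)
    weights = [proportions[f] for f in fonts]
    return [(title, rng.choices(fonts, weights=weights)[0]) for title in titles]


real_fonts = ["round"] * 30 + ["italian"] * 50 + ["roman"] * 20
proportions = font_proportions(real_fonts)
print(proportions)
print(assign_fonts(["title a", "title b", "title c"], proportions, random.Random(0)))
```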

In an embodiment of the present invention, after the proportion of the real titles of different character fonts in the real titles is obtained through statistics, the proportion may be adjusted manually. Specifically, for a character font with a small proportion of a real title, in order to improve the recognition capability of the trained OCR model for the title of the character font during the subsequent OCR model training, the number of sample data corresponding to the character font needs to be increased, and therefore, the proportion corresponding to the character font can be manually increased. Similarly, for a character font with a large proportion of the real title, in order to balance the number of samples of different character fonts in sample data, the worker may lower the proportion corresponding to the character font.

In addition, a worker may manually add some character fonts and set a proportion for each added character font. For example, assuming that the 100 counted real titles include only 5 character fonts, and the worker knows from experience that other common character fonts also occur in real titles, the worker can add those other common character fonts and set a proportion for each of them. In this way, the OCR model obtained by subsequent training is better able to recognize titles in a variety of character fonts.
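Manual adjustment of the proportions can be sketched as merging worker-supplied overrides into the counted proportions and renormalizing. The override values and the added font name "gothic" are hypothetical, and renormalizing so the weights sum to 1 is an assumption of this sketch.

```python
def adjust_weights(proportions, overrides):
    """Apply manual overrides (changed or newly added fonts), then renormalize to sum to 1."""
    merged = {**proportions, **overrides}
    if any(weight < 0 for weight in merged.values()):
        raise ValueError("weights must be non-negative")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {font: weight / total for font, weight in merged.items()}


counted = {"round": 0.3, "italian": 0.5, "roman": 0.2}
# Lower a dominant font, raise a rare one, and add a font missing from the count.
adjusted = adjust_weights(counted, {"italian": 0.4, "roman": 0.3, "gothic": 0.1})
print(adjusted)
```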

In one embodiment of the present invention, in a case where the character style includes a character color, at least one character color may be selected as a reference color from among character colors existing in a real title; for each sample title, the character color of the sample title is set to at least one color of the reference colors.

Specifically, the character colors common in real titles, such as red, green, purple, yellow, and black, may be counted in advance, and for each sample title, the character color of the sample title may be set to any one of the common character colors. In this way, the character colors in the sample titles are closer to the character colors in real titles, the obtained sample data is closer to real data, and the accuracy of the obtained sample data is improved.

In one embodiment of the present invention, for each sample title, a color may be randomly selected from common character colors, and the character color of the sample title is set to the selected color. The character color may also be set for each sample title on average. For example, assuming that the character colors common in the real titles include red, blue, and black, the character color of one third of the sample titles may be set to red, the character color of one third of the sample titles may be set to blue, and the character color of one third of the sample titles may be set to black.
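Both color assignment strategies can be sketched in a few lines. The color names follow the example above; the function names and the round-robin implementation of the even split are illustrative assumptions.

```python
import random
from itertools import cycle


def assign_colors_randomly(titles, colors, rng):
    """Pick one reference color at random for each sample title."""
    return [(title, rng.choice(colors)) for title in titles]


def assign_colors_evenly(titles, colors):
    """Cycle through the reference colors so each is used about equally often."""
    return [(title, color) for title, color in zip(titles, cycle(colors))]


titles = ["t1", "t2", "t3", "t4", "t5", "t6"]
colors = ["red", "blue", "black"]
print(assign_colors_randomly(titles, colors, random.Random(0)))
print(assign_colors_evenly(titles, colors))
```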

In an embodiment of the present invention, for each sample title, the characters included in the sample title may also be set to different character colors, such as a gradient color, a stripe color, and the like.

The character style of the sample title is adjusted, so that the character style of the sample title is close to the character style of the real title, and when sample data is obtained according to the sample title subsequently, the sample data can be closer to the real data, so that the accuracy and the authenticity of the sample data can be improved.

Referring to fig. 4, fig. 4 is a schematic flowchart of another sample data obtaining method according to an embodiment of the present invention, where the method includes the following steps 401 to 406.

Step 401, based on the first normal distribution of the number of titles of the real titles with different numbers of characters, obtaining multiple groups of continuous characters from the text corpus obtained in advance, and taking each group of continuous characters as a sample title.

Among the obtained continuous characters, the distribution of the group number of the continuous characters with different character numbers accords with the first normal distribution.

Step 402, adjusting the character size of the obtained sample title based on the second normal distribution of the number of titles of the real titles with different character sizes.

In the adjusted sample titles, the distribution of the number of titles of sample titles with different character sizes conforms to the second normal distribution.

Step 403, adjusting the character fonts of the obtained sample titles based on the proportion of the real titles of different character fonts in the real titles.

Wherein, the proportion of the sample titles of the different character fonts after adjustment is close to the proportion of the real titles of the different character fonts in the real titles.

Step 404, selecting at least one character color from the character colors of the real titles as a reference color, and setting the character color of each sample title to at least one of the reference colors.

Step 405, for each sample title, superimposing the sample title on a preset background image to obtain a composite image, and intercepting an image area containing the sample title in the composite image.

Step 406, preprocessing each image area, and using the preprocessed image area as sample data.

The preprocessing is used to make each image area more closely resemble a real image containing a real title.

Referring to fig. 5, fig. 5 is a schematic flow chart of a model training method according to an embodiment of the present invention, where the method includes the following steps 501 to 507.

Step 501, based on the first normal distribution of the number of real English titles with different numbers of characters, obtaining multiple groups of continuous English characters from the pre-obtained English text corpus, and taking each group of continuous English characters as a sample English title.

The characters contained in the above English text corpus are English characters, and the English text corpus may be an English novel, an English newspaper, or the like.

Step 502, based on the second normal distribution of the number of real English titles with different character sizes, adjusting the character size of the obtained sample English titles.

In the adjusted sample English titles, the distribution of the number of sample English titles with different character sizes conforms to the second normal distribution.

Step 503, based on the proportion of real English titles of each character font among the real English titles, adjusting the character font of the obtained sample English titles.

The proportion of the sample English titles with different character fonts after adjustment is close to the proportion of the real English titles with different character fonts in the real English titles.

Step 504, selecting at least one character color from the character colors of the real English titles as a reference color, and setting the character color of each sample English title to at least one of the reference colors.

Step 505, for each sample English title, superimposing the sample English title on a preset background image to obtain a composite image, and capturing an image area containing the sample English title in the composite image.

Step 506, preprocessing each image area, and taking the preprocessed image area as sample English data.

Step 507, training the initial OCR model by taking the sample English data as input and the sample English titles as labels to obtain a trained OCR model for recognizing English titles.

When the scheme provided by the embodiment is applied to obtain the sample data, a plurality of groups of continuous characters are directly selected from the text corpus, and the sample data can be obtained according to the selected characters, so that the consumption of manpower and material resources is low. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the sample data obtaining efficiency can be improved.

In addition, the number of characters, the size of characters, the font of characters, and the color of characters of the sample title are obtained by referring to the number of characters, the size of characters, the font of characters, and the color of characters of the real title, so that the obtained sample title is close to the real title, and further, sample data obtained according to the sample title is close to the real data. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the accuracy of the obtained sample data can be improved.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a sample data obtaining apparatus according to an embodiment of the present invention, the apparatus includes:

a sample header obtaining module 601, configured to obtain multiple groups of consecutive characters from a text corpus obtained in advance, and use each group of consecutive characters as a sample header, where first distribution information corresponding to the obtained sample header is consistent with first reference information, and the first distribution information represents: the title number distribution of sample titles with different character numbers, the first reference information characterizes: the distribution condition of the number of titles of real titles with different numbers of characters;

a character style adjusting module 602, configured to adjust the obtained character style of the sample title with reference to the character style of the real title;

a sample data obtaining module 603, configured to obtain, as sample data, an image whose content includes the adjusted sample title for each adjusted sample title.

In an embodiment of the present invention, when the character style includes a character size, the character style adjusting module 602 is specifically configured to:

In an embodiment of the present invention, when the character style includes a character font, the character style adjusting module 602 is specifically configured to:

In an embodiment of the present invention, when the character style includes a character color, the character style adjusting module 602 is specifically configured to:

In an embodiment of the present invention, the sample data obtaining module 603 is specifically configured to:

An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701, a communication interface 702, a memory 703, and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704;

the memory 703 is configured to store a computer program;

the processor 701 is configured to implement the steps of the sample data obtaining method when executing the program stored in the memory 703.

The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic device and other equipment.

The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.

In another embodiment of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when run on a computer, cause the computer to execute the sample data obtaining method described in any one of the above embodiments.

In another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the sample data obtaining method described in any of the above embodiments.

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present specification are described in a related manner; for the same or similar parts among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiments, electronic device embodiments, computer-readable storage medium embodiments, and computer program product embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for relevant details, reference may be made to the corresponding description of the method embodiments.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for obtaining sample data, the method comprising:

adjusting the character style of the obtained sample title based on the character style of the real title;

2. The method according to claim 1, wherein in a case where the character style includes a character size, the adjusting the obtained character style of the sample title with reference to a character style of a real title includes:

3. The method according to claim 1, wherein in a case where the character style includes a character font, the adjusting the character style of the obtained sample title with reference to the character style of the real title includes:

4. The method according to claim 1, wherein in a case where the character style includes a character color, the adjusting the obtained character style of the sample title with reference to the character style of the real title includes:

5. The method according to any one of claims 1-4, wherein the obtaining, for each adjusted sample title, an image whose content includes the adjusted sample title as sample data comprises:

6. The method according to any one of claims 1-4, wherein the obtaining, for each adjusted sample title, an image whose content includes the adjusted sample title as sample data comprises:

7. An apparatus for obtaining sample data, the apparatus comprising:

8. The apparatus of claim 7, wherein, in a case that the character style comprises a character size, the character style adjustment module is specifically configured to:

9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.