CN112613572B - Sample data obtaining method and device, electronic equipment and storage medium

Info

Publication number: CN112613572B
Application number: CN202011608431.8A
Authority: CN
Inventors: 尹天舒
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2024-01-23
Anticipated expiration: 2040-12-30
Also published as: CN112613572A

Abstract

The embodiment of the invention provides a sample data obtaining method, a sample data obtaining device, electronic equipment and a storage medium, and relates to the technical field of computers, wherein the method comprises the following steps: obtaining a plurality of groups of continuous characters from a pre-obtained text corpus, and taking each group of continuous characters as a sample title, wherein first distribution information corresponding to the obtained sample title is consistent with first reference information, and the first distribution information is characterized in that: title number distribution of sample titles of different character numbers, the first reference information characterizes: title number distribution of real titles of different character numbers; adjusting the character style of the obtained sample title by referring to the character style of the real title; for each adjusted sample title, an image whose content includes the adjusted sample title is obtained as sample data. By applying the sample data obtaining scheme provided by the embodiment of the invention, the sample data obtaining efficiency can be improved.

Description

Sample data obtaining method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for obtaining sample data, an electronic device, and a storage medium.

Background

When information such as news, articles, advertisements is audited, it is often necessary to identify the title of the information.

In the prior art, a title of information in a text format is generally scanned to obtain an image containing the title, and character recognition is performed on the scanned image by using an OCR (Optical Character Recognition) model to obtain the title of the information. Sample data must be acquired in order to train the OCR model. Currently, images containing information titles are generally collected from various sources, for example, by taking screenshots of web pages containing information titles or by photographing newspapers or magazines containing information titles, and the collected images are used as sample data.

Although sample data can be obtained in this way, a lot of manpower and material resources are required to locate content containing information titles before the sample data can be obtained, so the sample data obtaining efficiency is low.

Disclosure of Invention

The embodiment of the invention aims to provide a sample data obtaining method, a sample data obtaining device, electronic equipment and a storage medium, so as to improve sample data obtaining efficiency. The specific technical scheme is as follows:

In a first aspect of the present invention, there is provided a sample data obtaining method, the method comprising:

obtaining a plurality of groups of continuous characters from a pre-obtained text corpus, and taking each group of continuous characters as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information characterizes: title number distribution of sample titles of different character numbers, and the first reference information characterizes: title number distribution of real titles of different character numbers;

adjusting the character style of the obtained sample title by referring to the character style of the real title;

for each adjusted sample title, an image whose content includes the adjusted sample title is obtained as sample data.

In one embodiment of the present invention, in the case where the character style includes a character size, the adjusting the character style of the obtained sample title with reference to the character style of the real title includes:

adjusting the character size of the obtained sample title so that second distribution information corresponding to the adjusted sample title is consistent with second reference information, wherein the second distribution information represents: title number distribution of sample titles of different character sizes, the second reference information characterizes: title number distribution of real titles of different character sizes.

In one embodiment of the present invention, in the case that the character style includes a character font, the adjusting the character style of the obtained sample title with reference to the character style of the real title includes:

and adjusting the character fonts of the obtained sample titles, so that the difference between the proportion of the sample titles of different character fonts after adjustment and the proportion of the real titles of different character fonts in the real titles is reduced.

In one embodiment of the present invention, in the case where the character style includes a character color, the adjusting the character style of the obtained sample title with reference to the character style of the real title includes:

selecting at least one character color from character colors existing in a real title as a reference color;

for each sample title, the character color of the sample title is set to at least one of the reference colors.

In one embodiment of the present invention, for each adjusted sample title, obtaining an image whose content includes the adjusted sample title as sample data includes:

for each adjusted sample title, superimposing the adjusted sample title on a preset background image to obtain a composite image, and intercepting an image area containing the adjusted sample title in the composite image as sample data;

or, for each adjusted sample title, obtaining an image whose content includes the adjusted sample title, preprocessing the image, and taking the preprocessed image as sample data, wherein the preprocessing is used for making the image closer to a real image containing a real title.

In a second aspect of the present invention, there is also provided a sample data obtaining apparatus, the apparatus comprising:

the sample title obtaining module is used for obtaining a plurality of groups of continuous characters from a pre-obtained text corpus, and taking each group of continuous characters as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information characterizes: title number distribution of sample titles of different character numbers, and the first reference information characterizes: title number distribution of real titles of different character numbers;

the character style adjustment module is used for referring to the character style of the real title and adjusting the character style of the obtained sample title;

and the sample data obtaining module is used for obtaining, for each adjusted sample title, an image whose content includes the adjusted sample title as sample data.

In one embodiment of the present invention, in the case that the character style includes a character size, the character style adjustment module is specifically configured to:

In one embodiment of the present invention, in the case that the character style includes a character font, the character style adjustment module is specifically configured to:

In one embodiment of the present invention, in the case that the character style includes a character color, the character style adjustment module is specifically configured to:

In one embodiment of the present invention, the sample data obtaining module is specifically configured to:

In a third aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of the first aspects when executing a program stored on a memory.

In yet another aspect of the present invention, there is also provided a computer readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform any of the sample data obtaining methods described above.

In yet another aspect of the invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the sample data acquisition methods described above.

When the scheme provided by the embodiment of the invention is applied to obtain sample data, a plurality of groups of continuous characters are obtained from a text corpus obtained in advance, and each group of continuous characters is used as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information characterizes: title number distribution of sample titles of different character numbers, and the first reference information characterizes: title number distribution of real titles of different character numbers. The character style of the obtained sample titles is then adjusted with reference to the character style of the real titles, and for each adjusted sample title, an image whose content includes the adjusted sample title is obtained as sample data. When sample data is obtained in this way, a plurality of groups of continuous characters are selected directly from the text corpus, and the sample data can then be obtained from the selected characters, so that the consumption of manpower and material resources is low. Therefore, by applying the sample data obtaining scheme provided by the embodiment of the invention, the sample data obtaining efficiency can be improved.

In addition, sample titles of different character numbers are obtained so that their title number distribution is consistent with the title number distribution of real titles of different character numbers, and the character styles of the sample titles are adjusted according to the character styles of the real titles. The obtained sample titles are therefore close to real titles, and the sample data obtained from the sample titles is close to real data. Therefore, by applying the sample data obtaining scheme provided by the embodiment of the invention, the accuracy of the obtained sample data can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

FIG. 1 is a flow chart of a sample data obtaining method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a real title distribution according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of sample data according to an embodiment of the present invention;

FIG. 4 is a flowchart of another sample data obtaining method according to an embodiment of the present invention;

FIG. 5 is a schematic flow chart of a model training method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a sample data obtaining apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.

Because of the problem of low sample data obtaining efficiency in the prior art, in order to solve the technical problem, the embodiment of the invention provides a sample data obtaining method, a sample data obtaining device, electronic equipment and a storage medium.

In one embodiment of the present invention, there is provided a sample data obtaining method including:

When the scheme provided by the embodiment is applied to obtain sample data, a plurality of groups of continuous characters are directly selected from the text corpus, and then the sample data can be obtained according to the selected characters, so that the consumption of manpower and material resources is low. Therefore, by applying the sample data obtaining scheme provided by the embodiment of the invention, the sample data obtaining efficiency can be improved.

In addition, sample titles of different character numbers are obtained so that their title number distribution is consistent with the title number distribution of real titles of different character numbers, and the character styles of the sample titles are adjusted according to the character styles of the real titles. The obtained sample titles are therefore close to real titles, and the sample data obtained from the sample titles is close to real data. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the accuracy of the obtained sample data can be improved.

In an application scenario of the embodiment of the present invention, sample data obtained by applying the sample data obtaining method may be used for training an OCR model, and the OCR model obtained by training may be used for character recognition of the title content in an image.

The method, the device, the electronic equipment and the storage medium for obtaining sample data provided by the embodiment of the invention are described in detail by the specific embodiment.

Referring to fig. 1, fig. 1 is a flow chart of a sample data obtaining method according to an embodiment of the present invention, where the method can be applied to electronic devices such as a desktop computer, a notebook computer, a tablet computer, and the like, and is not limited in particular.

The method comprises the following steps 101 to 103.

Step 101, obtaining a plurality of groups of continuous characters from a pre-obtained text corpus, and taking each group of continuous characters as a sample title.

The text corpus may be text contained in a novel, a magazine, a newspaper, or the like. The continuous characters may be a continuous character string in the text corpus, and may consist of English characters, numeric characters, Chinese characters, etc., such as "Foreigners encouraged to invest in more industries", "blow-face-free willow" etc.

The first distribution information corresponding to the obtained sample titles is consistent with the first reference information.

The first distribution information characterizes: title number distribution of sample titles of different character numbers. The first distribution information may be represented by the ratio of the number of sample titles of each character number to the total number of sample titles. For example, the first distribution information may be: sample titles of 20 characters account for 25% of the sample titles, sample titles of 25 characters account for 30%, sample titles of 30 characters account for 30%, and sample titles of 35 characters account for 15%.

In addition, the first distribution information may be represented by a normal distribution in statistics, and in particular, the first distribution information of the sample titles may be obtained according to the mean and variance of the number of characters of each group of sample titles. Thus, when the number of sample titles is large, the distribution of the sample titles can be more clearly represented by normal distribution.

The first reference information characterizes: title number distribution of real titles of different character numbers. The first reference information may be represented by a ratio of the number of titles of different character numbers in the real title to the total number of real titles, for example, the first reference information may be: the title number of the real title of 20 characters is 15%, the title number of the real title of 25 characters is 30%, the title number of the real title of 30 characters is 35%, and the title number of the real title of 35 characters is 20%.
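The proportion-based representation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `length_distribution` and the example titles are assumptions made here and do not appear in the embodiment.

```python
from collections import Counter


def length_distribution(titles):
    """Return {character_number: share of titles having that character number}."""
    counts = Counter(len(title) for title in titles)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}


# Hypothetical real titles, used only to illustrate the calculation.
real_titles = ["abcd", "efgh", "ijklm", "nopqrs"]
first_reference_info = length_distribution(real_titles)
```

The same function could be applied to the obtained sample titles to produce the first distribution information, so that the two distributions can be compared.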

The first reference information may also be represented by a normal distribution in statistics, specifically, a mean and a variance of the number of characters of each real title may be counted, and then a normal distribution of the number of titles of the real titles of different numbers of characters may be determined according to the mean and the variance, as the first reference information.

For example, assuming that statistics are made on 500 real titles, the number of titles of real titles of different character numbers is shown in table 1 below:

TABLE 1

As is clear from table 1, of 500 real titles, the number of titles of the real title having the number of characters 18 is 15, the number of titles of the real title having the number of characters 20 is 35, the number of titles of the real title having the number of characters 22 is 50, and so on, the number of titles of the real title having the number of characters 34 is 10.

Referring to fig. 2, fig. 2 is a schematic diagram of real title distribution according to an embodiment of the present invention. The number of titles of real titles of different character numbers in table 1 is plotted to obtain the bar chart shown in fig. 2 as a schematic diagram of real title distribution. In the bar chart shown in fig. 2, the abscissa indicates the number of characters, and the ordinate indicates the number of real titles having that number of characters. As can be seen from fig. 2, the character numbers of the 500 real titles are mostly distributed around 26, and real titles with far fewer or far more than 26 characters are rare, so the character numbers of the 500 real titles approximately conform to a normal distribution.

The mean and variance of the number of characters of the 500 real titles can be calculated and substituted into the normal distribution formula:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Wherein x represents the number of characters, μ represents the mean, and σ² represents the variance (σ being the standard deviation). Substituting the mean and the variance into the normal distribution formula gives the title number distribution of real titles of different character numbers, thereby obtaining the first reference information.
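As a minimal sketch of this fitting step, the following Python code computes the mean and variance of a list of character numbers and evaluates the normal density for a given character number. The helper names and the example character numbers are hypothetical; the code takes the variance as input and uses the standard normal density formula.

```python
import math
from statistics import fmean, pvariance


def fit_normal(char_counts):
    """Return (mean, variance) of the character numbers of real titles."""
    mu = fmean(char_counts)
    var = pvariance(char_counts, mu)  # population variance around mu
    return mu, var


def normal_pdf(x, mu, var):
    """Normal density at x for mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical character numbers of six real titles (not the data of table 1).
char_counts = [22, 24, 26, 26, 28, 30]
mu, var = fit_normal(char_counts)
density_at_mean = normal_pdf(26, mu, var)
```

Evaluating `normal_pdf` at each character number gives an approximate share of real titles with that character number, which could serve as the first reference information.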

In addition, the first distribution information or the first reference information may be represented by a Poisson distribution, a binomial distribution, or the like in statistics. The embodiment of the present invention is not limited thereto.

Specifically, the distribution condition of the number of titles of the real titles with different character numbers can be obtained in advance and used as the first reference information, then the number of titles of the sample titles with different character numbers to be obtained is determined according to the first reference information and the number of preset total sample titles, and continuous characters with different character numbers are obtained from the text corpus according to the determined number of titles, so that the corresponding sample titles with the first distribution information consistent with the first reference information are obtained.

For example, assume that the first reference information is: real titles of 18 characters account for 45% of the real titles and real titles of 12 characters account for 55%, and that the preset total number of sample titles is 20000. It is then determined that, among the sample titles to be obtained, the number of sample titles of 18 characters is 9000 and the number of sample titles of 12 characters is 11000, so that 9000 groups of 18 continuous characters and 11000 groups of 12 continuous characters can be obtained from the text corpus, thereby obtaining sample titles whose first distribution information is consistent with the first reference information.
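The arithmetic of this example can be sketched as follows. The function `allocate_counts` is a hypothetical helper; it takes integer percentages to avoid floating-point rounding and, as an assumption not stated in the embodiment, gives any leftover titles to the character numbers with the largest remainders.

```python
def allocate_counts(reference_percent, total):
    """Map {character_number: integer percent} to {character_number: title count}."""
    if sum(reference_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer share of each character number, rounded down.
    counts = {n: total * p // 100 for n, p in reference_percent.items()}
    # Hand out titles lost to rounding, largest remainder first (an assumption).
    leftover = total - sum(counts.values())
    by_remainder = sorted(
        reference_percent,
        key=lambda n: (total * reference_percent[n]) % 100,
        reverse=True,
    )
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


# The example from the text: 45% of titles have 18 characters, 55% have 12.
example = allocate_counts({18: 45, 12: 55}, 20000)
```

With the values from the text, `example` holds 9000 titles of 18 characters and 11000 titles of 12 characters.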

The first reference information represents the title number distribution of real titles of different character numbers, and the first distribution information corresponding to the obtained sample titles is consistent with the first reference information; that is, the character numbers of the obtained sample titles are close to the character numbers of real titles. Thus, the sample data obtained according to the sample titles is more similar to real data, and the accuracy of the obtained sample data is improved.

Step 102, referring to the character style of the real title, adjusting the character style of the obtained sample title.

The character style may include at least one of character font, character size, character color, character spacing, character typesetting mode, etc.

For Chinese characters, the character font may be regular script, Song typeface, clerical script, etc.; for English characters, the character font may be rounded, italic, roman, etc.

The character size may be expressed in pixels as width by height, such as 32 pixels by 52 pixels, 40 pixels by 60 pixels, etc., or as a single width value or height value in pixels, such as 32 pixels, 40 pixels, etc. The character size may also be expressed as a font size number, such as No. 4, No. 2, No. 17, etc.

The character color may be black, blue, red, etc., and may also be a gradient, a mottled pattern, a striped pattern, etc.

The character spacing may be a fixed value such as 20 points, 15 points, 30 points, etc., and may also be a multiple such as single spacing, 1.5 times spacing, 2.0 times spacing, etc.

The character typesetting mode may be horizontal, vertical, and the like.

Specifically, the character style of the real title may be obtained in advance, for example, the character font, the character size, etc. of the real title may be obtained in advance, and the character style of the obtained sample title such as the character font, the character size, etc. may be adjusted with reference to the character style of the real title obtained in advance, so that the character style of the sample title approaches the character style of the real title, thereby improving the accuracy of the obtained sample data when the sample data is obtained from the sample title.
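One possible way to bring the font proportions of the sample titles close to those of the real titles is weighted random assignment, sketched below. The embodiment does not prescribe this particular technique; the function name `assign_fonts` and the font shares are hypothetical values chosen for illustration.

```python
import random


def assign_fonts(sample_titles, font_shares, seed=0):
    """Pair each sample title with a font drawn according to font_shares."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    fonts = list(font_shares)
    weights = [font_shares[f] for f in fonts]
    chosen = rng.choices(fonts, weights=weights, k=len(sample_titles))
    return list(zip(sample_titles, chosen))


# Hypothetical shares of fonts observed in real titles.
font_shares = {"regular script": 0.5, "Song typeface": 0.3, "clerical script": 0.2}
styled = assign_fonts(["title one", "title two", "title three"], font_shares)
```

With a large number of sample titles, the share of each font among the styled titles tends toward the share given in `font_shares`. The same approach could be applied to character sizes or colors.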

Step 103, for each adjusted sample title, obtaining an image containing the adjusted sample title as sample data.

Specifically, for each adjusted sample title, image scanning, screenshot, image acquisition, etc. may be performed on the adjusted sample title, so as to obtain an image whose content includes the adjusted sample title as sample data.

In one embodiment of the present invention, when sample data is obtained, for each adjusted sample title, the adjusted sample title may be superimposed on a preset background image to obtain a composite image, and an image area including the adjusted sample title in the composite image is taken as sample data.

The background image may be an image filled with a preset filling color, and the preset filling color may be black, red, blue, or the like. The background image may also be an image having a preset filling effect, and the preset filling effect may be a gradient effect, a texture effect, or the like. Thus, the sample data obtained by superimposing the sample title on the background image is closer to a real image containing a real title, and the accuracy of a model trained on the sample data is higher.

In one embodiment of the present invention, the background image may be obtained from a background image where a real title is located. Specifically, a common title background image can be obtained directly from a shared database as the background image on which the sample title is superimposed. Alternatively, real images containing real titles can be collected, the real title content can be removed from each real image, the removed region can be filled according to the background pattern of the real image, and the filled image can be taken as the background image on which the sample title is superimposed.

The image area may be a rectangular area, a circular area, an elliptical area, or the like. Taking a rectangular area as an example, the image area may be the minimum bounding rectangle of the sample title in the composite image, or may be a rectangular area whose edges are spaced apart from the sample title by a preset width.

Specifically, the sample title is superimposed on a preset background image, so that the obtained composite image is close to the real image where a real title is located. Then, an image area containing the sample title is cut out from the composite image as sample data. This reduces the content included in the sample data other than the sample title, and thus reduces redundancy.
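For the rectangular case, the area to be cut out can be computed from the bounding rectangle of the sample title and a preset margin. The sketch below assumes coordinates of the form (left, top, right, bottom) in pixels and clamps the result to the image boundary; the function name and the coordinate convention are assumptions made for illustration.

```python
def crop_box(text_box, margin, image_size):
    """Expand the title's bounding rectangle by margin, clamped to the image."""
    left, top, right, bottom = text_box
    width, height = image_size
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


# Hypothetical title rectangle inside a 200 x 100 composite image.
box = crop_box((10, 20, 110, 60), margin=5, image_size=(200, 100))
```

A margin of 0 yields the minimum bounding rectangle of the sample title itself.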

In one embodiment of the present invention, in order to facilitate subsequent training of the OCR model with the sample data, when the image area containing the sample title is cut out from the composite image, the image area where each line of characters of the sample title is located may be cut out separately, and each cut-out image area may be used as sample data. Each piece of sample data obtained in this way contains only one line of characters, which makes it easier for the OCR model to recognize the characters in the sample data during training.

For example, referring to fig. 3, fig. 3 is a schematic diagram of sample data according to an embodiment of the present invention. The image filled with the texture effect is the background image, and "Xi vows tough battle against pollution to boost ecological advancement" is the sample title. It can be seen that the sample title is laid out in two rows of characters. Because the composite image is obtained by superimposing the sample title on the background image, that is, all characters in the composite image belong to the sample title, the image area where each row of characters is located can be cut out separately. The area enclosed by each dotted-line frame in fig. 3 is an image area containing one row of characters, and the sample data can be obtained by cutting out each dotted-line frame area.

In one embodiment of the present invention, when the sample title is superimposed on the preset background image, the sample title may be superimposed on the background image directly as a single line, so that when the image area of the single line of characters is cut out from the composite image, all the characters contained in the sample title are included.

In one embodiment of the present invention, when sample data is obtained, for each adjusted sample title, an image whose content contains the adjusted sample title may be obtained, the image may be preprocessed, and the preprocessed image may be taken as sample data, wherein the preprocessing is used to make the image closer to a real image containing a real title.

The preprocessing may include blurring processing such as gaussian blurring processing, motion blurring processing, and the like, and may also include perspective transformation processing, noise addition processing, texture addition processing, and the like.
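As one illustration of such preprocessing, the following sketch adds Gaussian noise to a grayscale image represented as a list of rows of integer pixel values in the range 0 to 255. The representation, the function name and the default noise level are assumptions chosen to keep the example self-contained; an actual implementation would more likely use an image processing library.

```python
import random


def add_gaussian_noise(pixels, sigma=8.0, seed=0):
    """Return a copy of a grayscale image with Gaussian noise added per pixel."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        # Clamp each noisy value back into the valid 0..255 range.
        [min(255, max(0, round(v + rng.gauss(0.0, sigma)))) for v in row]
        for row in pixels
    ]


# Hypothetical 2 x 3 grayscale image.
image = [[0, 128, 255], [64, 64, 64]]
noisy = add_gaussian_noise(image)
```

The original image is left unchanged, so the same rendered title can be preprocessed several times with different seeds to produce multiple pieces of sample data.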

Therefore, the image of the sample data is more similar to the image containing the real title, and when the OCR model is trained by using the sample data later, the recognition effect of the OCR model obtained by training on the real image containing the title can be improved, and the robustness of the OCR model is improved.

In one embodiment of the present application, for each sample title, the sample title may first be superimposed on a preset background image to obtain a composite image, an image area containing the sample title may be cut out from the composite image, the image area may be preprocessed, and the preprocessed image area may be used as sample data. The sample data thus obtained more closely approximates a real image containing a real title.

When the scheme provided by the embodiment is applied to obtain sample data, a plurality of groups of continuous characters are obtained from a text corpus obtained in advance, and each group of continuous characters is used as a sample title, wherein first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information characterizes: title number distribution of sample titles of different character numbers, and the first reference information characterizes: title number distribution of real titles of different character numbers. The character style of the obtained sample titles is then adjusted with reference to the character style of the real titles, and for each adjusted sample title, an image whose content includes the adjusted sample title is obtained as sample data. When sample data is obtained in this way, a plurality of groups of continuous characters are selected directly from the text corpus, and the sample data can then be obtained from the selected characters, so that the consumption of manpower and material resources is low. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the sample data obtaining efficiency can be improved.

In one embodiment of the present invention, when the sample data obtained by applying the above scheme is used for training an OCR model, the obtained sample data may be used as input, the sample title from which each piece of sample data was obtained may be used as its label, and an initial model of the OCR model may be trained, so as to finally obtain an OCR model capable of recognizing the title content in an image. Because the sample data is obtained from the sample titles, the sample titles can be used directly as labels during model training, and the title content in the sample data does not need to be recognized manually, so that the labeling cost can be saved.

When the trained OCR model is used to recognize Chinese titles, the text corpus in step 101 may be text contained in a Chinese novel, magazine, newspaper, etc.; when the trained OCR model is used to recognize English titles, the text corpus may be text contained in an English novel, magazine, newspaper, or the like. Similarly, the text corpus may be a Japanese text corpus, a Russian text corpus, or the like.

In one embodiment of the present invention, for step 101, when obtaining a sample title, a group of continuous characters may be randomly selected from the text corpus obtained in advance as a sample title. For example, in the case where the text corpus is a novel, a group of continuous characters may be selected at any position in the novel content as a sample title. The number of characters of the selected continuous characters may be the number of characters contained in a real title, or may be a number of characters determined based on the title number distribution of real titles of different character numbers in the first reference information.

In one embodiment of the invention, the text corpus may be segmented at intervals of a certain character length, and each group of continuous characters obtained by segmentation is used as a sample title. For example, when the text corpus is a magazine, the magazine content may be segmented at intervals of a certain character length to obtain a plurality of groups of characters. The character length may be the number of characters contained in a real title, or may be a number of characters determined from the title number distribution of real titles with different numbers of characters in the first reference information.
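
The fixed-interval segmentation above can be sketched as follows. The function name `segment_corpus` is hypothetical, and discarding a trailing fragment shorter than the requested length is an assumption; the embodiment does not say how a remainder is handled.

```python
def segment_corpus(corpus: str, length: int) -> list[str]:
    """Split the corpus into consecutive, non-overlapping groups of `length` characters."""
    if length <= 0:
        raise ValueError("length must be positive")
    # Assumption: a trailing fragment shorter than `length` is dropped.
    return [corpus[i:i + length] for i in range(0, len(corpus) - length + 1, length)]


# Hypothetical usage: each returned group would serve as one sample title.
groups = segment_corpus("abcdefghij", 3)
```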

In one embodiment of the present invention, when the first reference information is represented by the proportions of real titles with different numbers of characters, multiple groups of characters with different numbers of characters may be obtained from the pre-obtained text corpus according to the proportion corresponding to each number of characters. For example, assuming that real titles with 45 characters account for 30% of real titles and that 1000 pieces of sample data, that is, 1000 groups of characters, are required in total, 300 groups of continuous characters with 45 characters each may be obtained from the pre-obtained text corpus.
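
The arithmetic in this example (30% of 1000 groups gives 300 groups of 45 characters) can be generalized with a small helper. This is only a sketch: `allocate_counts` is a hypothetical name, the proportions for 30 and 20 characters are invented for illustration, and the largest-remainder rounding is an assumption used so that the counts always add up to the requested total.

```python
def allocate_counts(proportions: dict[int, float], total: int) -> dict[int, int]:
    """Turn per-length proportions into integer group counts that sum to `total`."""
    raw = {length: p * total for length, p in proportions.items()}
    counts = {length: int(value) for length, value in raw.items()}
    leftover = total - sum(counts.values())
    # Assumption: hand any leftover groups to the lengths with the largest remainders.
    by_remainder = sorted(raw, key=lambda length: raw[length] - counts[length], reverse=True)
    for length in by_remainder[:leftover]:
        counts[length] += 1
    return counts


# 45 characters -> 30% comes from the example above; the other entries are made up.
counts = allocate_counts({45: 0.30, 30: 0.50, 20: 0.20}, 1000)
```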

In one embodiment of the present invention, when the first reference information is represented by a normal distribution, a random number conforming to the normal distribution may be generated as the number of characters, and then a group of characters with that number of characters may be obtained from the pre-obtained text corpus as a sample title. Each sample title corresponds to one randomly generated number of characters, and the process repeats until the number of sample titles obtained meets the requirement.
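
A minimal sketch of drawing character counts from a normal distribution is shown below. The name `sample_title_lengths`, the rounding to whole characters, and the redrawing of values outside `[min_len, max_len]` are assumptions; the embodiment only states that a random number conforming to the normal distribution is used as the number of characters.

```python
import random
from typing import Optional


def sample_title_lengths(mean: float, std_dev: float, count: int, rng: random.Random,
                         min_len: int = 1, max_len: Optional[int] = None) -> list[int]:
    """Draw `count` character counts from a normal distribution."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    if max_len is not None and max_len < min_len:
        raise ValueError("max_len must not be smaller than min_len")
    lengths: list[int] = []
    while len(lengths) < count:
        n = round(rng.gauss(mean, std_dev))  # std_dev is the square root of the variance
        # Assumption: values that cannot be a title length are redrawn, not clamped.
        if n < min_len or (max_len is not None and n > max_len):
            continue
        lengths.append(n)
    return lengths


# Hypothetical parameters: mean 20 characters, standard deviation 5.
lengths = sample_title_lengths(20.0, 5.0, 200, random.Random(1), max_len=60)
```

Redrawing slightly distorts the tails of the distribution; clamping to the nearest valid length would be an alternative with a different distortion.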

In one embodiment of the invention, multiple groups of continuous characters may be selected directly from the text corpus to obtain multiple sample titles. Alternatively, multiple groups of continuous characters may be selected from the text corpus according to a preset title selection rule to obtain multiple sample titles. The title selection rule may include, without limitation, at least one of the following rules:

Rule 1: the continuous characters do not contain preset punctuation marks. The punctuation marks may be ",", ";", ".", etc. Continuous characters selected in this way can belong to the same sentence and are strongly related to one another, so the characters of the resulting sample title are strongly related as well. When an OCR model is subsequently trained with sample data obtained from such sample titles, the trained OCR model can better recognize titles by using contextual relevance.

Rule 2: the continuous characters contain a preset title identifier. For example, the title identifier may be "yes", "occurrence", or the like when the text corpus is a Chinese text corpus, or "is", "will", or the like when the text corpus is an English text corpus. The title identifiers may be obtained by counting the identifiers commonly found in real titles. Sample titles obtained in this way are closer to real titles.

Rule 3: the groups of continuous characters are selected at preset starting positions of the text corpus. A preset starting position may be the start of a paragraph, the start of a chapter, etc. In the text of novels, magazines, newspapers, etc., titles generally appear at such starting positions. Therefore, when multiple groups of continuous characters are selected at the preset starting positions of the text corpus and each group is used as a sample title, the probability that an obtained sample title is a real title is high, which can improve the accuracy of the sample titles.
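
The three rules above can be sketched as simple checks. Everything named here is hypothetical: the punctuation set, the identifier list (taken from the English example in Rule 2), whole-word matching for identifiers, and treating each line break as a paragraph start are all illustrative assumptions.

```python
PRESET_PUNCTUATION = set(",;.")     # assumed preset punctuation marks
TITLE_IDENTIFIERS = ("is", "will")  # identifiers from the English example above


def satisfies_rule_1(candidate: str) -> bool:
    """Rule 1: the candidate contains none of the preset punctuation marks."""
    return not any(ch in PRESET_PUNCTUATION for ch in candidate)


def satisfies_rule_2(candidate: str) -> bool:
    """Rule 2: the candidate contains a preset title identifier as a whole word."""
    words = candidate.lower().split()
    return any(identifier in words for identifier in TITLE_IDENTIFIERS)


def paragraph_start_titles(corpus: str, length: int) -> list[str]:
    """Rule 3: take `length` characters from the start of each paragraph."""
    paragraphs = (p.strip() for p in corpus.split("\n"))
    return [p[:length] for p in paragraphs if len(p) >= length]


# Hypothetical usage on a toy two-paragraph corpus.
starts = paragraph_start_titles("First paragraph here.\n\nSecond one follows.", 5)
```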

In one embodiment of the present invention, for step 102, when the character style of a sample title is adjusted, at least one of the character size, the character font and the character color of the sample title may be adjusted, as described in detail below.

In one embodiment of the present invention, in the case where the character style includes a character size, the character sizes of the obtained sample titles may be adjusted such that the second distribution information corresponding to the adjusted sample titles is consistent with the second reference information.

Here, the second distribution information characterizes the title number distribution of sample titles with different character sizes. The second distribution information may be expressed as the ratio of the number of groups with each character size to the total number of groups. For example, when character sizes are expressed as height values in pixels, the second distribution information may be: groups of continuous characters with a character size of 32 pixels account for 25%, groups with a character size of 48 pixels account for 30%, groups with a character size of 56 pixels account for 30%, and groups with a character size of 64 pixels account for 15%. The second distribution information may also be represented by a statistical normal distribution, which may be determined from the mean and variance of the character sizes of the groups of continuous characters.

The second reference information characterizes the title number distribution of real titles with different character sizes. The second reference information may be expressed as the ratio of the number of real titles with each character size to the total number of real titles. For example, when character sizes are expressed as height values in pixels, the second reference information may be: real titles with a character size of 20 pixels account for 15%, real titles with a character size of 30 pixels account for 30%, real titles with a character size of 40 pixels account for 35%, and real titles with a character size of 50 pixels account for 20%.

The second reference information may also be represented by a statistical normal distribution. Specifically, the mean and variance of the character sizes of the real titles may be computed, and a normal distribution of the number of real titles with different character sizes may then be determined from the mean and variance and used as the second reference information. The method for obtaining the second reference information is similar to the method for obtaining the first reference information in step 101 and is not repeated here.
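
Both representations of the second reference information can be derived from a list of observed character sizes, as in the following sketch. The function names are hypothetical, and the list of 20 sizes is constructed only so that it reproduces the 15% / 30% / 35% / 20% example above; using the population variance is likewise an assumption.

```python
from collections import Counter
from statistics import fmean, pvariance


def size_proportions(sizes: list[int]) -> dict[int, float]:
    """Proportion of titles observed at each character size."""
    total = len(sizes)
    return {size: n / total for size, n in sorted(Counter(sizes).items())}


def size_normal_params(sizes: list[int]) -> tuple[float, float]:
    """Mean and (population) variance describing a normal distribution of sizes."""
    return fmean(sizes), pvariance(sizes)


# Hypothetical measurements: 3, 6, 7 and 4 real titles at 20, 30, 40 and 50 pixels.
real_sizes = [20] * 3 + [30] * 6 + [40] * 7 + [50] * 4
proportions = size_proportions(real_sizes)
mean, variance = size_normal_params(real_sizes)
```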

In addition, the second distribution information or the second reference information may be represented by a Poisson distribution, a binomial distribution, or another statistical distribution. The embodiment of the present invention is not limited in this respect.

Specifically, the character sizes commonly found in real titles may be counted in advance to obtain second reference information reflecting the title number distribution of real titles with different character sizes, and the character sizes of the sample titles are then adjusted according to the second reference information. In this way, the character sizes in the sample titles are closer to the character sizes in real titles, and the obtained sample data is closer to real data, which improves the accuracy of the obtained sample data.

In one embodiment of the present invention, when the second reference information is represented by the proportions of real titles with different character sizes, the character sizes of the sample titles may be adjusted according to the proportion corresponding to each character size. For example, when character sizes are expressed as height values in pixels, assuming that real titles with a character size of 50 pixels account for 30% of real titles and that there are 5000 pieces of sample data in total, 1500 of the 5000 sample titles may be selected and the character size of the selected 1500 sample titles may be adjusted to 50 pixels.
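
One way to carry out this proportional adjustment is to shuffle the sample titles and hand out sizes in blocks, as sketched below. `assign_sizes` is a hypothetical helper, the 40-pixel entry is invented to complete the example, and letting the last size absorb any rounding remainder is an assumption.

```python
import random


def assign_sizes(titles: list[str], size_proportions: dict[int, float],
                 rng: random.Random) -> list[tuple[str, int]]:
    """Pair every title with a character size according to the given proportions."""
    if not size_proportions:
        raise ValueError("size_proportions must not be empty")
    order = list(range(len(titles)))
    rng.shuffle(order)  # choose which titles receive which size at random
    sizes = [0] * len(titles)
    items = list(size_proportions.items())
    start = 0
    for i, (size, proportion) in enumerate(items):
        if i == len(items) - 1:
            end = len(titles)  # assumption: the last size takes any rounding remainder
        else:
            end = min(len(titles), start + round(proportion * len(titles)))
        for idx in order[start:end]:
            sizes[idx] = size
        start = end
    return list(zip(titles, sizes))


# 50 px -> 30% of 5000 titles follows the example above; 40 px -> 70% is made up.
titles = [f"title {i}" for i in range(5000)]
styled = assign_sizes(titles, {50: 0.30, 40: 0.70}, random.Random(0))
```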

In one embodiment of the present invention, when the second reference information is represented by a normal distribution, a random number conforming to the normal distribution may be generated as a character size, one sample title is then selected from the sample titles, and the character size of that sample title is adjusted to the generated character size. Each sample title corresponds to one randomly generated character size, and the process repeats until the character sizes of all sample titles have been adjusted.

In one embodiment of the present invention, in the case where the character style includes a character font, the character fonts of the obtained sample titles may be adjusted so that the difference between the proportions of adjusted sample titles in different character fonts and the proportions of real titles in different character fonts is reduced.

Specifically, the proportions of real titles in different character fonts may be counted in advance, and the character fonts of the sample titles are adjusted according to these proportions, so that the proportions of adjusted sample titles in different character fonts are close to the proportions of real titles in those fonts. In this way, the character fonts in the sample titles are closer to the character fonts in real titles, and the obtained sample data is closer to real data, which improves the accuracy of the obtained sample data.

For example, assuming that counting 100 real titles yields 30 real titles in a round font, 50 real titles in an italic font and 20 real titles in a roman font, the proportion of real titles in the round font is 30%, the proportion in the italic font is 50% and the proportion in the roman font is 20%. Thus, 30% of the sample titles may be selected and their character font adjusted to the round font; 50% of the sample titles may be selected and their character font adjusted to the italic font; and the character font of the remaining 20% of the sample titles may be adjusted to the roman font.

In one embodiment of the invention, after the proportions of real titles in different character fonts are obtained through statistics, the proportions may be adjusted manually. Specifically, for a character font that accounts for a small proportion of real titles, the amount of sample data corresponding to that font needs to be increased in order to improve the ability of the subsequently trained OCR model to recognize titles in that font, so the proportion corresponding to that font may be increased manually. Likewise, for a character font that accounts for a large proportion of real titles, a worker may lower the proportion corresponding to that font in order to balance the number of samples of different character fonts in the sample data.

In addition, a worker may manually add some character fonts and set proportions for the added fonts. For example, assuming that the 100 counted real titles contain only 5 character fonts while the worker knows from experience that other common character fonts also appear in real titles, the worker may add those other common fonts and set proportions for them. In this way, the ability of the subsequently trained OCR model to recognize titles in various character fonts can be improved.
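
The manual adjustment of font proportions described above can be sketched as an override step followed by renormalization. The helper names, the "gothic" font, and the override values are hypothetical; the 30% / 50% / 20% starting proportions follow the earlier example. Weighted random assignment is used here as an assumption, so the resulting proportions are only approximately equal to the targets, in line with the "close to" wording of the embodiment.

```python
import random


def adjust_font_proportions(measured: dict[str, float],
                            overrides: dict[str, float]) -> dict[str, float]:
    """Apply manual overrides or additions, then rescale so the proportions sum to 1."""
    merged = {**measured, **overrides}
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("proportions must have a positive sum")
    return {font: weight / total for font, weight in merged.items()}


def assign_fonts(titles: list[str], proportions: dict[str, float],
                 rng: random.Random) -> list[tuple[str, str]]:
    """Pick a font for each title with probability equal to its proportion."""
    fonts = list(proportions)
    weights = [proportions[font] for font in fonts]
    chosen = rng.choices(fonts, weights=weights, k=len(titles))
    return list(zip(titles, chosen))


# Hypothetical manual changes: raise "roman" and add a font missing from the statistics.
measured = {"round": 0.30, "italic": 0.50, "roman": 0.20}
adjusted = adjust_font_proportions(measured, {"roman": 0.30, "gothic": 0.10})
styled_fonts = assign_fonts([f"title {i}" for i in range(100)], adjusted, random.Random(0))
```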

In one embodiment of the present invention, in the case where the character style includes a character color, at least one character color may be selected as a reference color from among the character colors present in real titles; for each sample title, the character color of the sample title is set to at least one of the reference colors.

Specifically, the character colors commonly found in real titles, such as red, green, purple, yellow, black, etc., may be counted in advance, and for each sample title, the character color of the sample title may be set to any one of the common character colors. In this way, the character colors in the sample titles are closer to the character colors in real titles, and the obtained sample data is closer to real data, which improves the accuracy of the obtained sample data.

In one embodiment of the present invention, for each sample title, a color may be randomly selected from among the common character colors, and the character color of the sample title is set to the selected color. Character colors may also be distributed evenly across the sample titles. For example, assuming that the common character colors in real titles include red, blue and black, one third of the sample titles may be set to red, one third to blue, and one third to black.
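
The two color assignment strategies above, random and even, can be sketched as follows. The function names are hypothetical, and cycling through the colors in order is one assumed way to give each color an equal share of the sample titles.

```python
import random
from itertools import cycle


def assign_colors_randomly(titles: list[str], colors: list[str],
                           rng: random.Random) -> list[tuple[str, str]]:
    """Give each title one color picked at random from the reference colors."""
    return [(title, rng.choice(colors)) for title in titles]


def assign_colors_evenly(titles: list[str], colors: list[str]) -> list[tuple[str, str]]:
    """Cycle through the reference colors so each color gets an equal share."""
    return list(zip(titles, cycle(colors)))


# Colors taken from the example above; the titles are placeholders.
reference_colors = ["red", "blue", "black"]
nine_titles = [f"title {i}" for i in range(9)]
even = assign_colors_evenly(nine_titles, reference_colors)
randomized = assign_colors_randomly(nine_titles, reference_colors, random.Random(0))
```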

In one embodiment of the present invention, for each sample title, the individual characters contained in the sample title may also be set to different character colors, for example to form a gradient or stripes.
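
As one possible reading of the gradient case, each character can be given a color interpolated between a start color and an end color. The sketch below assumes RGB triples and linear interpolation, neither of which is specified by the embodiment; `gradient_colors` is a hypothetical name.

```python
def gradient_colors(text: str, start_rgb: tuple[int, int, int],
                    end_rgb: tuple[int, int, int]) -> list[tuple[str, tuple[int, ...]]]:
    """Pair each character with an RGB color interpolated from start_rgb to end_rgb."""
    n = len(text)
    colored = []
    for i, ch in enumerate(text):
        t = i / (n - 1) if n > 1 else 0.0  # 0.0 at the first character, 1.0 at the last
        rgb = tuple(round(s + (e - s) * t) for s, e in zip(start_rgb, end_rgb))
        colored.append((ch, rgb))
    return colored


# Hypothetical usage: fade a short title from black to red.
faded = gradient_colors("Title", (0, 0, 0), (255, 0, 0))
```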

Therefore, the character style of the sample title is adjusted, so that the character style of the sample title is close to the character style of the real title, and the sample data can be more close to the real data when the sample data is obtained according to the sample title later, thereby improving the accuracy and the authenticity of the sample data.

Referring to fig. 4, fig. 4 is a flowchart of another sample data obtaining method according to an embodiment of the present invention, where the method includes the following steps 401 to 406.

Step 401, obtaining a plurality of groups of continuous characters from a text corpus obtained in advance based on a first normal distribution of the number of titles of real titles of different character numbers, and taking each group of continuous characters as a sample title.

Wherein, among the obtained continuous characters, the distribution of the group number of the continuous characters of different character numbers conforms to the first normal distribution described above.

Step 402, adjusting the character size of the obtained sample title based on the second normal distribution of the title number of the real titles with different character sizes.

Wherein, the distribution of the number of the sample titles with different character sizes among the adjusted sample titles conforms to the second normal distribution.

Step 403, adjusting the character fonts of the obtained sample titles based on the proportions of real titles in different character fonts.

The proportion of the sample titles of the different character fonts after adjustment is close to that of the real titles of the different character fonts in the real titles.

Step 404, selecting at least one character color from the character colors present in real titles as a reference color, and setting the character color of each sample title to at least one of the reference colors.

Step 405, for each sample title, superimposing the sample title on a preset background image to obtain a composite image, and cropping the image area containing the sample title from the composite image.

Step 406, preprocessing each image area, and using the preprocessed image area as sample data.

The preprocessing is used to make each image area more closely resemble a real image containing a real title.
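
The geometry behind step 405 can be sketched without any imaging library: choose where the rendered title sits on the background, then compute the box to crop. The random placement, the `margin` parameter, and the name `place_and_crop_box` are all assumptions; the embodiment only states that the title is superimposed on a preset background and that the area containing it is cropped.

```python
import random


def place_and_crop_box(bg_size: tuple[int, int], text_size: tuple[int, int],
                       rng: random.Random, margin: int = 4):
    """Return the title's top-left corner and a (left, top, right, bottom) crop box."""
    bg_w, bg_h = bg_size
    text_w, text_h = text_size
    if text_w > bg_w or text_h > bg_h:
        raise ValueError("rendered title does not fit on the background")
    # Assumption: the title is placed at a random position that keeps it fully visible.
    x = rng.randint(0, bg_w - text_w)
    y = rng.randint(0, bg_h - text_h)
    # Assumption: keep a small margin of background around the title, clipped to the image.
    left, top = max(0, x - margin), max(0, y - margin)
    right, bottom = min(bg_w, x + text_w + margin), min(bg_h, y + text_h + margin)
    return (x, y), (left, top, right, bottom)


# Hypothetical sizes: a 640x360 background and a title rendered at 200x48 pixels.
origin, crop_box = place_and_crop_box((640, 360), (200, 48), random.Random(0))
```

The returned box could then be passed to whatever image library performs the actual compositing and cropping.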

Referring to fig. 5, fig. 5 is a schematic flow chart of a model training method according to an embodiment of the present invention, and the method includes the following steps 501 to 507.

Step 501, obtaining a plurality of groups of continuous English characters from a pre-obtained English text corpus based on a first normal distribution of the number of real English titles with different numbers of characters, and taking each group of continuous English characters as a sample English title.

The characters contained in the English text corpus are English characters, and the English text corpus may be English novels, English newspapers, etc.

Step 502, adjusting the character sizes of the obtained sample English titles based on a second normal distribution of the number of real English titles with different character sizes.

Wherein, among the adjusted sample English titles, the distribution of the number of sample English titles with different character sizes conforms to the second normal distribution.

Step 503, based on the proportion of the real English titles of different character fonts in the real English titles, the character fonts of the obtained sample English titles are adjusted.

The proportion of the sample English titles of the different character fonts after adjustment is close to that of the real English titles of the different character fonts in the real English titles.

Step 504, selecting at least one character color from the character colors existing in the real English titles as a reference color, and setting the character color of each sample English title as at least one color in the reference color.

Step 505, for each sample English title, superimposing the sample English title on a preset background image to obtain a composite image, and cropping the image area containing the sample English title from the composite image.

Step 506, preprocessing each image area, and taking the preprocessed image area as sample English data.

Step 507, training the initial OCR model with the sample English data as input and the sample English titles as labels, to obtain a trained OCR model for recognizing English titles.

When the scheme provided by the embodiment is applied to obtain sample data, a plurality of groups of continuous characters are directly selected from the text corpus, and then the sample data can be obtained according to the selected characters, so that the consumption of manpower and material resources is low. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the sample data obtaining efficiency can be improved.

In addition, the number of characters, the character size, the character font and the character color of the sample title are obtained by referring to the number of characters, the character size, the character font and the character color of the real title, so that the obtained sample title is close to the real title, and further sample data obtained according to the sample title is close to the real data. Therefore, by applying the sample data obtaining scheme provided by the embodiment, the accuracy of the obtained sample data can be improved.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a sample data obtaining apparatus according to an embodiment of the present invention, where the apparatus includes:

the sample title obtaining module 601 is configured to obtain a plurality of groups of continuous characters from a pre-obtained text corpus, and use each group of continuous characters as a sample title, where first distribution information corresponding to the obtained sample titles is consistent with first reference information, the first distribution information characterizes the title number distribution of sample titles with different numbers of characters, and the first reference information characterizes the title number distribution of real titles with different numbers of characters;

a character style adjustment module 602, configured to adjust a character style of the obtained sample title with reference to a character style of the real title;

the sample data obtaining module 603 is configured to obtain, for each adjusted sample title, an image whose content includes the adjusted sample title as sample data.

In one embodiment of the present invention, in the case that the character style includes a character size, the character style adjustment module 602 is specifically configured to:

In one embodiment of the present invention, in the case that the character style includes a character font, the character style adjustment module 602 is specifically configured to:

In one embodiment of the present invention, in the case that the character style includes a character color, the character style adjustment module 602 is specifically configured to:

In one embodiment of the present invention, the sample data obtaining module 603 is specifically configured to:

The embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 perform communication with each other through the communication bus 704,

a memory 703 for storing a computer program;

the processor 701 is configured to implement the steps of the sample data obtaining method when executing the program stored in the memory 703.

The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above electronic device and other devices.

The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer readable storage medium is provided, in which instructions are stored, which when run on a computer, cause the computer to perform the sample data obtaining method according to any one of the above embodiments.

In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the sample data obtaining method of any of the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.

It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A method of obtaining sample data, the method comprising:

based on the character style of the real title, adjusting the character style of the obtained sample title;

for each adjusted sample title, obtaining an image with content containing the adjusted sample title as sample data, wherein the image contains a sub-image corresponding to the adjusted sample title;

in the case that the character style includes a character size, the adjusting the character style of the obtained sample title based on the character style of the real title includes:

2. The method according to claim 1, wherein, in the case where the character style includes a character font, the adjusting the character style of the obtained sample title based on the character style of the real title includes:

3. The method according to claim 1, wherein, in the case where the character style includes a character color, the adjusting the character style of the obtained sample title based on the character style of the real title includes:

4. A method according to any one of claims 1-3, wherein for each adjusted sample title, obtaining an image containing the adjusted sample title as sample data comprises:

5. A method according to any one of claims 1-3, wherein for each adjusted sample title, obtaining an image containing the adjusted sample title as sample data comprises:

6. A sample data obtaining apparatus, the apparatus comprising:

a sample data obtaining module, configured to obtain, for each adjusted sample title, an image whose content includes the adjusted sample title, as sample data, where the image includes a sub-image corresponding to the adjusted sample title;

in the case that the character style includes a character size, the character style adjustment module is specifically configured to:

7. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.