CN116168396A - Character recognition device and character recognition method - Google Patents

Character recognition device and character recognition method

Info

Publication number
CN116168396A
CN116168396A (application CN202211328928.3A)
Authority
CN
China
Prior art keywords
image
images
text
position information
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211328928.3A
Other languages
Chinese (zh)
Inventor
廖国波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Super Times Software Co ltd
Original Assignee
Shenzhen Super Times Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Super Times Software Co ltd filed Critical Shenzhen Super Times Software Co ltd
Priority to CN202211328928.3A priority Critical patent/CN116168396A/en
Publication of CN116168396A publication Critical patent/CN116168396A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G06V30/162 Quantising the image signal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a character recognition device and a character recognition method, comprising the following steps. S1: acquiring a plurality of video frames; S2: creating a plurality of CPU processes, decoding each video frame with a CPU process to obtain a plurality of first images, and storing the first images in a database; S3: sequentially acquiring the first images from the database with the CPU processes and normalizing their size; S4: determining the position information of the text in each first image in turn, obtaining the position information corresponding to each first image; S5: segmenting each first image according to its position information to obtain a plurality of image blocks; S6: performing word recognition on each image block to obtain a word recognition result, so that important text information in the video can be understood more clearly.

Description

Character recognition device and character recognition method
Technical Field
The present invention relates to the field of character recognition, and more particularly, to a character recognition device and a character recognition method.
Background
Character recognition is an important intelligent recognition technology. Text has the advantage of being convenient to store and transmit, so information can spread rapidly across time and space. With the development of computer technology, mathematics, and image technology, character recognition has come to mean using a computer to recognize, at high speed, the digits, English symbols, or Chinese characters on existing media (such as paper). Video images in electronic devices such as smart phones often contain a large amount of information; besides pictures, a video image may contain text, and that text usually displays important information about the video content. Compared with diverse image information, text information can be recognized and analysed, and combining the recognized text makes it easier to understand what the video is playing. The present invention therefore provides a character recognition device and a character recognition method.
Disclosure of Invention
In order to solve the above problems, the present invention proposes a text recognition device and a text recognition method, addressing the fact that video images in electronic devices such as smart phones often contain a large amount of information: besides image frames, a video image may also contain text information, and that text information usually displays important information about the video playing content.
The invention is realized by the following technical scheme:
the invention provides a character recognition method, which comprises the following steps:
s1: acquiring a plurality of video frames;
s2: creating a plurality of CPU processes, decoding each video frame by utilizing each CPU process to obtain a plurality of first images, and storing the plurality of first images into a database;
s3: sequentially acquiring a first image from the database by utilizing the plurality of CPU processes, and normalizing the size of the first image;
s4: determining the position information of the text from each first image in sequence to obtain the position information corresponding to each first image;
s5: dividing each first image according to the position information corresponding to each first image to obtain a plurality of image blocks;
s6: and carrying out character recognition processing on each image block to obtain a character recognition result.
Further, storing the plurality of first images in the database means that each CPU process stores the first image it obtains in its shared memory and stores the image's identification information in a first queue.
Further, the size normalization of the first image includes binarization processing, which converts a grey-level image into a binary image. The pixels after binarization are recorded as:
G_{M×N} = (P_{i,j}), 1 ≤ i ≤ M, 1 ≤ j ≤ N,
where M and N are the length and width of G, and P_{i,j} is the pixel in row i and column j; P_{i,j} = 1 denotes a black pixel and P_{i,j} = 0 a white pixel. The text image is written G for short.
Further, the tile obtained after segmenting each first image is recorded as:
g_{m×n} = (P_{a,b}), P_{a,b} ∈ G, 1 ≤ a ≤ m, 1 ≤ b ≤ n,
where m and n are the length and width of the tile and G is the text image. Within the tile g, the ratio of the number of pixels with value 1 to the total number of pixels is called the gray value of the tile, expressed as:
P(g) = (Σ_{a=1}^{m} Σ_{b=1}^{n} P_{a,b} / (m × n)) × 100%.
If all the pixels in g are 1, g is set to 1; if they are all 0, g is set to 0.
Further, the text recognition processing is performed on each tile by feature extraction, and the extracted features are classified by a Bayesian classifier.
Further, the features used by the Bayesian classifier are divided into character shape and structure features and character content features; the shape and structure features comprise character height, width, inter-character distance, coverage rate, height-to-width ratio, and longitudinal start position, and the content features comprise 16-dimensional directional pixel features.
Further, after the text result is obtained, word segmentation is performed to obtain a plurality of segmented words; a target keyword is determined from the words, the category corresponding to the target keyword is determined according to a preset mapping between keywords and categories, and that category is taken as the category of the video to be classified.
A text recognition device comprising:
the acquisition module is used for: acquiring a plurality of video frames;
and (3) a building module: creating a plurality of CPU processes, decoding each video frame by utilizing each CPU process to obtain a plurality of first images, and storing the plurality of first images into a database;
the processing module is used for: sequentially acquiring first images from the database by using the CPU processes, normalizing the sizes of the first images, sequentially determining the position information of the characters from each first image, and obtaining the position information corresponding to each first image;
and a segmentation module: dividing each first image according to the position information corresponding to each first image to obtain a plurality of image blocks;
and a recognition module: performing template matching based on knowledge of the characters of each language;
and a determination module: and carrying out character recognition processing on each image block to obtain a character recognition result.
Further, the segmentation module includes:
the video frame dividing subunit is used for dividing the video frame into a plurality of target fragments with equal time length according to preset time length and recording the volume of each target fragment;
a first determining subunit, configured to determine, as one of the audio segments in the first content, a plurality of target segments that are continuous and have a volume less than a preset threshold;
and the second determining subunit is used for determining a plurality of continuous target fragments with the volume larger than or equal to the preset threshold value as one audio fragment in the second content.
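As a sketch of how these two determining subunits might behave (the function name and data layout are illustrative, not taken from the patent), consecutive equal-length target segments can be grouped by comparing each segment's recorded volume with the preset threshold:

```python
def split_by_volume(volumes, threshold):
    """Group consecutive equal-length target segments into audio pieces.

    Runs of segments whose volume is below `threshold` form the pieces of
    the first content; runs at or above it form the pieces of the second
    content, mirroring the first and second determining subunits.
    Returns two lists of runs, each run a list of segment indices.
    """
    first, second = [], []
    run, run_quiet = [], None
    for idx, vol in enumerate(volumes):
        quiet = vol < threshold
        if run and quiet != run_quiet:
            # the run just ended; file it under first or second content
            (first if run_quiet else second).append(run)
            run = []
        run.append(idx)
        run_quiet = quiet
    if run:
        (first if run_quiet else second).append(run)
    return first, second
```

For example, with volumes [1, 1, 9, 9, 1] and a threshold of 5, segments 0-1 and 4 form two quiet pieces while segments 2-3 form one loud piece.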
Further, the determining module includes a display unit, configured to display text information corresponding to the plurality of video frames.
The invention has the beneficial effects that:
according to the character recognition device and the character recognition method, the plurality of video frames are obtained, the video frames are decoded to obtain the plurality of first images, the plurality of first images are stored in the database, the first images are normalized in size, the position information of the characters is determined, each first image is segmented according to the position information of the characters, character recognition is carried out by utilizing feature extraction to obtain the character result, and important character information in the video can be more clearly known.
Drawings
FIG. 1 is a flow chart of a text recognition method according to the present invention;
fig. 2 is a schematic structural diagram of the character recognition device of the present invention.
Detailed Description
In order to more clearly and completely describe the technical scheme of the invention, the invention is further described below with reference to the accompanying drawings.
Referring to fig. 1-2, the present invention provides a text recognition device and a text recognition method.
In this embodiment, the character recognition method includes:
s1: acquiring a plurality of video frames;
s2: creating a plurality of CPU processes, decoding each video frame by utilizing each CPU process to obtain a plurality of first images, and storing the plurality of first images into a database;
s3: sequentially acquiring a first image from a database by using a plurality of CPU processes, and normalizing the size of the first image;
s4: determining the position information of the text from each first image in sequence to obtain the position information corresponding to each first image;
s5: dividing each first image according to the position information corresponding to each first image to obtain a plurality of image blocks;
s6: and carrying out character recognition processing on each image block to obtain a character recognition result.
First, a plurality of video frames are acquired. For example, the electronic device may acquire a video and decompose it into a plurality of video frames.

Next, a plurality of CPU processes are created, and each CPU process is used to decode a video frame, yielding a plurality of first images that are stored in a database. The number of CPU processes may be less than or equal to the number of video frames. When it is less, for example with 5 CPU processes and 10 video frames, the electronic device may use the 5 processes to decode the first 5 of the 10 frames, or any 5 of them, to obtain the first images. When a CPU process finishes decoding its video frame, the electronic device uses that process to store the resulting first image in the database.

The CPU processes then sequentially acquire the first images from the database and size-normalize them. For example, when at least one first image exists in the database, the electronic device can create a CPU process, use it to fetch the first images in turn, and size-normalize each one as it is fetched. Font size normalization scales the actually extracted characters to a character map of a preset size. Because Chinese characters appear in many styles and fonts, the features of the same character vary; normalizing the size of the acquired first images allows the features of the same character to be described and extracted uniformly, so that characters of different styles and fonts can be recognized, laying a foundation for Chinese character recognition.

Next, the position information of the text is determined from each first image in turn, obtaining position information corresponding to each first image. Each time a CPU process obtains a first image, the electronic device uses that process to run position detection on it, determining where text lies in the image. Position detection may consist of detecting the regions of the image that contain text, or of edge detection. Edges occur mainly between one object and another, between an object and the background, and between regions of differing colour; edge detection is an important basis for image segmentation, and edges are usually related to discontinuities in image brightness or in its first derivative, which can be divided into step discontinuities and line discontinuities. The electronic device may run a pre-trained position detection model in a CPU process. Before the position information is obtained, each CPU process may decode its video frame and then preprocess the result, for example converting the image's format and size into those supported by the position detection model, before storing the first images in the database for the CPU processes to fetch.

Each first image is then segmented according to its position information into a plurality of image blocks. Each time the position information for one first image is obtained, the electronic device segments that image according to it; that is, whichever first image a piece of position information came from is the image segmented with it, for example by cutting out the regions of the image that contain text.

Finally, word recognition is performed on each image block to obtain a word recognition result, which comprises a result for each block. Because the image blocks are not necessarily obtained at the same time, the electronic device can run word recognition on each block as it arrives, eventually obtaining the word recognition results of all the image blocks; it can then store these results and use them for video classification, video pushing, and the like.
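The parallel decoding of steps S1-S2 could be sketched with a Python process pool as follows; `decode_frame` is only a placeholder for real video decoding, and all names here are illustrative rather than from the patent:

```python
from multiprocessing import Pool

def decode_frame(frame):
    """Stand-in for per-frame decoding: a real pipeline would turn the
    compressed video frame into a raw first image here."""
    return {"frame": frame, "image": f"decoded-{frame}"}

def decode_frames(frames, n_processes):
    """Create n CPU processes and decode the frames in parallel.

    With fewer processes than frames (e.g. 5 processes, 10 frames), the
    pool hands each idle worker the next undecoded frame, so all frames
    are eventually covered, matching the behaviour described above.
    """
    with Pool(processes=n_processes) as pool:
        return pool.map(decode_frame, frames)
```

For instance, `decode_frames(list(range(10)), 5)` decodes 10 frames with 5 worker processes.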
In one embodiment, storing the plurality of first images in the database means that each CPU process stores the first image it obtains in its shared memory and stores the image's identification information in a first queue.
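One plausible realisation of the shared-memory-plus-first-queue scheme, sketched with Python's `multiprocessing.shared_memory`; the metadata layout and function names are assumptions for illustration, not taken from the patent:

```python
from multiprocessing import Queue, shared_memory

import numpy as np

def store_first_image(image, frame_id, queue):
    """Store a decoded first image in shared memory and push its
    identification information onto the first queue."""
    shm = shared_memory.SharedMemory(create=True, size=image.nbytes)
    buf = np.ndarray(image.shape, dtype=image.dtype, buffer=shm.buf)
    buf[:] = image  # copy the decoded frame into the shared block
    queue.put({"shm_name": shm.name, "shape": image.shape,
               "dtype": str(image.dtype), "frame_id": frame_id})
    return shm  # producer keeps a handle so the block stays alive

def load_first_image(meta):
    """A consumer process looks the shared block up by name,
    copies the image out, and releases the block."""
    shm = shared_memory.SharedMemory(name=meta["shm_name"])
    img = np.ndarray(meta["shape"], dtype=meta["dtype"],
                     buffer=shm.buf).copy()
    shm.close()
    shm.unlink()
    return img
```

The queue carries only small identification records, so heavy pixel data never passes through it; consumers fetch pixels directly from shared memory.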
In one embodiment, size-normalizing the first image includes binarization processing, which converts a grey-level image into a binary image. The pixels after binarization are recorded as:
G_{M×N} = (P_{i,j}), 1 ≤ i ≤ M, 1 ≤ j ≤ N,
where M and N are the length and width of G, and P_{i,j} is the pixel in row i and column j; P_{i,j} = 1 denotes a black pixel and P_{i,j} = 0 a white pixel. The text image is written G for short.
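A minimal binarization sketch matching the G_{M×N} = (P_{i,j}) notation above; the fixed threshold of 128 is an illustrative choice (Otsu's method could be substituted), not a value stated in the patent:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Convert a grey-level image of size M x N into a binary image,
    with P[i, j] = 1 for black (text) pixels and 0 for white ones."""
    # darker-than-threshold pixels are treated as black text strokes
    return (np.asarray(gray) < threshold).astype(np.uint8)
```

For example, a grey image [[0, 255], [100, 200]] binarizes to [[1, 0], [1, 0]].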
In one embodiment, the tile obtained after segmenting each first image is recorded as:
g_{m×n} = (P_{a,b}), P_{a,b} ∈ G, 1 ≤ a ≤ m, 1 ≤ b ≤ n,
where m and n are the length and width of the tile and G is the text image. Within the tile g, the ratio of the number of pixels with value 1 to the total number of pixels is called the gray value of the tile, expressed as:
P(g) = (Σ_{a=1}^{m} Σ_{b=1}^{n} P_{a,b} / (m × n)) × 100%.
If all the pixels in g are 1, g is set to 1; if they are all 0, g is set to 0.
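The gray value formula and the all-1/all-0 collapsing rule above can be sketched as follows (function names are illustrative):

```python
import numpy as np

def block_gray_value(g):
    """Gray value of a tile g_{m x n}: the ratio of pixels equal to 1
    to all m*n pixels, expressed as a percentage."""
    g = np.asarray(g)
    m, n = g.shape
    return float(g.sum()) / (m * n) * 100.0

def quantize_block(g):
    """Collapse an all-1 tile to 1 and an all-0 tile to 0, per the text;
    a mixed tile is left as-is (None signals no collapse here)."""
    value = block_gray_value(g)
    if value == 100.0:
        return 1
    if value == 0.0:
        return 0
    return None
```

For instance, a 2×2 tile with two black pixels has a gray value of 50%.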
In one embodiment, the text recognition processing for each tile is performed by feature extraction, with a Bayesian classifier making the decision. Assuming the components of the feature vector are mutually independent given the class, for a feature vector X = [x_1, x_2, ..., x_d] the conditional probability that it belongs to class C_i is
P(C_i | X) = P(X | C_i) · P(C_i) / P(X).
A conditional probability is calculated for each class, and the final recognition result is the class with the highest conditional probability.
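The decision rule can be illustrated with a minimal naive Bayes sketch; the likelihood tables here are illustrative stand-ins (a real system would estimate them from training data), and P(X) is dropped because it is a common factor across classes:

```python
import math

def classify(x, priors, likelihoods):
    """Pick the class C_i maximising P(C_i | X) ∝ P(X | C_i) · P(C_i),
    under the naive assumption that feature components are conditionally
    independent given the class. `likelihoods` maps each class to a list
    of per-feature probability functions."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # log P(C_i) + sum_j log P(x_j | C_i)
        score = math.log(prior)
        for j, xj in enumerate(x):
            score += math.log(likelihoods[c][j](xj))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space avoids underflow when many feature likelihoods are multiplied.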
In one embodiment, the features used by the Bayesian classifier are divided into character shape and structure features and character content features. The shape and structure features comprise character height, width, inter-character distance, coverage rate, height-to-width ratio, and longitudinal start position; the content features comprise 16-dimensional directional pixel features. Among the first group, all features except coverage rate and aspect ratio require normalization. The errors found most likely to occur in character segmentation are: components of Chinese characters, such as radicals, being cut off individually as if they were English letters, numerals, or punctuation, for example "八" (eight), "儿" (child), or "川" (chuan); and English letters, numerals, or punctuation being cut together with Chinese characters.
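Several of the shape and structure features named above can be computed from a single binary tile; this sketch (names illustrative, assuming the tile contains at least one black pixel) covers height, width, coverage rate, and aspect ratio, while inter-character distance and longitudinal start position would need neighbouring tiles and line context:

```python
import numpy as np

def shape_features(tile):
    """Height, width, coverage rate, and height-to-width ratio of the
    ink bounding box in a binary tile (1 = black text pixel)."""
    tile = np.asarray(tile)
    ys, xs = np.nonzero(tile)  # coordinates of black pixels
    height = int(ys.max() - ys.min() + 1)
    width = int(xs.max() - xs.min() + 1)
    coverage = float(tile.sum()) / tile.size
    return {"height": height, "width": width,
            "coverage": coverage, "aspect": height / width}
```

A narrow component such as a lone radical yields a large aspect ratio and low coverage, which is exactly the signal that helps catch the segmentation errors described above.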
In an embodiment, after the text result is obtained, word segmentation is performed to obtain a plurality of segmented words; a target keyword is determined from the words, the category corresponding to the target keyword is determined according to a preset mapping between keywords and categories, and that category is taken as the category of the video to be classified. The text recognition result obtained by the electronic device merely extracts the text in the image without segmenting it. For example, the recognition result might be: "Let children explore graceful, magical Chinese characters while listening to stories and playing games." The electronic device can then segment this result into words such as: let, children, listen, story, play, explore, graceful, magical, Chinese, characters.

After obtaining the segmented words, the electronic device may determine the category of the video to be classified from them. For example, if words such as song, lyrics, and singing appear many times among the segments, the category may be determined as the song category; likewise, if the electronic device can identify from the segments the lyrics of the song they belong to, it may determine the category as the song category. The electronic device may also determine the type of the lyrics, such as ancient-style, pop, or rock; if it determines the lyrics are ancient-style, it may place the video in the ancient-style type under the song category.

To determine the target keyword, the electronic device may find the identical words among the segments, count them, and take the most frequent one as the target keyword. For instance, if 10 segments are obtained, of which "song" appears 7 times, "style" 2 times, and "graceful" once, the electronic device may determine "song" as the target keyword and classify the video into the song category.

The category corresponding to the target keyword is then determined from the preset mapping between keywords and categories. With a preset mapping R1, for example keyword K1 corresponding to category C1, K2 to C2, K3 to C3, and so on, a target keyword K1 gives category C1 for the video to be classified. With a preset mapping R2 in which keywords K1, K2, K3 correspond to category C1, keywords K4, K5, K6 to C2, and keywords K7, K8, K9 to C3, a target keyword K3 likewise gives category C1.
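The keyword counting and preset mapping described above can be sketched as follows; the mapping contents are illustrative, not from the patent:

```python
from collections import Counter

def classify_video(words, keyword_to_category):
    """Take the most frequent segmented word that appears in the preset
    keyword-to-category mapping as the target keyword, and return that
    keyword's category as the category of the video to be classified."""
    counts = Counter(w for w in words if w in keyword_to_category)
    if not counts:
        return None  # no known keyword among the segments
    target_keyword, _ = counts.most_common(1)[0]
    return keyword_to_category[target_keyword]
```

With the patent's example of 10 segments containing "song" 7 times, "style" twice, and "graceful" once, "song" becomes the target keyword and its category is returned.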
In this embodiment, the character recognition apparatus includes:
the acquisition module is used for: acquiring a plurality of video frames;
and (3) a building module: creating a plurality of CPU processes, decoding each video frame by utilizing each CPU process to obtain a plurality of first images, and storing the plurality of first images into a database;
the processing module is used for: sequentially acquiring first images from a database by using a plurality of CPU processes, normalizing the sizes of the first images, sequentially determining the position information of characters from each first image, and obtaining the position information corresponding to each first image;
and a segmentation module: dividing each first image according to the position information corresponding to each first image to obtain a plurality of image blocks;
and a recognition module: performing template matching based on knowledge of the characters of each language;
and a determination module: and carrying out character recognition processing on each image block to obtain a character recognition result.
An acquisition module acquires a plurality of video frames, for example: the electronic device may acquire a video and decompose the video into a plurality of video frames, so that the electronic device obtains a plurality of video frames.

A building module creates a plurality of CPU processes, decodes each video frame by utilizing each CPU process to obtain a plurality of first images, and stores the plurality of first images in a database, for example: after obtaining the plurality of video frames, the electronic device may create a plurality of CPU processes and decode each video frame with a CPU process to obtain the plurality of first images. The number of CPU processes may be less than or equal to the number of video frames. When the number of CPU processes is less than the number of video frames, for example 5 CPU processes and 10 video frames, the electronic device may use the 5 CPU processes to decode the first 5 of the 10 video frames, or any 5 of them, to obtain first images. When a CPU process finishes decoding an acquired video frame, the electronic device may use that CPU process to store the resulting first image in the database.

A processing module sequentially acquires first images from the database by utilizing the plurality of CPU processes and normalizes the size of each first image, for example: when at least one first image exists in the database, the electronic device may create a CPU process, use the CPU process to acquire first images from the database in turn, and perform size normalization each time a first image is acquired. Size normalization scales the actually extracted characters so that a character map of a preset size is finally obtained. Because Chinese character styles and fonts are various, the features of the same Chinese character can differ; in order to describe and extract the features of the same Chinese character uniformly, and to recognize Chinese characters of different styles and fonts, the acquired first images must undergo a size normalization operation, which lays a foundation for the Chinese character recognition work.

The processing module also determines the position information of the text from each first image in turn, obtaining the position information corresponding to each first image. Each time the CPU process obtains a first image, the electronic device may use the CPU process to perform position detection on the first image, determining the position information of the text in it. Position detection on an image may be: detecting the areas of the image in which text exists, so as to confirm which areas contain text; or edge detection. Edges mainly exist between one object and another, between an object and the background, and between areas of different (or the same) color, and edge detection is an important basis for image segmentation. Edges in an image are usually related to discontinuities in image brightness or in its first derivative, and discontinuities in image brightness can be divided into step discontinuities and line discontinuities. The electronic device may use the CPU process, with a pre-trained position detection model, to perform position detection on the acquired first image. Before obtaining the position information corresponding to an acquired first image, the electronic device may decode each video frame in the plurality of video frames by utilizing each CPU process in the plurality of CPU processes, then preprocess the results to obtain the plurality of first images, and store the first images in the database, from which the electronic device may then sequentially acquire them using the CPU process. Preprocessing an image may be: converting the format, size, and the like of the image into a format, size, and the like supported by the position detection model.

A segmentation module segments each first image according to the position information corresponding to it to obtain a plurality of image blocks. The electronic device may segment the corresponding first image according to the position information, which can be understood as segmenting the first image according to the position information the electronic device obtained from that first image. Segmenting an image may be: cutting out the areas of the image in which characters exist.

A recognition module performs template matching according to knowledge of the characters of each language.

A determination module performs character recognition processing on each image block to obtain a character recognition result, for example: after obtaining the plurality of image blocks, the electronic device may perform character recognition processing on each image block, where the character recognition result comprises the result corresponding to each image block. Because the plurality of image blocks are not necessarily obtained at the same time, each time one image block is obtained, the electronic device may perform character recognition processing on that image block, so that the recognition results of all the image blocks are finally obtained. The electronic device may then store the character recognition results and use them for video classification, video pushing, and the like.
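The decode-store-normalize flow described above can be sketched as follows. This is a minimal illustration, not the patented implementation: a thread-backed pool stands in for the CPU processes, an in-memory queue stands in for the database, and the decode step and the preset target size (32×32) are placeholder assumptions.

```python
from multiprocessing.dummy import Pool  # thread-backed stand-in for CPU processes
from queue import Queue

TARGET_SIZE = (32, 32)  # hypothetical preset character-map size

def decode_frame(frame_id):
    # Placeholder decode: a real pipeline would decode compressed frame
    # data here (e.g. with an image/video library) to obtain a first image.
    return {"frame_id": frame_id}

def normalize(image, size=TARGET_SIZE):
    # Size normalization: scale the extracted image to the preset size.
    image["size"] = size
    return image

def run_pipeline(frames, n_procs=5):
    db = Queue()  # in-memory stand-in for the database
    with Pool(n_procs) as pool:
        # With fewer workers than frames, the pool still consumes every
        # frame: each worker picks up the next frame as it becomes free.
        for image in pool.imap(decode_frame, frames):
            db.put(image)  # each worker stores its result in the "database"
    normalized = []
    while not db.empty():
        # A consumer acquires first images from the database in turn and
        # normalizes each one as it is acquired.
        normalized.append(normalize(db.get()))
    return normalized
```

As in the 5-process/10-frame example above, calling `run_pipeline(list(range(10)), n_procs=5)` decodes all ten frames with five workers.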
In one embodiment, the segmentation module includes:
a video frame dividing subunit, configured to divide the video frame into a plurality of target segments of equal duration according to a preset duration, and to record the volume of each target segment;
a first determining subunit, configured to determine a plurality of consecutive target segments whose volume is smaller than a preset threshold as one of the audio segments in the first content;
and a second determining subunit, configured to determine a plurality of consecutive target segments whose volume is greater than or equal to the preset threshold as one of the audio segments in the second content.
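The three subunits above amount to a single grouping pass over the per-segment volumes. The sketch below is illustrative only, not the claimed implementation; the volume values and threshold are hypothetical.

```python
def group_segments(volumes, threshold):
    """Group consecutive equal-duration target segments into audio pieces:
    runs with volume below the threshold become 'first' content, runs at
    or above it become 'second' content."""
    pieces = []
    current, quiet = [], None
    for v in volumes:
        is_quiet = v < threshold
        if quiet is None or is_quiet == quiet:
            current.append(v)          # extend the current run
        else:
            # volume crossed the threshold: close the run and start a new one
            pieces.append(("first" if quiet else "second", current))
            current = [v]
        quiet = is_quiet
    if current:
        pieces.append(("first" if quiet else "second", current))
    return pieces
```

For example, `group_segments([1, 1, 9, 9, 2], threshold=5)` yields one quiet run, one loud run, and a final quiet run.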
In an embodiment, the determining module includes a display unit, configured to display text information corresponding to a plurality of video frames.
The character recognition device in this embodiment of the present application may be an electronic device, or may be a component in an electronic device, for example an integrated circuit or a chip. The electronic device may be a terminal, or a device other than a terminal, for example: a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile internet device, an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer, a netbook, or a personal digital assistant; or it may be a server, network attached storage, a personal computer, a television, a teller machine, or a self-service machine, which is not limited in this embodiment.
Of course, the present invention can be implemented in various other embodiments, and based on this embodiment, those skilled in the art can obtain other embodiments without any inventive effort, which fall within the scope of the present invention.

Claims (10)

1. The character recognition method is characterized by comprising the following steps:
s1: acquiring a plurality of video frames;
s2: creating a plurality of CPU processes, decoding each video frame by utilizing each CUP process to obtain a plurality of first images, and storing the plurality of first images into a database;
s3: sequentially acquiring a first image from the database by utilizing the plurality of CPU processes, and normalizing the size of the first image;
s4: determining the position information of the text from each first image in sequence to obtain the position information corresponding to each first image;
s5: dividing each first image according to the position information corresponding to each first image to obtain a plurality of image blocks;
s6: and carrying out character recognition processing on each image block to obtain a character recognition result.
2. The character recognition method of claim 1, wherein storing the plurality of first images in the database comprises: each CPU process storing the first image it obtains in its shared memory, and storing the identification information of that first image in a first queue.
3. The character recognition method of claim 1, wherein normalizing the size of the first image comprises a binarization process that converts a gray-level image into a binary image, the pixel points after binarization being denoted G_{M×N} = (P_{i,j}) (1 ≤ i ≤ M, 1 ≤ j ≤ N),
where M and N are the length and width of G, and P_{i,j} is the pixel point at row i, column j; P_{i,j} = 1 represents a black pixel and P_{i,j} = 0 represents a white pixel, and the text image is denoted simply as G.
4. The character recognition method of claim 1, wherein a block obtained after segmenting each first image is denoted g_{m×n} = (P_{a,b} ∈ G, 1 ≤ a ≤ m, 1 ≤ b ≤ n),
where m and n are respectively called the length and width of the block, and G is the text image. In the block g, the ratio of the number of pixels with value 1 to the number of all pixels is called the gray value of the block, expressed as:
P(g) = (ΣP_{i,j} / (m × n)) × 100% (1 ≤ i ≤ m, 1 ≤ j ≤ n);
if all pixel points in g are 1, g is set to 1, and if all are 0, g is set to 0.
5. The character recognition method of claim 1, wherein the character recognition processing of each image block is performed by a feature extraction method using a Bayesian classifier.
6. The method of claim 5, wherein the features used by the Bayesian classifier are divided into character shape and structure features, including character height, width, inter-character distance, coverage, aspect ratio, and vertical start position, and character content features, including 16-dimensional directional pixel features.
7. The character recognition method according to claim 1, wherein after the character recognition result is obtained, word segmentation is performed to obtain a plurality of segmented words; a target keyword is determined from the segmented words; the category corresponding to the target keyword is determined according to a preset mapping relationship between keywords and categories; and that category is determined as the category of the video to be classified.
8. The character recognition device is characterized by comprising:
the acquisition module is used for: acquiring a plurality of video frames;
and a building module: creating a plurality of CPU processes, decoding each video frame by utilizing each CPU process to obtain a plurality of first images, and storing the plurality of first images into a database;
the processing module is used for: sequentially acquiring first images from the database by using the CPU processes, normalizing the sizes of the first images, sequentially determining the position information of the characters from each first image, and obtaining the position information corresponding to each first image;
and a segmentation module: dividing each first image according to the position information corresponding to each first image to obtain a plurality of image blocks;
and an identification module: performing template matching according to knowledge of the characters of each language;
and a determination module: and carrying out character recognition processing on each image block to obtain a character recognition result.
9. The word recognition device of claim 8, wherein the segmentation module comprises:
the video frame dividing subunit is used for dividing the video frame into a plurality of target fragments with equal time length according to preset time length and recording the volume of each target fragment;
a first determining subunit, configured to determine, as one of the audio segments in the first content, a plurality of target segments that are continuous and have a volume less than a preset threshold;
and the second determining subunit is used for determining a plurality of continuous target fragments with the volume larger than or equal to the preset threshold value as one audio fragment in the second content.
10. The text recognition device of claim 8, wherein the determination module includes a display unit configured to display text information corresponding to a plurality of video frames.
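The binarization of claim 3 and the block gray value of claim 4 can be sketched as follows. This is an illustrative sketch only: the gray-level threshold of 128 and the "darker than threshold means black" convention are assumptions, since the claims do not specify how the binary value is derived from the gray level.

```python
def binarize(gray, threshold=128):
    """Convert a gray-level image into a binary image G = (P_ij).
    Per claim 3's convention, 1 marks a black (text) pixel and 0 a white
    pixel; the threshold of 128 is an assumed rule, not stated in the claim."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]

def block_gray_value(block):
    """Claim 4's block gray value: the ratio of pixels with value 1 to all
    pixels in the m×n block, expressed as a percentage."""
    m, n = len(block), len(block[0])
    ones = sum(sum(row) for row in block)
    return ones / (m * n) * 100
```

For instance, a 2×2 block with two black pixels has a gray value of 50%.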
CN202211328928.3A 2022-10-27 2022-10-27 Character recognition device and character recognition method Pending CN116168396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211328928.3A CN116168396A (en) 2022-10-27 2022-10-27 Character recognition device and character recognition method


Publications (1)

Publication Number Publication Date
CN116168396A true CN116168396A (en) 2023-05-26

Family

ID=86415193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211328928.3A Pending CN116168396A (en) 2022-10-27 2022-10-27 Character recognition device and character recognition method

Country Status (1)

Country Link
CN (1) CN116168396A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918987A (en) * 2018-12-29 2019-06-21 中国电子科技集团公司信息科学研究院 A kind of video caption keyword recognition method and device
CN114391260A (en) * 2019-12-30 2022-04-22 深圳市欢太科技有限公司 Character recognition method and device, storage medium and electronic equipment
CN115035530A (en) * 2022-04-21 2022-09-09 阿里巴巴达摩院(杭州)科技有限公司 Image processing method, image text obtaining method, device and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination