CN115565109A - Text recognition method and device, electronic equipment and storage medium


Info

Publication number
CN115565109A
Authority
CN
China
Prior art keywords
text
image frame
recognition
video
video frames
Prior art date
Legal status
Pending
Application number
CN202211277989.1A
Other languages
Chinese (zh)
Inventor
吴嘉嘉
殷兵
胡金水
刘聪
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211277989.1A
Publication of CN115565109A
Current legal status: Pending

Classifications

    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06V 30/19147: Character recognition; obtaining sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 30/1918: Character recognition; fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06V 30/41: Document-oriented image-based pattern recognition; analysis of document content


Abstract

The application provides a text recognition method and apparatus, an electronic device and a storage medium. Text recognition is performed on image frame sequences that share the same text content, which reduces interference from image frames that do not contain that content and improves text recognition accuracy in video scenes. Moreover, when the current frame is recognized, multi-modal information from the previous frame, such as its position, semantic and visual features, is incorporated, which increases the amount of information available during recognition and further improves accuracy.

Description

Text recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of text recognition technologies, and in particular, to a text recognition method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of text recognition technology, the accuracy of text recognition on pictures has improved remarkably. However, research on text recognition for videos is still at an early stage: the common approach extracts key frames from a video and then performs text recognition on those key frames, and its accuracy is limited. How to improve text recognition accuracy in video scenes is therefore a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
Based on the above requirements, the application provides a text recognition method, a text recognition apparatus, an electronic device and a storage medium that can improve text recognition accuracy in video scenes.
The technical solutions provided by the application are as follows:
in one aspect, the present application provides a text recognition method, including:
s1, extracting text features of an Nth image frame from an image frame sequence with the same text content; wherein N is a positive integer;
s2, performing text recognition on the (N + 1) th image frame in the image frame sequence based on the text characteristics of the (N) th image frame to obtain a recognition text corresponding to the (N + 1) th image frame;
and S3, enabling N = N +1, and repeatedly executing the step S1 and the step S2 until N is equal to the number of the image frame sequence, and determining the identification text corresponding to the image frame sequence according to the identification text corresponding to each image frame.
Further, in the method described above, if N = 1, the method further includes:
performing text recognition on the first image frame in the image frame sequence to obtain the recognition text corresponding to the first image frame.
Further, in the method described above, the text features of the Nth image frame include at least one of position features, semantic features and visual features of the text in the Nth image frame.
Further, in the method described above, extracting the text features of the Nth image frame from the Nth image frame in an image frame sequence having the same text content includes:
extracting feature information from the Nth image frame in the image frame sequence, the feature information including at least one of position features, semantic features and visual features of the text;
and fusing the extracted feature information to obtain the text features of the Nth image frame.
Further, in the method described above, extracting the text features of the Nth image frame and performing text recognition on the (N+1)th image frame based on those features to obtain the corresponding recognition text includes:
inputting the image frame sequence having the same text content into a pre-trained text recognition model, so that the text recognition model extracts the text features of the Nth image frame from the sequence and performs text recognition on the (N+1)th image frame based on those features, obtaining the recognition text corresponding to the (N+1)th image frame.
Further, in the method described above, determining the recognition text corresponding to the image frame sequence from the recognition texts corresponding to the individual image frames includes:
detecting the confidence of the recognition text corresponding to each image frame;
and determining the recognition text with the highest confidence as the recognition text corresponding to the image frame sequence.
Further, in the method described above, the image frame sequence having the same text content is extracted from a video.
Further, in the method described above, extracting image frame sequences having the same text content includes:
extracting, from the video, video frames whose text content overlap is higher than a set overlap threshold to form a video frame sequence;
and extracting image frame sequences having the same text content from the video frame sequence.
Further, in the method described above, extracting video frames whose text content overlap is higher than the set overlap threshold includes:
performing text recognition on the set characters of each text group of each video frame in the video to obtain the recognized characters corresponding to each text group;
traversing the recognized characters corresponding to the text groups in adjacent video frames, and determining the number of target text groups whose recognized characters are the same in the adjacent video frames;
and if the number of target text groups in the adjacent video frames satisfies a set condition, determining that the adjacent video frames are video frames whose text content overlap is higher than the set overlap threshold.
Further, in the above method, determining that the adjacent video frames are video frames whose text content overlap is higher than the set overlap threshold when the number of target text groups satisfies the set condition includes:
calculating the ratio of the number of target text groups in the adjacent video frames to the larger of the text group counts of the adjacent video frames;
and if the ratio is greater than a set value, concluding that the number of target text groups satisfies the set condition and that the adjacent video frames are video frames whose text content overlap is higher than the set overlap threshold.
Further, in the method described above, extracting image frame sequences having the same text content from the video frame sequence includes:
respectively recognizing the text in each of the adjacent video frames whose distance from the target text group is within a set distance threshold;
and if those texts are the same in the adjacent video frames, extracting the region where the target text group is located from the video frame sequence to form the image frame sequence.
Further, in the method described above, the method further includes: combining the recognition texts corresponding to all image frame sequences having the same text content in the video to obtain the recognition text corresponding to the video.
On the other hand, the present application also provides a text recognition apparatus, including:
an extraction module for executing step S1: extracting text features of the Nth image frame from the Nth image frame in an image frame sequence having the same text content, where N is a positive integer;
a recognition module for executing step S2: performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognition text corresponding to the (N+1)th image frame;
and a repeating module for executing step S3: setting N = N+1 and controlling the extraction module to repeat step S1 and the recognition module to repeat step S2 until N equals the number of frames in the image frame sequence, then determining the recognition text corresponding to the image frame sequence from the recognition texts corresponding to the individual image frames.
In another aspect, the present application further provides an electronic device, including:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to implement the text recognition method according to any one of the above items by running a program in the memory.
In another aspect, the present application further provides a storage medium, including: the storage medium has stored thereon a computer program which, when executed by a processor, implements the text recognition method of any of the above.
With the text recognition method provided by the application, text recognition is performed on image frame sequences that share the same text content, which reduces interference from other image frames that do not contain that content and improves text recognition accuracy in video scenes. In addition, when the current frame is recognized, multi-modal information from the previous frame, such as its position, semantic and visual features, is incorporated, which increases the amount of information available during recognition and further improves text recognition accuracy in video scenes.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a text recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of extracting the text features of the Nth image frame according to an embodiment of the present application;
FIG. 3 is a schematic processing flow diagram of a text recognition model for a set of training samples according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of extracting a sequence of image frames from a video according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of extracting a sequence of video frames according to an embodiment of the present application;
FIG. 6 is a diagram illustrating text recognition of set characters according to an embodiment of the present application;
FIG. 7 is a diagram of a verification target text group provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Summary of the application
The technical solutions of the embodiments of the present application are suitable for application scenarios in which text recognition is performed on videos; adopting these solutions can improve text recognition accuracy in video scenes.
In recent years, text recognition technology has been rapidly developed, and the accuracy of text recognition on pictures has been significantly improved. For example, using Optical Character Recognition (OCR) technology, text in a picture can be recognized with high accuracy.
However, current research on text recognition for videos is still in the early stages. In the prior art, when text content in a video is recognized, the usual method extracts key frames from the video, converting the problem of recognizing text in the video into the problem of recognizing text in the key frames, and then applies picture-oriented text recognition technology to the key frames to obtain the recognition text. The accuracy of text recognition in video therefore depends on whether high-quality key frames are extracted: if clear, complete key frames are extracted, the accuracy is higher. However, compared with a static picture, the text in a dynamic video may change in size, angle, position, definition, background and so on, which makes it difficult to extract high-quality key frames and thus affects the accuracy of text recognition.
Based on this, the application provides a text recognition method and apparatus, an electronic device and a storage medium. The technical solution performs text recognition on image frame sequences that share the same text content, improving text recognition accuracy in video scenes, and combines the multi-modal information of the previous frame when recognizing the current frame, increasing the amount of information available during recognition and further improving accuracy.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Exemplary method
The embodiment of the application provides a text recognition method, which can be executed by an electronic device, wherein the electronic device can be any device with a data and instruction processing function, such as a computer, an intelligent terminal, a server and the like. Referring to fig. 1, the method includes:
s101, extracting text features of an Nth image frame from the Nth image frame in the image frame sequence with the same text content.
The image frame sequence includes a plurality of image frames, and the image frames in the sequence have the same text content. An image frame may be a complete video frame or a portion of the image in one video frame. A video frame is the smallest unit of a video: each frame is a still image, and consecutive video frames form a dynamic video.
The image frame sequence is extracted from a video containing the text content to be recognized. In this embodiment, the language and the font of the text content are not limited: the language may be Chinese, English, or another language, and the font may be handwritten or a printed typeface such as Song, regular script, black body (Hei) or clerical script.
The electronic device executing the text recognition method of this embodiment can download the video containing the text content to be recognized from a server, or acquire it from an intermediate storage device, where the intermediate storage device is any device with a storage function, such as a USB flash drive or a memory card; if the electronic device includes a camera, the video containing the text content to be recognized may also be captured by the camera of the electronic device. This embodiment is not limited.
After a video containing the text content to be recognized is acquired, image frame sequences having the same text content are extracted from it. Specifically, this embodiment analyzes each video frame in the video and crops image frame sequences having the same text content from the video frames.
For example, each video frame in the video may be analyzed in units of text groups to determine whether text groups with the same text content exist in different video frames. A text group may be a text of set length in the text content, for example a line, a column or a paragraph of text; this embodiment is not limited. If video frames contain text groups with the same text content, the regions where those text groups are located are extracted from the video frames to form an image frame sequence.
When determining whether text groups with the same text content exist in different video frames, the image features of each text group in each video frame can be extracted, the image features including at least one of color features, texture features, shape features and spatial relationship features. Text groups whose image features are the same in different video frames are determined to be text groups with the same text content. The image features of each text group can be extracted with a network model such as VGGNet or ResNet; this embodiment is not limited.
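To make the image-feature comparison concrete, here is a minimal sketch, assuming a torchvision ResNet-18 backbone as the feature extractor and a cosine-similarity check; the description names VGGNet/ResNet-style extractors without fixing an interface, so the function names and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch (not the patent's fixed method): comparing text-group crops
# across video frames by visual features, using a torchvision ResNet-18
# backbone as one possible extractor. Names and the 0.9 threshold are
# illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()  # keep the pooled 512-d feature vector
backbone.eval()

preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

@torch.no_grad()
def text_group_feature(crop):
    """crop: HxWx3 uint8 array holding one text group's image region."""
    return backbone(preprocess(crop).unsqueeze(0)).squeeze(0)

def same_text_content(crop_a, crop_b, threshold=0.9):
    # Treat two text groups as sharing text content when their visual
    # features are nearly identical.
    fa, fb = text_group_feature(crop_a), text_group_feature(crop_b)
    return torch.cosine_similarity(fa, fb, dim=0).item() > threshold
```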
Alternatively, when determining whether text groups with the same text content exist in different video frames, several characters at the same set position in each text group can be recognized. The set position and the number of characters at it may be chosen according to actual conditions; this embodiment is not limited. It should be noted that the number of characters at the set position should be as small as possible, to reduce the workload of text group recognition and improve recognition efficiency. For example, the first two characters or the last two characters of each text group may be recognized. Text groups whose characters at the same set position are the same are determined to be text groups with the same text content.
As yet another example, each frame in the video may be analyzed by first detecting the degree of text content overlap between video frames. If the text content overlap of several video frames is greater than a set value, those video frames can form a video frame sequence, and image frame sequences with the same text content are then extracted from the video frame sequence. Compared with extracting text groups with the same text content directly from all video frames, first extracting a video frame sequence and then extracting image frame sequences from it reduces the character recognition workload and improves overall recognition efficiency.
In this embodiment, after an image frame sequence having the same text content is extracted, text recognition is performed on it. Specifically, the text features of the Nth image frame are extracted from the Nth image frame in the sequence, where N is a positive integer.
The text features include at least one of position features, semantic features and visual features of the text in the Nth image frame. Extracting position, semantic and visual features from an image with a neural network model is a mature technique in the field, and those skilled in the art may refer to existing descriptions to extract the text features of the Nth image frame from the Nth image frame in the image frame sequence.
Illustratively, as shown in fig. 2, the text content in the Nth image frame A is handwritten Chinese. If the text features of the Nth image frame A include position, semantic and visual features of its text, the position of each character in frame A is detected by a character detection model, and the position features of the text are obtained from the position embedding of each character's location; the output of the character detection model's backbone is taken as the visual features of the text; and frame A is fed into a recognition model to obtain the recognition result, namely "I am Xiaoming", which is then fed into a BERT model to obtain the semantic features of the text output by the BERT model.
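The following sketch assembles the three features in the way the example above describes; the character detection model, recognition model and position-embedding module are assumed pretrained components whose interfaces (`detector`, `recognizer`, `pos_embedding`) are hypothetical, while the BERT calls use the standard Hugging Face transformers API.

```python
# Minimal sketch of assembling the three text features of the Nth image frame.
# `detector` (returning per-character boxes plus its backbone feature map),
# `recognizer` and `pos_embedding` are assumed pretrained components with
# hypothetical interfaces; the BERT calls are the standard transformers API.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def extract_text_features(frame, detector, recognizer, pos_embedding):
    boxes, visual_feat = detector(frame)  # character positions + backbone output
    position_feat = pos_embedding(boxes)  # position embedding of the boxes
    text = recognizer(frame)              # e.g. "I am Xiaoming"
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        semantic_feat = bert(**tokens).last_hidden_state.mean(dim=1)
    return position_feat, semantic_feat, visual_feat
```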
Images containing text content can be used as training samples, with the position of each character as the label, to train the character detection model; when the loss value of the character detection model falls below a set value, its training is complete. Likewise, images containing text content can be used as training samples, with the recognition text corresponding to the text content as the label, to train the recognition model; when the loss value of the recognition model falls below a set value, its training is complete.
It should be noted that if the text features of the Nth image frame include the position, semantic and visual features of its text, the three features may be used directly as the text features, or, as shown in fig. 2, they may first be fused and the fused feature used as the text features. Similarly, if the text features include only two of the position, semantic and visual features, the two features may be used directly or fused first, with the fused feature used as the text features.
In this embodiment, the initial value of N is 1; that is, the text features of the first image frame are extracted first from the image frame sequence having the same text content.
S102, performing text recognition on the (N+1)th image frame in the image frame sequence based on the text features of the Nth image frame to obtain the recognition text corresponding to the (N+1)th image frame.
After the text features of the Nth image frame are extracted, text recognition may be performed on the (N+1)th image frame based on those features, so that the recognition text corresponding to the (N+1)th image frame is determined in combination with the text features of the Nth image frame.
The recognition text corresponding to the (N+1)th image frame can be obtained with a pre-trained text content recognition model. Specifically, the text features of the Nth image frame and the (N+1)th image frame are input into the text content recognition model, so that the model performs text recognition on the (N+1)th image frame based on the text features of the Nth image frame and outputs the corresponding recognition text.
A training sample of the text content recognition model consists of the (n+1)th sample image frame in a sample image frame sequence together with the text features of the nth sample image frame, and the label is the recognition text corresponding to the (n+1)th sample image frame. When training the model, a training sample is input into the text content recognition model, the cross-entropy loss is calculated from the model's recognition result and the label, and the model parameters are updated with the Back Propagation (BP) algorithm; these training steps are repeated until the cross-entropy loss is below a set value, at which point training is complete.
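A minimal sketch of one such training step follows; `model(prev_feat, frame)` returning per-character logits and the shape of `target` are illustrative assumptions, while the cross-entropy loss and BP update follow the description above.

```python
# Minimal sketch of one training step of the text content recognition model.
# `model(prev_feat, frame)` returning per-character logits and the shape of
# `target` (label token ids) are illustrative assumptions.
import torch.nn.functional as F

def train_step(model, optimizer, prev_feat, frame, target):
    logits = model(prev_feat, frame)        # (num_chars, vocab_size)
    loss = F.cross_entropy(logits, target)  # compare against the label text
    optimizer.zero_grad()
    loss.backward()                         # Back Propagation (BP)
    optimizer.step()
    return loss.item()
```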
For example, if N is 1, after the text features of the first image frame are extracted, text recognition is performed on the second image frame based on those features, obtaining the recognition text corresponding to the second image frame.
S103, detecting whether N+1 equals the number of frames in the image frame sequence; if N+1 is smaller than the number of frames, setting N = N+1 and repeating step S101; if N+1 equals the number of frames, executing step S104.
The number of frames in the image frame sequence is the number of image frames it contains; for example, a sequence of 20 frames includes 20 image frames.
For example, suppose the image frame sequence contains 20 frames and N starts at 1. The text features of the first image frame are extracted, and text recognition is performed on the second image frame based on those features, yielding the recognition text corresponding to the second frame. Since N+1 = 2 is smaller than 20, N is set to 2 and step S101 is repeated: the text features of the second image frame are extracted and used to recognize the third image frame. When N is 2, N+1 = 3 is still smaller than 20, so N is set to 3 and the features of the third frame are used to recognize the fourth frame, and so on. When N is 19, N+1 = 20 equals the number of frames in the sequence; recognition of the image frames in the sequence is then complete, and step S104 is executed.
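The walkthrough above corresponds to the following minimal sketch of the S101-S104 loop; `extract_features` and `recognize` are assumed wrappers around the models described earlier, with `recognize` returning a (text, confidence) pair that step S104 below uses to pick the final result.

```python
# Minimal sketch of the S101-S104 loop over one image frame sequence.
# `extract_features` and `recognize` are assumed wrappers around the models
# described above; their interfaces are illustrative.
def recognize_sequence(frames, extract_features, recognize):
    results = []                                    # (text, confidence) per frame
    feat = extract_features(frames[0])              # S101 on the first frame
    for n in range(1, len(frames)):                 # N + 1 walks to the last frame
        results.append(recognize(feat, frames[n]))  # S102 with frame N's features
        feat = extract_features(frames[n])          # S103: N = N + 1, repeat S101
    return max(results, key=lambda r: r[1])[0]      # S104: highest-confidence text
```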
It should be noted that when the first image frame of the sequence is processed, that is, when N = 1, no text features have been extracted yet, and recognizing the first frame on its own may be less accurate. Therefore, when N = 1, only the text features of the first image frame may be extracted, without performing text recognition, and those features are used to assist the recognition of the second image frame. Alternatively, text recognition may be performed on the first image frame alone to obtain its corresponding recognition text; this embodiment is not limited.
S104, determining the recognition text corresponding to the image frame sequence from the recognition texts corresponding to the individual image frames.
In the embodiment of the application, the recognition text corresponding to the image frame sequence is determined from the recognition texts corresponding to all image frames of the sequence.
Specifically, after the text features of the Nth image frame and the (N+1)th image frame are input into the text content recognition model, the model outputs the recognition text corresponding to the (N+1)th image frame together with the confidence of that recognition text. In this embodiment, the confidence of the recognition text corresponding to each image frame is detected, and the recognition text with the highest confidence is used as the recognition text corresponding to the image frame sequence.
In one embodiment, text recognition is performed on the first image frame, for example with the recognition model of the above embodiment, which outputs the recognition text corresponding to the first image frame together with its confidence. In another embodiment, no text recognition is performed on the first image frame, and the confidence of its recognition text is taken to be zero.
The recognition texts corresponding to all video frame sequences in the video are then spliced together to obtain the recognition text of the video.
In the above embodiment, text recognition is performed on image frame sequences that share the same text content, which reduces interference from other image frames that do not contain that content and improves text recognition accuracy in video scenes. Moreover, when the current frame is recognized, multi-modal information from the previous frame, such as its position, semantic and visual features, is incorporated, which increases the amount of information available during recognition and further improves accuracy.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that, if N = 1, the text recognition method in the above embodiment may further include the following step:
and performing text recognition on a first image frame in the image frame sequence to obtain a recognition text corresponding to the first image frame.
In this embodiment, text recognition is performed on the first image frame. For example, the recognition model of the above embodiment may be used to perform text recognition on the first image frame, so as to obtain a recognition text corresponding to the first image frame; the first image frame may also be subjected to text recognition by using a mature OCR technology in the prior art to obtain a recognition text corresponding to the first image frame, which is not limited in this embodiment.
In the embodiment of the application, performing text recognition on the first image frame to obtain its recognition text ensures that the information in the first image frame is not missed when recognizing the text of the image frame sequence.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that the step of the above embodiment of extracting the text features of the Nth image frame from the Nth image frame in an image frame sequence having the same text content includes:
extracting feature information from the Nth image frame, the feature information including at least one of position features, semantic features and visual features of the text; and fusing the extracted feature information to obtain the text features of the Nth image frame.
In this embodiment, at least one of the position, semantic and visual features of the text is extracted from the Nth image frame as feature information, the extracted feature information is fused, and the fused feature obtained after fusion is determined to be the text features of the Nth image frame.
The feature information may be fused by splicing (concatenation) or by direct addition; this embodiment is not limited.
Using at least one of the position, semantic and visual features extracted from the Nth image frame to assist the recognition of the (N+1)th image frame increases the amount of information available during recognition and further improves text recognition accuracy in video scenes, as the sketch below illustrates.
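A minimal sketch of the two fusion modes, assuming the extracted features are torch tensors; the `mode` switch and shape handling are illustrative.

```python
# Minimal sketch of the two fusion modes mentioned above: splicing
# (concatenation) or direct element-wise addition of the extracted features.
import torch

def fuse_features(position_feat, semantic_feat, visual_feat, mode="concat"):
    feats = [position_feat, semantic_feat, visual_feat]
    if mode == "concat":                  # splicing-and-fusing
        return torch.cat(feats, dim=-1)
    # Direct addition requires the features to share one dimensionality,
    # e.g. after linear projections (omitted here for brevity).
    return torch.stack(feats, dim=0).sum(dim=0)
```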
As an optional implementation manner, in another embodiment of the present application, it is disclosed that extracting the text features of the Nth image frame and performing text recognition on the (N+1)th image frame based on those features to obtain the corresponding recognition text includes:
inputting the image frame sequence having the same text content into a pre-trained text recognition model, so that the model extracts the text features of the Nth image frame from the Nth image frame of the sequence and performs text recognition on the (N+1)th image frame based on those features, obtaining the recognition text corresponding to the (N+1)th image frame.
Specifically, in the embodiment of the present application, a text recognition model is trained in advance, for example with an Encoder-Decoder framework as the basic architecture. The training samples of the text recognition model are sample image frame sequences having the same text content, and the labels are the recognition texts of the sample image frames in each sequence. During training, a training sample is input into the text recognition model, which recognizes the recognition text corresponding to each sample image frame in an autoregressive manner.
Fig. 3 illustrates the processing flow for one set of training samples. As shown in fig. 3, the text recognition model includes a text recognition network and a feature extraction network. The text recognition network recognizes the text in the first sample image frame as the output result for that frame, and the feature extraction network extracts the text features of the recognized text in the first frame. Based on those features, the text recognition network recognizes the second sample image frame, producing the output result for the second frame, and the feature extraction network extracts the text features of the recognized text in the second frame. This continues frame by frame until, based on the text features of the recognized text in the second-to-last sample image frame, the text recognition network recognizes the last sample image frame and produces its output result.
The cross-entropy loss is calculated from the label of each sample image frame and the corresponding output result, and the parameters of the text recognition network and/or the feature extraction network are then updated with the BP algorithm so as to reduce the cross-entropy loss.
These training steps are repeated until the cross-entropy loss of the text recognition model is below a set value, at which point training of the text recognition model is complete. A minimal sketch of this flow follows.
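Under the stated assumptions (an autoregressive rollout with a recognition network and a feature extraction network, one accumulated cross-entropy loss, BP updates), the flow might look like the following; `recog_net`, `feat_net` and the tensor shapes are not fixed by the text and are illustrative.

```python
# Minimal sketch of the Fig. 3 rollout: the text recognition network
# `recog_net` and feature extraction network `feat_net` process the sample
# sequence autoregressively, and one accumulated cross-entropy loss drives
# the BP update. Interfaces are illustrative assumptions.
import torch.nn.functional as F

def train_on_sequence(recog_net, feat_net, optimizer, frames, labels):
    total_loss, prev_feat = 0.0, None
    for frame, label in zip(frames, labels):
        logits = recog_net(frame, prev_feat)     # recognize the current frame
        total_loss = total_loss + F.cross_entropy(logits, label)
        prev_feat = feat_net(logits.argmax(-1))  # features of the recognized text
    optimizer.zero_grad()
    total_loss.backward()                        # BP updates both networks
    optimizer.step()
    return float(total_loss)
```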
In this embodiment, the image frame sequence having the same text content is input into the pre-trained text recognition model, which recognizes and outputs the recognition text corresponding to each image frame in the sequence.
In the above embodiment, each image frame in the image frame sequence is recognized with the pre-trained text recognition model, so the recognition speed is high, and because a large number of training samples are used during training, the text recognition model achieves high accuracy.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that determining the recognition text corresponding to the image frame sequence from the recognition texts corresponding to the individual image frames includes:
detecting the confidence of the recognition text corresponding to each image frame; and determining the recognition text with the highest confidence as the recognition text corresponding to the image frame sequence.
In this embodiment, the confidence of the recognition text corresponding to each image frame is detected, and the recognition text with the highest confidence is used as the recognition text corresponding to the image frame sequence, which improves the accuracy of text content recognition for the sequence.
As an alternative implementation, in another embodiment of the present application, it is disclosed that the image frame sequence with the same text content of the above embodiment is extracted from a video.
Specifically, the video is a video containing the text content to be recognized. This embodiment analyzes each frame in the video and crops out the image frame sequences having the same text content from the video frames, so that text recognition is performed on those sequences. Recognizing text content on image frame sequences that share the same text content reduces interference from other image frames and improves text recognition accuracy in video scenes.
As an alternative implementation, as shown in fig. 4, in another embodiment of the present application, an extraction process of an image frame sequence with the same text content is disclosed, which specifically includes the following steps:
S401, extracting, from the video, video frames whose text content overlap is higher than a set overlap threshold to form a video frame sequence.
In this embodiment, each frame in the video may be analyzed by first detecting the text content overlap between video frames. If the text content overlap of several video frames is higher than the set overlap threshold, those video frames can form a video frame sequence.
For example, image features of the video frames may be extracted and the number of identical image features in different video frames detected; if the number of identical image features in several different video frames is higher than the set overlap threshold, those video frames are combined into a video frame sequence. The overlap threshold may be set according to actual conditions; this embodiment is not limited.
As another example, each video frame may be analyzed in units of text groups to determine whether text groups with the same text content exist in different video frames. If several video frames contain text groups with the same text content, those frames have overlapping content, namely the shared text groups. The ratio of the overlapping content to the whole content of each video frame is calculated as the overlap, and video frames whose overlap is higher than the set overlap threshold form a video frame sequence. The overlap threshold may be set according to actual conditions; this embodiment is not limited.
If a video frame's text content overlap with every other video frame is below the set overlap threshold, that frame may form a video frame sequence on its own or be discarded.
S402, extracting image frame sequences with the same text content from the video frame sequences.
The regions where the text groups with the same text content are located are cropped from the video frame sequence to form an image frame sequence with the same text content.
It should be noted that if an image frame contains a text group that differs from the text groups in the other image frames, such an image frame may form a sequence on its own or be discarded. When such an image frame forms its own sequence, text recognition is performed only on that frame, and its recognition text is used as the recognition text of the sequence.
In this embodiment, the video frame sequence is first extracted from all video frames and the image frame sequences are then extracted from the video frame sequence, which reduces the character recognition workload and improves overall recognition efficiency.
As an optional implementation manner, as shown in fig. 5, another embodiment of the present application discloses that extracting, from the video, video frames whose text content overlap is higher than the set overlap threshold may specifically include the following steps:
S501, performing text recognition on the set characters of each text group of each video frame in the video to obtain the recognized characters corresponding to each text group.
The set characters are the characters at a set position and of a set number. The set position and number may be chosen according to actual conditions; this embodiment is not limited. For example, the first two or the last two characters of each text group may be recognized. In this embodiment, text recognition is performed on the set characters of each text group in each video frame, using the recognition model of the above embodiment or OCR technology (this embodiment is not limited), to obtain the recognized characters corresponding to each text group.
Illustratively, fig. 6 shows a video frame 61. If each line is taken as a text group and the set characters are the first two characters of each line, text recognition is performed on the first two characters of each line in video frame 61, yielding the recognized text 62 shown in fig. 6.
S502, traversing the recognized characters corresponding to each text group in the adjacent video frames, and determining the number of target text groups with the same recognized characters in the adjacent video frames.
Due to the continuity of video, the text content overlap of adjacent video frames is more likely to be higher than the set overlap threshold. Based on this, in this embodiment, the recognized characters corresponding to each text group in adjacent video frames are traversed, and the number of target text groups with the same recognized characters in the adjacent frames is determined.
Taking the embodiment shown in fig. 6 as an example, the first and second frames are adjacent video frames, as are the second and third frames. By traversing the recognized characters of the text groups of the first and second frames, the number of target text groups with the same recognized characters between them is determined to be 3; by traversing the recognized characters of the text groups of the second and third frames, the corresponding number is determined to be 0.
S503, if the number of target text groups in the adjacent video frames satisfies the set condition, determining that the adjacent video frames are video frames whose text content overlap is higher than the set overlap threshold.
The set condition may be chosen according to actual conditions. For example, it may require the number of target text groups to exceed a set percentage of the total number of text groups of each adjacent video frame, for example 90%; or it may require the area occupied by the target text groups to exceed a set percentage of the total text area of each adjacent frame, for example 90%. This embodiment is not limited.
For example, suppose the condition is that the number of target text groups exceeds 90% of the total number of text groups of each adjacent frame. In the embodiment shown in fig. 6, the number of target text groups with the same recognized characters between the first and second frames is 3; the ratio of that number to the number of text groups is 100% in the first frame and 100% in the second frame, both greater than 90%, so the first and second frames are determined to be video frames whose text content overlap is higher than the set overlap threshold. Between the second and third frames the number of target text groups is 0, so both ratios are 0, below 90%, and the second and third frames are determined not to be such video frames.
In the above embodiment, whether adjacent video frames have text content overlap above the set overlap threshold can be determined by recognizing only the set number of characters in each text group, without recognizing all characters in the video frames, so recognition is fast and efficient.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that, if the number of target text groups in adjacent video frames reaches a set condition, the step of determining that the adjacent video frames are video frames with text content overlapping degrees higher than a set overlapping degree threshold may specifically include the following steps:
calculating the ratio of the number of the target text groups in the adjacent video frames to the maximum value of the number of the text groups in the adjacent video frames; if the ratio is larger than the set value, the number of the target text groups in the adjacent video frames reaches the set condition, and the adjacent video frames are determined to be the video frames with the text content coincidence degree higher than the set coincidence degree threshold value.
In this embodiment, only the ratio of the number of the target text groups in the adjacent video frames to the maximum number of the text groups in the adjacent video frames is calculated, because the ratio of the number of the target text groups in the adjacent video frames to the minimum number of the text groups in the adjacent video frames is necessarily greater than the ratio of the number of the target text groups in the adjacent video frames to the maximum number of the text groups in the adjacent video frames. If the ratio of the number of the target text groups in the adjacent video frames to the maximum value of the number of the text groups in the adjacent video frames is larger than a set value, the number of the target text groups in the adjacent video frames can be determined to reach the set condition.
Illustratively, if the number of target text groups in adjacent video frames is 5, one of the adjacent video frames includes 5 text groups and the other includes 6 text groups, then the maximum text group count in the adjacent video frames is 6, and the ratio of the number of target text groups (5) to this maximum (6) is approximately 0.83. If 0.83 is greater than the set value, it may be determined that the number of target text groups in the adjacent video frames reaches the set condition. For example, if the set value is 0.7, the condition is reached, and the adjacent video frames are determined to be video frames whose text content coincidence degree is higher than the set coincidence degree threshold.
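Purely for illustration, a minimal Python sketch of this check follows; the function name, the 0.7 set value used above, and the representation of each frame's text groups as a list of identification-character strings are assumptions of the sketch, not details disclosed by this application.

def is_high_coincidence(groups_a, groups_b, set_value=0.7):
    # groups_a, groups_b: identification characters recognized from the
    # set characters of each text group in the two adjacent video frames
    # (assumed unique within a frame).
    target_count = len(set(groups_a) & set(groups_b))  # target text groups
    # Only the ratio to the larger text group count is checked; if it
    # exceeds the set value, the ratio to the smaller count does as well.
    return target_count / max(len(groups_a), len(groups_b)) > set_value

# Worked example above: 5 matching groups, frames with 5 and 6 text groups.
print(is_high_coincidence(["g0", "g1", "g2", "g3", "g4"],
                          ["g0", "g1", "g2", "g3", "g4", "g5"]))  # 5/6 ≈ 0.83 → True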
In the above embodiment, only the ratio of the number of the target text groups in the adjacent video frames to the maximum value of the number of the text groups in the adjacent video frames is calculated, and whether the number of the target text groups in the adjacent video frames reaches the set condition is determined based on the ratio, so that the calculation amount can be reduced, and the overall recognition speed can be improved.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the step of the foregoing embodiment of extracting an image frame sequence with the same text content from the video frame sequence may specifically include the following steps:
respectively identifying, in the adjacent video frames, the texts whose distances from the target text group are within a set distance threshold range; and if these texts in the adjacent video frames are the same, extracting the region where the target text group is located from the video frame sequence to form the image frame sequence.
Specifically, in the above embodiment, only the set characters in each text group are recognized, and text groups whose set characters are the same are treated as the same text group. In actual recognition, however, a special case may arise in which the set characters of two text groups are the same but the text groups themselves are different; if an image frame sequence generated from such text groups is subjected to text recognition, it is difficult to obtain an accurate recognition result.
In order to avoid the influence of the above situation on the recognition result, a step of verifying the target text group is added in this embodiment. The rationale is that, if the target text groups in the adjacent video frames really are the same text group, the texts within the set distance threshold range of the target text group should also be the same in the adjacent video frames.
Based on this, the verification process detects whether the texts whose distances from the target text group are within the set distance threshold range are the same in the adjacent video frames. If they are the same, it may be determined that the target text groups determined in the above embodiment are indeed the same text group, and the region where the target text group is located may be extracted from the video frame sequence to form the image frame sequence.
The distance may be a Euclidean distance. The set distance threshold range and the number of texts within it may be determined according to actual conditions: for example, the set distance threshold range may specify a concrete distance range, or the texts closest to (or farthest from) the target text group may be detected, and the number of texts within the range may be 1 or more, which is not limited in this embodiment.
In a specific practical scenario, the three texts closest to the target text group in each of the adjacent video frames are detected. If the closest texts in the adjacent video frames are the same, the second-closest texts are the same, and the third-closest texts are also the same, the target text groups in the adjacent video frames are determined to be the same, and the region where the target text group is located is extracted from the video frame sequence to form the image frame sequence.
For example, as shown in fig. 7, in two adjacent video frames, the three texts closest to the target text group in the first frame are "not", "this" and "water", and the three texts closest to the target text group in the second frame are also "not", "this" and "water", so the target text groups in the first frame and the second frame can be determined to be the same text group.
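Purely as an illustration, the following Python sketch shows one way such a nearest-neighbour check could be written, assuming each text's position is available as an (x, y) center point; the helper names, the data layout and the choice of k = 3 neighbours are assumptions of the sketch.

import math

def nearest_texts(target_xy, texts, k=3):
    # texts: list of (character, (x, y) center) pairs detected in one frame;
    # returns the k characters closest (Euclidean distance) to the target
    # text group's position, nearest first.
    ranked = sorted(texts, key=lambda item: math.dist(target_xy, item[1]))
    return [char for char, _ in ranked[:k]]

def verify_target_group(target_a, texts_a, target_b, texts_b, k=3):
    # Target text groups in adjacent frames are accepted as the same group
    # only if their k nearest neighbouring texts coincide in order, as in
    # the fig. 7 example ("not", "this", "water" in both frames).
    return nearest_texts(target_a, texts_a, k) == nearest_texts(target_b, texts_b, k)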
In the above embodiment, the step of verifying the target text group is added, so that the situation that the text groups have the same set characters but different recognized text groups can be avoided, and the accuracy of recognizing the video frame sequence is improved.
As an optional implementation manner, another embodiment of the present application discloses that the text recognition method of the foregoing embodiment may specifically include the following steps:
and combining the identification texts corresponding to all the image frame sequences with the same text content in the video to obtain the identification texts corresponding to the video.
Specifically, the identification texts corresponding to the image frame sequences extracted from the same video frame sequence are arranged according to the writing sequence of the text content in the video, and the identification texts corresponding to the image frame sequences extracted from different video frame sequences are arranged according to the shooting time sequence of the video, thereby obtaining the identification text corresponding to the video.
The writing sequence of the text content may be input by a user, or may be obtained through semantic recognition and detection, which is not limited in this embodiment.
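As a rough sketch of this combination step, assuming each image frame sequence's recognition text has been tagged with the shooting time of its source video frame sequence and with its writing order within that source (all field names here are illustrative assumptions):

from typing import NamedTuple

class SequenceText(NamedTuple):
    shot_time: float     # shooting time of the source video frame sequence
    writing_order: int   # writing order of the text content within that source
    text: str            # recognition text of one image frame sequence

def merge_video_texts(items):
    # Across different video frame sequences: shooting-time order; within
    # the same video frame sequence: writing order of the text content.
    ordered = sorted(items, key=lambda s: (s.shot_time, s.writing_order))
    return "".join(s.text for s in ordered)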
In the above embodiment, the text content is identified for a plurality of image frame sequences having the same text content, and then the identified texts obtained by identification are combined together, so that the text identification accuracy in a video scene can be improved.
Moreover, with the technical solution of this embodiment, text content on a curved surface can be recognized, such as an instruction sheet or an ingredient list on a bottle. The user can shoot a video covering the curved text content, for example, first shooting the text content on the left side and then the text content on the right side, so as to obtain a video containing the complete text content. Then, by using the text recognition method of this embodiment, the recognition text corresponding to the video can be obtained.
Exemplary devices
Corresponding to the text recognition method, the embodiment of the present application further discloses a text recognition apparatus, as shown in fig. 8, the apparatus includes:
an extraction module 100, configured to perform step S1, to extract the text feature of an Nth image frame from the Nth image frame in an image frame sequence with the same text content; wherein N is a positive integer;
a recognition module 110, configured to execute step S2, performing text recognition on an (N + 1) th image frame in the image frame sequence based on the text feature of the Nth image frame to obtain a recognition text corresponding to the (N + 1) th image frame;
a repeating module 120, configured to execute step S3: letting N = N + 1, controlling the extraction module to repeatedly execute step S1 and the recognition module to repeatedly execute step S2, and, when N equals the number of image frames in the image frame sequence, determining the recognition text corresponding to the image frame sequence according to the recognition text corresponding to each image frame.
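A minimal sketch of the S1-S3 loop that these three modules carry out is given below; the three callables stand in for the feature extraction and text recognition components (for example, parts of the pre-trained text recognition model mentioned in the method embodiments) and are assumptions of the sketch.

def recognize_sequence(frames, recognize_plain, extract_features, recognize_with_features):
    # frames: an image frame sequence with the same text content.
    # N = 1: the first image frame is recognized directly, without prior features.
    texts = [recognize_plain(frames[0])]
    # S1-S3: extract the Nth frame's text feature (S1), recognize the
    # (N + 1)th frame conditioned on it (S2), then let N = N + 1 (S3),
    # until N equals the number of image frames in the sequence.
    for n in range(len(frames) - 1):
        feature = extract_features(frames[n])                           # S1
        texts.append(recognize_with_features(frames[n + 1], feature))  # S2
    return texts  # one recognition text per image frame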
As an optional implementation manner, in another embodiment of the present application, it is disclosed that, if N =1, the text recognition apparatus in the above embodiment further includes:
and the text recognition module is used for performing text recognition on a first image frame in the image frame sequence to obtain a recognition text corresponding to the first image frame.
As an alternative implementation, in another embodiment of the present application, it is disclosed that the text feature of the Nth image frame includes at least one of a position feature, a semantic feature and a visual feature of the text in the Nth image frame.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the extraction module 100 includes:
the image processing device comprises a first extraction unit, a second extraction unit and a third extraction unit, wherein the first extraction unit is used for extracting characteristic information from an Nth image frame in an image frame sequence with the same text content, and the characteristic information comprises at least one of position characteristics, semantic characteristics and visual characteristics of a text;
and the fusion unit is used for fusing the extracted feature information to obtain the text feature of the Nth image frame.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that, when the extraction module 100 extracts the text feature of the Nth image frame from the Nth image frame in an image frame sequence with the same text content, and the recognition module 110 performs text recognition on the (N + 1) th image frame in the image frame sequence based on the text feature of the Nth image frame to obtain a recognition text corresponding to the (N + 1) th image frame, the two modules are specifically configured to:
inputting the image frame sequence with the same text content into a pre-trained text recognition model, so that the text recognition model extracts the text features of the Nth image frame from the Nth image frame of the image frame sequence, and performing text recognition on the (N + 1) th image frame in the image frame sequence based on the text features of the Nth image frame to obtain a recognition text corresponding to the (N + 1) th image frame.
As an alternative implementation, in another embodiment of the present application, it is disclosed that the repeating module 120 includes:
the detection unit is used for detecting the confidence of the identification text corresponding to each image frame;
and the determining unit is used for determining the recognition text with the highest confidence coefficient as the recognition text corresponding to the image frame sequence.
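For example, the determining unit's selection reduces to the following one-line sketch, assuming each image frame's result is available as a (recognition text, confidence) pair:

def pick_sequence_text(results):
    # results: list of (recognition_text, confidence) pairs, one per image frame;
    # the text with the highest confidence represents the whole sequence.
    return max(results, key=lambda r: r[1])[0]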
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the text recognition apparatus of the above embodiment further includes:
and the image frame sequence extraction module is used for extracting the image frame sequences with the same text content from the video.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that the image frame sequence extraction module of the above embodiment includes:
the second extraction unit is used for extracting video frames with text content coincidence degree higher than a set coincidence degree threshold value from the video to form a video frame sequence;
and a third extraction unit for extracting a sequence of image frames having the same text content from the sequence of video frames.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that, when the second extraction unit of the above embodiment extracts a video frame from a video, where the text content coincidence degree is higher than a set coincidence degree threshold, the second extraction unit is specifically configured to:
performing text recognition on the set characters of each text group of each video frame in the video to obtain recognition characters corresponding to each text group; traversing the identification characters corresponding to each text group in the adjacent video frames, and determining the number of target text groups with the same identification characters in the adjacent video frames; and if the number of the target text groups in the adjacent video frames reaches the set condition, determining that the adjacent video frames are the video frames with the text content coincidence degree higher than the set coincidence degree threshold.
As an optional implementation manner, another embodiment of the present application discloses that, if the number of target text groups in adjacent video frames reaches a set condition, the second extraction unit of the above embodiment, when determining that the adjacent video frames are video frames whose text content overlap ratio is higher than a set overlap ratio threshold, is specifically configured to:
calculating the ratio of the number of the target text groups in the adjacent video frames to the maximum value of the number of the text groups in the adjacent video frames; if the ratio is larger than the set value, the number of the target text groups in the adjacent video frames reaches the set condition, and the adjacent video frames are determined to be the video frames with the text content coincidence degree higher than the set coincidence degree threshold value.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that, when the third extracting unit of the above embodiment extracts a sequence of image frames with the same text content from a sequence of video frames, the third extracting unit is specifically configured to:
respectively identifying, in the adjacent video frames, the texts whose distances from the target text group are within a set distance threshold range; and if these texts in the adjacent video frames are the same, extracting the region where the target text group is located from the video frame sequence to form the image frame sequence.
As an optional implementation manner, in another embodiment of the present application, it is disclosed that the text recognition apparatus of the above embodiment further includes:
and the combination module is used for combining the identification texts corresponding to all the image frame sequences with the same text content in the video to obtain the identification texts corresponding to the video.
Specifically, please refer to the contents of the above method embodiments for the specific working contents of each unit of the text recognition apparatus, which are not described herein again.
Exemplary electronic device, computer product, and storage medium
Corresponding to the text recognition method, an embodiment of the present application further discloses an electronic device, as shown in fig. 9, the electronic device includes:
a memory 200 and a processor 210;
wherein, the memory 200 is connected with the processor 210 for storing programs;
the processor 210 is configured to implement the text recognition method disclosed in any of the above embodiments by executing the program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores a program for executing the technical solution of the present application, and may also store an operating system and other key services. In particular, the program may include program code, and the program code includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that can store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that can store information and instructions, disk storage, flash memory, and so on.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any means for using a transceiver or the like to communicate with other devices or communication networks, such as ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the program stored in the memory 200 and invokes other devices, which can be used to implement the steps of the text recognition method provided in the above-described embodiments of the present application.
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by the processor 210, cause the processor 210 to perform the various steps of the text recognition method provided by the above-described embodiments.
The computer program product may include program code for carrying out operations of the embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the text recognition method provided by the above-described embodiments.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Specifically, the specific working contents of each part of the electronic device, the computer program product, and the storage medium, and the specific processing contents of the computer program product or the computer program on the storage medium when the computer program is executed by the processor may refer to the contents of each embodiment of the text recognition method, which are not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments are presented as a series of acts or combinations, it will be appreciated by those of ordinary skill in the art that the present application is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules or sub-modules in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules are integrated into one module. The integrated modules or sub-modules can be implemented in the form of hardware, and can also be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be located in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by "comprises a" does not preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A text recognition method, comprising:
S1, extracting a text feature of an Nth image frame from the Nth image frame in an image frame sequence with the same text content; wherein N is a positive integer;
S2, performing text recognition on an (N + 1) th image frame in the image frame sequence based on the text feature of the Nth image frame to obtain a recognition text corresponding to the (N + 1) th image frame;
and S3, letting N = N + 1, and repeatedly executing the step S1 and the step S2 until N is equal to the number of image frames in the image frame sequence, and determining the identification text corresponding to the image frame sequence according to the identification text corresponding to each image frame.
2. The method of claim 1, wherein if N =1, the method further comprises:
and performing text recognition on a first image frame in the image frame sequence to obtain a recognition text corresponding to the first image frame.
3. The method of claim 1, wherein the text features of the Nth image frame comprise at least one of location features, semantic features, and visual features of text in the Nth image frame.
4. The method of claim 3, wherein extracting the text feature of the Nth image frame from the Nth image frame in the image frame sequence with the same text content comprises:
extracting feature information from an Nth image frame in an image frame sequence with the same text content, wherein the feature information comprises at least one of a position feature, a semantic feature and a visual feature of a text;
and fusing the extracted feature information to obtain the text feature of the Nth image frame.
5. The method according to claim 1, wherein extracting a text feature of an Nth image frame from the Nth image frame in an image frame sequence with the same text content, and performing text recognition on an (N + 1) th image frame in the image frame sequence based on the text feature of the Nth image frame to obtain a recognition text corresponding to the (N + 1) th image frame comprises:
inputting an image frame sequence with the same text content into a pre-trained text recognition model so as to enable the text recognition model to extract text features of an Nth image frame from the Nth image frame of the image frame sequence, and performing text recognition on an (N + 1) th image frame in the image frame sequence based on the text features of the Nth image frame to obtain a recognition text corresponding to the (N + 1) th image frame.
6. The method of claim 1, wherein determining the identification text corresponding to the image frame sequence according to the identification text corresponding to each image frame comprises:
detecting the confidence coefficient of the identification text corresponding to each image frame;
and determining the recognition text with the highest confidence coefficient as the recognition text corresponding to the image frame sequence.
7. The method of claim 1, further comprising: extracting the image frame sequences with the same text content from a video.
8. The method according to claim 7, wherein the process of extracting the image frame sequences with the same text content comprises:
extracting video frames with text content coincidence degree higher than a set coincidence degree threshold value from the video to form a video frame sequence;
and extracting image frame sequences with the same text content from the video frame sequences.
9. The method of claim 8, wherein extracting video frames from the video having text content overlap above a set overlap threshold comprises:
performing text recognition on the set characters of each text group of each video frame in the video to obtain recognition characters corresponding to each text group;
traversing the identification characters corresponding to each text group in the adjacent video frames, and determining the number of target text groups with the same identification characters in the adjacent video frames;
and if the number of the target text groups in the adjacent video frames reaches a set condition, determining that the adjacent video frames are the video frames with the text content coincidence degree higher than a set coincidence degree threshold value.
10. The method of claim 9, wherein determining that the neighboring video frame is a video frame with a text content overlapping degree higher than a set overlapping degree threshold if the number of target text groups in the neighboring video frame reaches a set condition comprises:
calculating the ratio of the number of the target text groups in the adjacent video frames to the maximum value of the number of the text groups in the adjacent video frames;
if the ratio is larger than a set value, the number of the target text groups in the adjacent video frames reaches a set condition, and the adjacent video frames are determined to be video frames with text content coincidence degree higher than a set coincidence degree threshold value.
11. The method of claim 9, wherein extracting a sequence of image frames having the same textual content from the sequence of video frames comprises:
respectively identifying texts in adjacent video frames, wherein the distance between the texts and the target text group is within a set distance threshold range;
and if the texts in the adjacent video frames whose distances from the target text group are within the set distance threshold range are the same, extracting the region where the target text group is located from the video frame sequence to form the image frame sequence.
12. The method of claim 1, further comprising: and combining the identification texts corresponding to all the image frame sequences with the same text content in the video to obtain the identification texts corresponding to the video.
13. A text recognition apparatus, comprising:
the extraction module is used for executing the step S1 and extracting the text characteristics of the Nth image frame from the Nth image frame in the image frame sequence with the same text content; wherein N is a positive integer;
the identification module is used for executing the step S2, and performing text identification on the (N + 1) th image frame in the image frame sequence based on the text characteristics of the (N) th image frame to obtain an identification text corresponding to the (N + 1) th image frame;
and a repeating module, configured to execute step S3: letting N = N + 1, controlling the extraction module to repeatedly execute step S1 and the recognition module to repeatedly execute step S2 until N is equal to the number of image frames in the image frame sequence, and determining the identification text corresponding to the image frame sequence according to the identification text corresponding to each image frame.
14. An electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to implement the text recognition method according to any one of claims 1 to 12 by executing the program in the memory.
15. A storage medium, comprising: the storage medium has stored thereon a computer program which, when executed by a processor, implements a text recognition method as claimed in any one of claims 1 to 12.
