WO2022206534A1 - Method and apparatus for text content recognition, computer device, and storage medium - Google Patents


Info

Publication number: WO2022206534A1
Application number: PCT/CN2022/082690
Authority: WIPO (PCT)
Prior art keywords: fingertip, text, video frame, frame image, area
Other languages: French (fr), Chinese (zh)
Inventor: 林建民
Original assignees: 广州视源电子科技股份有限公司; 广州视源人工智能创新研究院有限公司
Application filed by 广州视源电子科技股份有限公司 and 广州视源人工智能创新研究院有限公司
Publication of WO2022206534A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G06F40/279: Recognition of textual entities
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/19: Recognition using electronic means

Definitions

  • the present application relates to the field of computer technology, for example, to a text content recognition method, apparatus, computer device and storage medium.
  • Recently, fingertip word-lookup technology based on desktop learning devices has emerged. Specifically, the user points a finger at an unknown target word, and the desktop learning device captures, through its camera, an image of the finger pointing at the target word. The device recognizes the text content the finger points to, performs a word query on the recognized text to obtain the annotation content of the target word in a dictionary, and then displays the queried target word and its annotation content on the display screen of the desktop learning device. The user therefore only needs to point at the target word to obtain its annotation content. In this process, the accuracy of target-word recognition is an important factor affecting the word query result, but with traditional approaches the accuracy of target-text recognition is not high.
  • The present application provides a text content recognition method, apparatus, computer device and storage medium capable of improving the accuracy of text recognition for fingertip word lookup.
  • In a first aspect, the present application provides a text content recognition method, the method comprising:
  • acquiring a currently collected video frame image;
  • detecting a fingertip in the video frame image, and obtaining the fingertip position of the fingertip;
  • intercepting a candidate region in the video frame image based on the fingertip position;
  • performing fingertip regression positioning on the candidate region and performing text detection on the candidate region to obtain a text region in the candidate region; and
  • identifying the text content in the text region.
  • In a second aspect, the present application further provides a text content recognition apparatus, the apparatus comprising:
  • an image acquisition module configured to acquire the currently collected video frame image;
  • a fingertip position detection module configured to detect the fingertip in the video frame image and obtain the fingertip position of the fingertip;
  • a candidate region interception module configured to intercept a candidate region in the video frame image based on the fingertip position;
  • a fingertip regression and text area detection module configured to perform fingertip regression positioning on the candidate area and perform text detection on the candidate area to obtain a text area in the candidate area; and
  • a content recognition module configured to recognize the text content in the text area.
  • the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect when executing the computer program.
  • the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method described in the first aspect is implemented.
  • FIG. 1 is an application environment diagram of the text content recognition method in one embodiment;
  • FIG. 2 is a schematic flowchart of a text content recognition method in one embodiment
  • FIG. 3 is a schematic flowchart of a text content recognition method in an application example
  • FIG. 4 is a schematic diagram of an interception candidate region in an application example
  • FIG. 5 is a schematic diagram of an interception candidate region in another application example
  • FIG. 6 is a schematic diagram of fingertip regression and text detection in an application example
  • FIG. 7 is a structural block diagram of a text content recognition device in one embodiment
  • FIG. 8 is an internal structural diagram of a computer device in one embodiment.
  • the text content recognition method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the computer device 20 is placed on the desktop 10 , and the computer device 20 carries or is externally connected with a camera device 30 .
  • The user places the text material 40 to be finger-read on the desktop 10 and points a finger at the text content to be recognized or translated; the camera device 30 captures this scene to obtain a video frame image, and the computer device 20 acquires the video frame image and processes it to identify the text content the user's finger points to.
  • The computer device 20, carrying its own camera device 30 or externally connected to one, can also be fixed or placed in other positions.
  • In use, the user holds the text material 40 with one hand, or fixes it with another device, within the shooting range of the camera device 30, and points a finger at the text content to be recognized or translated; the camera device 30 captures the scene to obtain a video frame image, and the computer device 20 acquires and processes the video frame image to identify the text content the user's finger points to.
  • the computer device 20 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
  • a text content recognition method is provided, which is described by taking the method applied to the computer device 20 in FIG. 1 as an example, including the following steps S201 to S205 .
  • Step S201 Acquire a currently collected video frame image.
  • the video frame image can be obtained by capturing video with a camera device carried by the computer device or a camera device externally connected to the computer device.
  • Step S202 Detect the fingertip in the video frame image, and obtain the position of the fingertip of the finger.
  • the fingertip position may be the position of the detected fingertip in the video frame image.
  • In some embodiments, the fingertip position may be obtained by detecting the video frame image with a pre-trained fingertip recognition model, which may specifically include:
  • detecting the fingertip in the video frame image through the trained fingertip recognition model;
  • when the fingertip recognition model detects the fingertip, outputting the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip;
  • when the fingertip recognition model does not detect a fingertip, outputting preset regression position coordinates.
  • When a fingertip is present in the video frame image, the fingertip recognition model can directly output the coordinates of the detected fingertip position, obtaining the fingertip position of the finger. When no fingertip is present, the model outputs the preset regression position coordinates instead, which avoids the logical error of the model having no output when no fingertip is recognized, thereby ensuring both the accuracy and the availability of the fingertip recognition model's recognition results.
  • In some embodiments, the fingertip recognition model may be trained in the following manner.
  • Fingertip sample images are acquired. Some fingertip sample images contain a fingertip, and the actual fingertip position is recorded as the target fingertip position of that sample image. Other fingertip sample images contain no fingertip, and their target fingertip position is set to the preset regression position coordinates.
  • The fingertip sample images are input into the fingertip recognition model to be trained, which processes each input sample image to obtain a sample fingertip position as the training result.
  • The sample fingertip position of each fingertip sample image is compared with the target fingertip position of that sample image, and the model training error is calculated according to the comparison results.
  • The function for calculating the model training error may be any possible error function, such as mean square error, which is not specifically limited in the embodiments of the present application.
  • When the model training error meets the error requirement and the number of training iterations reaches the preset number, the training end condition is met, and the fingertip recognition model from the last training iteration is taken as the trained fingertip recognition model. Otherwise, the training end condition is not met; the model parameters of the fingertip recognition model to be trained are adjusted, and the process returns to the step of inputting the fingertip sample images into the model, until the training end condition is reached.
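The target construction and training error just described can be sketched as follows. `make_target` and `mse` are hypothetical helper names introduced only for illustration; the sentinel coordinate (-1, -1) follows the preset regression position coordinate given later in this document.

```python
SENTINEL = (-1.0, -1.0)  # preset regression position: outside the camera's range

def make_target(fingertip_xy):
    """Regression target for one fingertip sample image.

    Samples containing a fingertip use its recorded position; samples
    with no fingertip use the preset sentinel coordinates, so the model
    always has a well-defined target.
    """
    return fingertip_xy if fingertip_xy is not None else SENTINEL

def mse(pred, target):
    """Mean square error between a predicted and a target position."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

For example, a sample with no fingertip whose prediction matches the sentinel contributes zero error, while a fingertip recorded at (3, 4) but predicted at (0, 0) contributes an error of 12.5.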
  • In some embodiments, after the fingertip position is obtained, the method may further include: determining, based on the fingertip position, whether the fingertip is in a stable state, and entering the subsequent processing flow only when it is. This avoids the resource consumption of entering the subsequent flow when a fingertip is falsely detected and the recognition accuracy would be low.
  • The preset offset may be set in combination with actual technical needs. In some embodiments, the preset offset may be set based on the resolution of the video capture device (i.e., the above-mentioned camera device) that captures the video frame image; for example, the preset offset is positively correlated with that resolution.
  • In some embodiments, the preset offset may be set to 10 pixels.
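The stability judgment described above can be sketched as a per-axis offset check against the preceding frames. The function name and the choice of a per-axis (rather than Euclidean) offset are assumptions for illustration.

```python
def is_fingertip_stable(history, current, max_offset=10, window=3):
    """Return True when the fingertip has barely moved.

    `history` holds (x, y) fingertip positions from preceding frames.
    The fingertip counts as stable when its offset relative to each of
    the last `window` frames stays within `max_offset` pixels on both
    axes (10 pixels in the example above).
    """
    recent = history[-window:]
    if len(recent) < window:        # not enough frames observed yet
        return False
    cx, cy = current
    return all(abs(cx - x) <= max_offset and abs(cy - y) <= max_offset
               for x, y in recent)
```

A larger `max_offset` (e.g. 15) tolerates more hand movement; a smaller one (e.g. 5) demands a steadier finger, matching the trade-off discussed later in this document.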
  • Step S203 Using the fingertip position as a reference, intercept a candidate region in the video frame image.
  • the fingertip position obtained above is used as a reference, and various possible ways can be used to cut out the candidate region in the video frame image.
  • In some embodiments, intercepting a candidate region in the video frame image with the fingertip position as a reference includes:
  • taking the fingertip position as the center point, extending a first number of pixels to both sides along a first coordinate axis, and extending a second number of pixels to both sides along a second coordinate axis; and
  • determining the region formed by the expanded fingertip position as the candidate region in the video frame image.
  • The first coordinate axis may be the axis along one side of the video frame image, and the second coordinate axis may be the axis along another side of the video frame image.
  • The first number of pixels and the second number of pixels may be set to be the same or to be different.
  • The specific values of the first number of pixels and the second number of pixels can be determined in combination with the placement of the computer device 20 shown in FIG. 1 and the resolution and mounting height of the camera device 30. The computer device 20 has some common usage scenarios, such as being placed on a study desk or fixed at a specific position, and it usually recognizes text content in books, textbooks, or other printed material with conventional print sizes. The first and second numbers of pixels can therefore be determined from these common usage scenarios together with the resolution of the camera device 30 (carried by or externally connected to the computer device 20) and its vertical distance from the text content the finger points to. In other embodiments, the first number of pixels and the second number of pixels input by the user may also be obtained.
  • In some embodiments, determining the region formed by the expanded fingertip position as the candidate region in the video frame image includes: when the expanded fingertip position lies outside the boundary of the video frame image, determining the region formed by the boundary of the video frame image and the other expanded fingertip positions as the candidate region in the video frame image.
  • In this way, when an expanded boundary falls outside the image, the image boundary is used in its place as the basis for determining the candidate region, improving the accuracy of the determined candidate region.
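The interception rule above, including the border-clamping case, can be sketched as follows; the function name and the (left, top, right, bottom) return convention are illustrative assumptions.

```python
def intercept_candidate_region(frame_w, frame_h, tip, n1, n2):
    """Candidate region around the fingertip.

    Expands `n1` pixels to both sides along the first (x) axis and
    `n2` pixels along the second (y) axis, substituting the image
    border for any expanded edge that would fall outside the frame.
    Returns (left, top, right, bottom).
    """
    x, y = tip
    left   = max(0, x - n1)
    right  = min(frame_w, x + n1)
    top    = max(0, y - n2)
    bottom = min(frame_h, y + n2)
    return left, top, right, bottom
```

A fingertip near the center yields a full box centered on it; a fingertip near an edge yields a box clipped to the frame, matching the FIG. 5 case discussed later.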
  • Step S204 By performing fingertip regression positioning on the candidate region, and performing text detection on the candidate region, a text region in the candidate region is obtained.
  • In this step, the candidate region is further refined by fingertip regression positioning to obtain a more accurate fingertip position, and the text region in the candidate region is determined accordingly, improving the accuracy of the obtained text region.
  • the text region in the candidate region is obtained, including:
  • Fingertip localization and text detection are performed on the candidate region by the comprehensive model of fingertip regression and text detection obtained by training, and the text region in the candidate region is obtained.
  • The comprehensive model of fingertip regression and text detection may be a single model structure that performs fingertip positioning and text region detection at the same time, with both tasks adjusted jointly during training.
  • the method for obtaining the comprehensive model of fingertip regression and text detection by training includes the following steps S2041 to S2046.
  • Step S2041 Acquire the image samples to be trained, the target text area and the target fingertip position corresponding to the image samples to be trained.
  • The image sample to be trained may be an image containing a fingertip and the text content the fingertip points to. An image whose fingertip position and text region have already been identified may be used directly as an image sample to be trained, or an image sample may be used after its fingertip position and text region have been manually annotated.
  • the to-be-trained image samples may also be obtained in other ways, as long as the to-be-trained image samples have clearly corresponding target text regions and target fingertip positions.
  • Step S2042 Perform fingertip positioning and text detection on the to-be-trained image sample by using the to-be-trained fingertip regression and text detection integrated model to obtain the detected sample text area and the sample fingertip position.
  • Step S2043 Calculate and determine the loss of the text region based on the sample text region corresponding to the image sample to be trained and the target text region.
  • the text area loss is a parameter used to characterize the difference between the sample text area of the image sample to be trained and the target text area, and is used to reflect the loss of the text area of the image sample to be trained during the detection process.
  • The text area loss can be calculated in various possible ways. For example, in some embodiments, the difference between the target text area and the sample text area can be used as the text area loss; in other embodiments, other methods based on the target and sample text areas, such as mean square error, can also be used.
  • Step S2044 Calculate and determine the fingertip positioning loss based on the sample fingertip positions corresponding to the image samples to be trained and the target fingertip positions.
  • Fingertip positioning loss is a parameter used to characterize the difference between the sample fingertip position of the image sample to be trained and the target fingertip position, and is used to reflect the loss of fingertip positioning of the image sample to be trained during the detection process.
  • The fingertip localization loss can be calculated in various possible ways. For example, in some embodiments, the difference between the target fingertip position and the sample fingertip position can be used as the fingertip localization loss; in other embodiments, other methods based on the target and sample fingertip positions, such as mean square error, can also be used.
  • Step S2045 Determine the model loss by combining the text area loss and the fingertip localization loss.
  • the model loss can be comprehensively determined by combining the text area loss and the fingertip localization loss.
  • the sum of the text area loss and the fingertip localization loss of each image sample to be trained may be used as the model loss.
  • In other embodiments, the text area loss and fingertip localization loss of each image sample to be trained may be weighted and summed, and the resulting weighted sum taken as the model loss.
  • the model loss may also be determined by combining the text area loss and fingertip location loss of each image sample to be trained in other ways.
  • When the model training end condition is reached, the comprehensive model of fingertip regression and text detection from the last training iteration is taken as the trained comprehensive model of fingertip regression and text detection.
  • In some embodiments, the final model can be obtained by removing the fingertip-positioning output from the comprehensive model of the last training iteration, so that only the text region is output in actual use.
  • the model training end condition may be set according to actual technical requirements. In some embodiments, it may be determined that the model training end condition is reached when the model loss is less than or equal to a preset loss amount and reaches a preset number of model iterations. In other embodiments, other manners may also be used to determine the model training end condition.
  • When the model training end condition is not reached, step S2046 is entered.
  • Step S2046 Adjust the comprehensive model of fingertip regression and text detection to be trained based on the model loss, and return to the step of performing fingertip positioning and text detection on the image sample to be trained by using the comprehensive model of fingertip regression and text detection to be trained , until the model training end condition is reached.
  • model parameters related to fingertip positioning can be adjusted, and model parameters related to text area detection can also be adjusted.
  • the text area loss and the fingertip localization loss may be combined to determine how to adjust the model parameters.
  • other methods may also be used to determine the adjustment strategy for adjusting the model parameters.
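One concrete way to realise the combined objective of steps S2043 to S2045 is a (possibly weighted) sum of the two losses. Squared error stands in here for the unspecified per-task loss functions, and all names are illustrative.

```python
def squared_error(pred, target):
    """Squared error between two coordinate tuples (box or point)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def model_loss(sample_boxes, target_boxes, sample_tips, target_tips,
               w_text=1.0, w_tip=1.0):
    """Multiple-loss objective: weighted sum of the text-region loss
    and the fingertip-localization loss over a batch of samples."""
    text_loss = sum(squared_error(s, t)
                    for s, t in zip(sample_boxes, target_boxes))
    tip_loss = sum(squared_error(s, t)
                   for s, t in zip(sample_tips, target_tips))
    return w_text * text_loss + w_tip * tip_loss
```

Setting `w_text` and `w_tip` to 1.0 gives the plain sum variant; other weights give the weighted-sum variant mentioned above.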
  • Step S205 Identify the text content in the text area.
  • the text content recognition may be performed on the text area to obtain the text content in the text area.
  • Various possible methods can be used to recognize the text content in the text area; for example, OCR (Optical Character Recognition) can be used.
  • In some embodiments, other methods may also be used to recognize the text content in the text area, which is not specifically limited in the embodiments of the present application.
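A minimal sketch of step S205: crop the detected text area out of the frame and hand it to an OCR backend. The `ocr` callable is an assumption (in a real system it could be, for instance, `pytesseract.image_to_string`); a toy frame of character rows is used here only to keep the example self-contained.

```python
def recognize_text(frame, text_box, ocr):
    """Crop `text_box` = (left, top, right, bottom) from `frame`
    (a row-major image, here a list of strings) and run OCR on it."""
    left, top, right, bottom = text_box
    crop = [row[left:right] for row in frame[top:bottom]]
    return ocr(crop)

# Toy OCR backend: join the cropped rows into a string.
frame = ["..........",
         "..hello...",
         ".........."]
word = recognize_text(frame, (2, 1, 7, 2), lambda rows: "".join(rows))
# word == "hello"
```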
  • According to the above text content recognition method, the candidate area is intercepted based on the detected fingertip position, further fingertip regression positioning is performed on the candidate area, and text detection is performed on that basis to obtain the text area; the text content in the text area is then recognized, completing the fingertip-positioned text content recognition process.
  • In this process, two rounds of fingertip positioning are combined to finally detect the text area: the candidate area is intercepted through high-speed fingertip positioning, and the text area within it is then detected through further fingertip regression positioning. This improves the accuracy of the detected text content while maintaining the efficiency of text content detection.
  • a desktop video stream is first obtained by shooting with a camera, and the desktop video stream may include multiple frames of video frame images.
  • Then, high-speed fingertip positioning is performed on the desktop video stream, i.e., the fingertip position in each video frame image is identified quickly and with a lightweight model, so that the high-speed fingertip detection and positioning can directly support the subsequent stability judgment and candidate region extraction.
  • For the fingertip recognition model, a CNN model can be used in some embodiments: the feature extraction backbone can be selected according to the characteristics of the business scenario (such as ResNet, MobileNet, etc.), the subsequent regression module uses a dense layer, and the loss uses mean square error (MSE).
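The regression module described above (backbone features into a dense layer, trained with MSE) can be sketched as follows. The backbone itself (ResNet, MobileNet, etc.) is abstracted into a feature vector, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def dense_regression_head(features, W, b):
    """Dense layer mapping backbone features to (x, y) fingertip
    coordinates, as in the regression module described above."""
    return features @ W + b

def mse_loss(pred, target):
    """Mean square error, the loss named in the application example."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```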
  • For samples in which no fingertip appears, the regression target output by the fingertip recognition model is a preset regression position coordinate, which can be set to a coordinate outside the shooting range of the camera device, such as (-1, -1).
  • For samples in which a fingertip appears, the regression target is its real coordinates, i.e., the position coordinates of the fingertip recognized by the fingertip recognition model (for example (123, 56)), which gives the fingertip position of the finger.
  • After the fingertip position is obtained, the stability of the fingertip is further judged, and the next step is performed only when the fingertip is stable.
  • Whether the fingertip is stable can be determined by calculating the coordinate change of the fingertip position across adjacent video frames. For example, if the coordinate offset of the fingertip position in a given frame relative to each of a preceding preset number of adjacent video frames (for example, the previous 3 or more frames) is less than a preset offset, such as 10 pixels (this value can be adjusted in combination with the resolution of the camera device or other factors), the fingertip is considered stable.
  • The preset offset may be set in combination with the required stability accuracy. If the stability requirement is relatively low, the preset offset can be set relatively large, such as 15 pixels: as long as the coordinate offsets relative to the preceding preset number of adjacent video frames all fall within 15 pixels, the fingertip is considered stable. For the user, this allows movement within a relatively large range, which helps improve the user experience.
  • If the stability requirement is relatively high, the preset offset can be set relatively small, such as 5 pixels: only when the coordinate offsets all fall within 5 pixels is the fingertip considered stable, so the user must keep the finger within a relatively small range. In an actual technical scenario, the camera resolution, user experience, and stability accuracy can be weighed together.
  • With the fingertip in a stable state, the candidate region is intercepted by extending the first number of pixels and the second number of pixels (for example, 100 pixels, which can be determined according to the resolution of the capture device) around the fingertip position. In one embodiment, as shown in FIG. 4, the fingertip position 51 is located at the center of the interception area 52.
  • When the region expanded outward from the fingertip position 51 exceeds the boundary of the video frame image, the interception area is determined directly based on the image boundary. As shown by the dotted box in FIG. 5, the downward-expanded boundary 61, the left-expanded boundary 62, the upward-expanded boundary 63, and the boundary 60 of the video frame image jointly determine the candidate region for interception.
  • In other embodiments, the interception area may also be determined in other manners.
  • the integrated model of fingertip regression and text detection is further used to determine the text regions.
  • The extraction of text content is related to both fingertip positioning and text detection, and the coordinates of the fingertip and the coordinates of the text content are highly correlated. Therefore, as shown in FIG. 6, the scheme of this embodiment uses a multiple-loss scheme to train a comprehensive model of fingertip regression and text detection, which performs fingertip positioning and text detection and outputs the text region corresponding to the identified fingertip position.
  • the recognized text area is input into the text recognition model for recognition, so as to recognize the specific text content.
  • Taking word lookup as an example, after the text content is identified, a dictionary can be further queried to obtain the definition of the identified text content, and the obtained definition can be displayed on the display screen of the computer device, as shown in FIG. 1.
  • As described above, the solution of this embodiment fully considers the application scenario of desktop learning devices: it first performs fast fingertip positioning, determines the interception area accordingly, then performs more refined fingertip regression positioning and text region detection on the interception area, and finally recognizes the text content. This coarse-to-fine approach both reduces the overall computation and improves the accuracy of the regressed coordinates; in addition, combining fingertip positioning and text detection into one model further improves the positioning accuracy.
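The coarse-to-fine flow summarised above can be glued together as below. Every callable is a stand-in for one of the components described in this document (fast fingertip model, candidate-region interception, joint regression and text-detection model, recognizer, dictionary); all names are illustrative.

```python
def fingertip_word_lookup(frame, detect_tip, intercept, detect_text,
                          recognize, lookup):
    """Coarse-to-fine pipeline: fast fingertip positioning, candidate
    region interception, refined text detection, text recognition,
    and dictionary lookup."""
    tip = detect_tip(frame)
    if tip is None:                 # no (stable) fingertip: do nothing
        return None
    region = intercept(frame, tip)
    box = detect_text(region)
    word = recognize(region, box)
    return word, lookup(word)
```

In a real system `detect_tip` would also fold in the stability judgment, returning None until the fingertip has been steady for the preset number of frames.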
  • a text content recognition apparatus includes:
  • the image acquisition module 701 is configured to acquire the currently collected video frame image;
  • the fingertip position detection module 702 is configured to detect the fingertip in the video frame image, and obtain the fingertip position of the fingertip;
  • the candidate area interception module 703 is configured to intercept the candidate area in the video frame image based on the position of the fingertip;
  • the fingertip regression and text area detection module 704 is configured to perform text detection on the candidate area by performing fingertip regression positioning on the candidate area to obtain the text area in the candidate area;
  • the content recognition module 705 is configured to recognize the text content in the text area.
  • In some embodiments, the device further includes a fingertip stability determination module configured to determine, based on the fingertip position, whether the fingertip is in a stable state.
  • The candidate area interception module 703 is configured to intercept the candidate area in the video frame image based on the fingertip position when the fingertip stability determination module determines that the fingertip is in the stable state.
  • The fingertip stability determination module is configured to determine that the fingertip is in the stable state when the offset between the fingertip position and the fingertip positions in the preceding preset number of adjacent video frame images is less than the preset offset.
  • The fingertip position detection module 702 is configured to detect the fingertip in the video frame image through the trained fingertip recognition model; when the fingertip recognition model detects the fingertip, output the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip; and when the fingertip recognition model does not detect a fingertip, output the preset regression position coordinates, where the preset regression position coordinates do not belong to the coordinate range of the video frame image.
  • The candidate area interception module 703 is configured to, taking the fingertip position as the center point, expand the first number of pixels to both sides of the first coordinate axis and the second number of pixels to both sides of the second coordinate axis, and determine the candidate region in the video frame image according to the region formed by the expanded fingertip position.
  • The candidate area interception module 703 is configured to, when the expanded fingertip position is located outside the boundary of the video frame image, determine the region formed by the boundary of the video frame image and the other expanded fingertip positions as the candidate region in the video frame image.
  • the fingertip regression and text area detection module 704 is configured to perform fingertip positioning and text detection on the candidate area through the comprehensive model of fingertip regression and text detection obtained by training, and obtain the text in the candidate area. area.
  • In some embodiments, the apparatus further includes a training module for the comprehensive model of fingertip regression and text detection, configured to: acquire image samples to be trained, together with the target text area and target fingertip position corresponding to each image sample; perform fingertip positioning and text detection on the image samples with the comprehensive model of fingertip regression and text detection to be trained, obtaining the detected sample text area and sample fingertip position; calculate the text area loss based on the sample text area and target text area corresponding to each image sample; calculate the fingertip localization loss based on the sample fingertip position and target fingertip position; combine the text area loss and the fingertip localization loss to determine the model loss; and adjust the comprehensive model to be trained based on the model loss, returning to the step of performing fingertip positioning and text detection on the image samples, until the model training end condition is reached.
  • Each module in the above text content recognition apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
  • the above modules may be embedded in, or independent of, the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
  • a computer device is provided; the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus.
  • the processor of the computer device is configured to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is configured to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through Wi-Fi, an operator network, Near Field Communication (NFC), or other technologies.
  • the computer program, when executed by the processor, implements a text content recognition method.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, or a button, a trackball, or a touchpad set on the housing of the computer device, or an external keyboard, trackpad, or mouse.
  • FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device is provided, comprising a memory and a processor, where a computer program is stored in the memory, and the processor implements the method in any of the above-described embodiments when executing the computer program.
  • a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method in any of the above-described embodiments.
  • a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Abstract

A method and apparatus for text content recognition, a computer device, and a storage medium, comprising: acquiring a video frame image currently collected (S201); detecting a fingertip in the video frame image and acquiring the tip position of the fingertip (S202); with the tip position serving as a reference, capturing a candidate area in the video frame image (S203); by means of tip regression positioning performed with respect to the candidate area and text detection performed with respect to the candidate area, acquiring a text area in the candidate area (S204); and recognizing a text content in the text area (S205).

Description

Text content recognition method, apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 202110336251.7, filed with the China Patent Office on March 29, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and relates, for example, to a text content recognition method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, fingertip word-lookup technology based on desktop learning devices has emerged. Specifically, a user points a finger at an unfamiliar target word; the desktop learning device captures, through a camera, a picture of the finger pointing at the target word, recognizes the text content the finger points to, performs a dictionary query on the recognized text content to obtain the annotation of the target word in the dictionary, and then displays the target word obtained by the query and its annotation on the display screen of the desktop learning device. The user thus only needs to point at the target word to obtain its annotation. In this process, the accuracy of recognizing the target word is an important factor affecting the word query result, but in the traditional approach, the accuracy of recognizing the target text is not high.
Summary
The present application provides a text content recognition method, apparatus, computer device, and storage medium capable of improving the accuracy of text recognition for fingertip word lookup.
In a first aspect, the present application provides a text content recognition method, the method comprising:
acquiring a currently collected video frame image;
detecting a fingertip in the video frame image, and obtaining a fingertip position of the fingertip;
taking the fingertip position as a reference, intercepting a candidate region in the video frame image;
performing fingertip regression positioning on the candidate region and performing text detection on the candidate region, to obtain a text region in the candidate region;
recognizing text content in the text region.
In a second aspect, the present application further provides a text content recognition apparatus, the apparatus comprising:
an image acquisition module, configured to acquire a currently collected video frame image;
a fingertip position detection module, configured to detect a fingertip in the video frame image and obtain a fingertip position of the fingertip;
a candidate region intercepting module, configured to intercept a candidate region in the video frame image taking the fingertip position as a reference;
a fingertip regression and text region detection module, configured to perform fingertip regression positioning on the candidate region and perform text detection on the candidate region, to obtain a text region in the candidate region;
a content recognition module, configured to recognize text content in the text region.
In a third aspect, the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method of the first aspect.
Brief Description of the Drawings
FIG. 1 is an application environment diagram of a text content recognition method in one embodiment;
FIG. 2 is a schematic flowchart of a text content recognition method in one embodiment;
FIG. 3 is a schematic flowchart of a text content recognition method in an application example;
FIG. 4 is a schematic diagram of intercepting a candidate region in an application example;
FIG. 5 is a schematic diagram of intercepting a candidate region in another application example;
FIG. 6 is a schematic diagram of fingertip regression and text detection in an application example;
FIG. 7 is a structural block diagram of a text content recognition apparatus in one embodiment;
FIG. 8 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application is described in detail below with reference to the accompanying drawings and embodiments.
The text content recognition method provided by the present application can be applied in the application environment shown in FIG. 1. A computer device 20 is placed on a desktop 10, and the computer device 20 carries its own camera 30 or is externally connected to one. In use, the user places text material 40 to be finger-read on the desktop 10 and points a finger at the text content to be recognized or translated; the camera 30 captures this scene to obtain a video frame image, and the computer device 20 obtains the video frame image and processes it to recognize the text content the user's finger points to. In some embodiments, the computer device 20 carrying or externally connected to the camera 30 may also be fixed or placed at another position; in use, the user holds the text material 40 with one hand or another device within the shooting range of the camera 30 and points a finger at the text content to be recognized or translated, the camera 30 captures this scene to obtain a video frame image, and the computer device 20 obtains the video frame image and processes it to recognize the text content the user's finger points to.
After the text content the user's finger points to is recognized, in a word-lookup application scenario, a dictionary query is performed on the recognized text content to obtain the annotation of the target word in the dictionary, and the target word obtained by the query and its annotation are then displayed on the display screen of the computer device 20, as shown in FIG. 1. In a retrieval application scenario, the recognized text content is retrieved, and the retrieval result is displayed on the display screen of the computer device 20. In other application scenarios, other further processing may also be performed after the text content is recognized. The computer device 20 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices.
In one embodiment, as shown in FIG. 2, a text content recognition method is provided, which is described taking application of the method to the computer device 20 in FIG. 1 as an example, and includes the following steps S201 to S205.
Step S201: Acquire a currently collected video frame image.
The video frame image may be obtained by video shooting with a camera carried by the computer device itself or a camera externally connected to the computer device.
Step S202: Detect a fingertip in the video frame image, and obtain a fingertip position of the fingertip.
When detecting the fingertip in the video frame image, various possible manners may be used to detect the fingertip position of the fingertip in the video frame image. The fingertip position may specifically be the detected position of the fingertip in the video frame image.
In some embodiments of the present application, the fingertip position may be obtained by detecting the video frame image with a pre-trained fingertip recognition model, which may specifically include:
detecting the fingertip in the video frame image with the trained fingertip recognition model;
when the fingertip recognition model detects a fingertip, outputting the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip;
when the fingertip recognition model does not detect a fingertip, outputting preset return position coordinates, where the preset return position coordinates are coordinates that do not belong to the coordinate range of the video frame image.
Thus, during detection with the fingertip recognition model, when a fingertip is present in the video frame image, the coordinates of the detected fingertip position can be output directly to obtain the fingertip position; when no fingertip is detected in the video frame image, the fingertip recognition model outputs the preset return position coordinates. This avoids the logical error of the model being unable to output a result when no fingertip is detected, ensuring the accuracy and availability of the recognition results of the fingertip recognition model.
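The sentinel-output convention described above can be sketched as follows. The functions, the `(-1.0, -1.0)` sentinel value, and the candidate-list interface are illustrative assumptions, not details fixed by the application; the application only requires that the no-detection output lie outside the image's coordinate range.

```python
# Sentinel convention: when no fingertip is found, emit coordinates that
# cannot belong to the video frame image, so callers always receive a
# coordinate pair. The (-1.0, -1.0) value is an assumed choice.
SENTINEL = (-1.0, -1.0)

def fingertip_position(detections, width, height):
    """Return the first in-range fingertip candidate, or the sentinel.

    `detections` is a list of (x, y) candidates produced by some
    fingertip recognition model (not specified here).
    """
    for x, y in detections:
        if 0 <= x < width and 0 <= y < height:
            return (x, y)
    return SENTINEL

def is_valid(position, width, height):
    """True when `position` lies inside the image coordinate range."""
    x, y = position
    return 0 <= x < width and 0 <= y < height
```

Downstream steps can then call `is_valid` on every model output instead of special-casing a missing detection.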
In some embodiments, the fingertip recognition model may be trained in the following manner.
First, fingertip sample images are acquired. Some fingertip sample images contain a fingertip, and the corresponding specific fingertip position is recorded; this position is the target fingertip position of that sample image. Some fingertip sample images contain no fingertip, and their corresponding fingertip position may be set to the preset return position coordinates, which then serve as the target fingertip position of that sample image.
Then, the fingertip sample images are input into the fingertip recognition model for processing, and the model processes the input sample images to obtain training results of sample fingertip positions.
The sample fingertip position of each fingertip sample image is compared with the target fingertip position of that image, and the model training error is calculated according to the comparison result. The function for calculating the model training error may be any possible error function, such as mean square error, which is not specifically limited in the embodiments of the present application.
If the model training error meets the error requirement and the number of training iterations reaches the preset number of iterations, it is determined that the training end condition is met, and the fingertip recognition model from the last training round is determined as the trained fingertip recognition model. Otherwise, it is determined that the training end condition is not met; after the model parameters of the fingertip recognition model to be trained are adjusted, the process returns to the step of inputting the fingertip sample images into the fingertip recognition model for processing, until the model training end condition is reached.
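The labeling convention and error computation above can be sketched numerically: samples with a fingertip get the annotated coordinates as their target, samples without one get the preset return coordinates, and a mean-square error is computed over the batch. The sentinel value and the exact error function are illustrative assumptions.

```python
# Assumed sentinel target for sample images containing no fingertip.
SENTINEL = (-1.0, -1.0)

def make_target(fingertip_xy):
    """Target fingertip position for one sample image.

    `fingertip_xy` is the annotated coordinate, or None when the sample
    image contains no fingertip.
    """
    return fingertip_xy if fingertip_xy is not None else SENTINEL

def mean_square_error(predicted, targets):
    """Mean squared coordinate error over a batch of sample images."""
    total = 0.0
    for (px, py), (tx, ty) in zip(predicted, targets):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(predicted)
```

Training would repeat: predict, compute this error, adjust parameters, until the error requirement and iteration count are both met.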
In some embodiments, after the fingertip position of the fingertip is obtained, the method may further include the steps of:
determining, based on the fingertip position, whether the fingertip is in a fingertip-stable state;
when the fingertip is in the fingertip-stable state, entering the step of intercepting the candidate region in the video frame image taking the fingertip position as a reference;
when the fingertip is not in the fingertip-stable state, returning to the step of detecting the fingertip in the video frame image.
Thus, the subsequent processing flow is entered only after the fingertip position is detected and the fingertip is in a stable state, avoiding the resource consumption and low recognition accuracy caused by entering the subsequent flow upon a false fingertip detection.
In some embodiments, the fingertip may be determined to be in the fingertip-stable state when the offset between the fingertip position and the fingertip positions in a preset number of preceding adjacent video frame images is less than a preset offset. The preset offset may be set according to actual technical needs; in some embodiments of the present application, the preset offset may be set based on the resolution of the video capture device (i.e., the above-mentioned camera) that captures the video frame image, for example, the preset offset is positively correlated with the resolution of the video capture device. In some specific examples, the preset offset may be set to 10 pixel values.
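The stability check above can be sketched as follows. The application mentions a 10-pixel threshold as one example; the history length of 3 preceding frames and the per-axis comparison are illustrative assumptions.

```python
from collections import deque

# The fingertip is treated as stable when its position in the current
# frame deviates from each of the preceding HISTORY_SIZE frames by less
# than PRESET_OFFSET pixels on both axes.
PRESET_OFFSET = 10.0   # pixels; 10 is the example value from the text
HISTORY_SIZE = 3       # assumed number of preceding frames compared

history = deque(maxlen=HISTORY_SIZE)

def is_fingertip_stable(position):
    """Record `position` and report whether it was stable relative to
    the positions seen in the preceding frames."""
    stable = len(history) == HISTORY_SIZE and all(
        abs(position[0] - x) < PRESET_OFFSET
        and abs(position[1] - y) < PRESET_OFFSET
        for x, y in history
    )
    history.append(position)
    return stable
```

Until the history fills, the check reports unstable, which matches the intent of returning to detection when stability is not yet established.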
Step S203: Taking the fingertip position as a reference, intercept a candidate region in the video frame image.
Taking the fingertip position obtained above as a reference, various possible manners may be used to intercept the candidate region from the video frame image.
In some embodiments, intercepting the candidate region in the video frame image taking the fingertip position as a reference includes:
taking the fingertip position as the center point, expanding by a first number of pixels to both sides along a first coordinate axis, and expanding by a second number of pixels to both sides along a second coordinate axis;
determining the region formed by the expanded positions as the candidate region in the video frame image.
The first coordinate axis may be the axis along one side of the video frame image, and the second coordinate axis may be the axis along another side of the video frame image. The first number of pixels and the second number of pixels may be set to be the same or different.
The specific values of the first number of pixels and the second number of pixels may be determined in combination with the specific placement position of the computer device 20 shown in FIG. 1 and the resolution and height of the camera 30. Usually, when a user uses the computer device 20, there are some common usage scenarios, for example, the device is placed on a study desk or fixed at a specific position, and when recognizing text content, it usually recognizes text content in books, textbooks, or other printed texts of conventional printing size. Therefore, the first number of pixels and the second number of pixels may be determined in combination with, under these common usage scenarios, the resolution of the camera 30 carried by or externally connected to the computer device 20, the vertical distance of the camera 30 relative to the text content the finger points to, and the like. In other embodiments, the first number of pixels and the second number of pixels may also be obtained from user input.
In some embodiments, determining the region formed by the expanded positions as the candidate region in the video frame image includes:
when an expanded position lies outside the boundary of the video frame image, determining the region formed by the image boundary corresponding to that expanded position and the other expanded positions as the candidate region in the video frame image.
Thus, when an expanded position lies outside the boundary of the video frame image, the boundary corresponding to that expanded position can serve as the basis for determining the candidate region, improving the accuracy of the determined candidate region.
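The expansion and boundary handling above can be sketched as follows: expand from the fingertip position along each axis, then clamp any side that falls outside the frame to the frame boundary. The expansion amounts `dx` and `dy` stand in for the "first number" and "second number" of pixels; the values below are illustrative, not from the application.

```python
def candidate_region(tip_x, tip_y, width, height, dx=160, dy=80):
    """Return (left, top, right, bottom) of the candidate region.

    dx / dy play the role of the first and second number of pixels.
    A side that would fall outside the video frame image is replaced
    by the corresponding image boundary.
    """
    left = max(tip_x - dx, 0)
    right = min(tip_x + dx, width)
    top = max(tip_y - dy, 0)
    bottom = min(tip_y + dy, height)
    return (left, top, right, bottom)
```

A fingertip near the center yields a full 2·dx by 2·dy region, while a fingertip near a corner yields a region clipped to the image boundary, matching the boundary case described above.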
Step S204: Perform fingertip regression positioning on the candidate region and perform text detection on the candidate region, to obtain a text region in the candidate region.
When fingertip regression positioning and text detection are performed on the candidate region to obtain the text region in it, finer-grained fingertip localization is further performed on the candidate region to obtain a more precise fingertip position, and the text region in the candidate region is determined accordingly, thereby improving the accuracy of the obtained text region.
In some embodiments, performing fingertip regression positioning on the candidate region and performing text detection on the candidate region to obtain the text region in the candidate region includes:
performing fingertip localization and text detection on the candidate region with a trained comprehensive model of fingertip regression and text detection, to obtain the text region in the candidate region.
The comprehensive model of fingertip regression and text detection may be a model structure that performs fingertip localization and text region detection simultaneously, and it is adjusted using multiple errors: the fingertip localization error of fingertip localization and the text region error of text region detection.
In some embodiments, the manner of training to obtain the comprehensive model of fingertip regression and text detection includes the following steps S2041 to S2046.
Step S2041: Acquire image samples to be trained, as well as target text regions and target fingertip positions corresponding to the image samples to be trained.
An image sample to be trained may be an image containing a fingertip with the fingertip pointing at text content. An image in which the fingertip position and text region have already been identified may be used as an image sample to be trained; alternatively, after an image sample is obtained, the fingertip position and text region may be manually identified, and the image sample then used as an image sample to be trained. In other embodiments, image samples to be trained may also be obtained in other ways, as long as each image sample to be trained has a clearly corresponding target text region and target fingertip position.
Step S2042: Perform fingertip localization and text detection on the image samples to be trained with the to-be-trained comprehensive model of fingertip regression and text detection, to obtain detected sample text regions and sample fingertip positions.
The image sample to be trained is input into the to-be-trained comprehensive model of fingertip regression and text detection, which processes the image sample to obtain the sample fingertip position detected from the image sample and the sample text region corresponding to that sample fingertip position.
Step S2043: Calculate and determine a text region loss based on the sample text region and the target text region corresponding to the image sample to be trained.
The text region loss is a parameter characterizing the difference between the sample text region and the target text region of the image sample to be trained, reflecting the loss of the text region of the image sample during detection. The text region loss may be calculated and determined in various possible ways; for example, in some embodiments, the difference between the target text region and the sample text region may be used as the text region loss, while in other embodiments, other ways, such as mean square error, may also be used to calculate the text region loss based on the target text region and the sample text region.
Step S2044: Calculate and determine a fingertip localization loss based on the sample fingertip position and the target fingertip position corresponding to the image sample to be trained.
The fingertip localization loss is a parameter characterizing the difference between the sample fingertip position and the target fingertip position of the image sample to be trained, reflecting the loss of fingertip localization of the image sample during detection. The fingertip localization loss may be calculated and determined in various possible ways; for example, in some embodiments, the difference between the target fingertip position and the sample fingertip position may be used as the fingertip localization loss, while in other embodiments, other ways, such as mean square error, may also be used to calculate the fingertip localization loss based on the target fingertip position and the sample fingertip position.
Step S2045: Combine the text region loss and the fingertip localization loss to determine a model loss.
The model loss may be comprehensively determined by combining the text region loss and the fingertip localization loss. In some embodiments, the sum of the text region loss and the fingertip localization loss of each image sample to be trained may be used as the model loss. In other embodiments, the text region loss and the fingertip localization loss of each image sample to be trained may be weighted and summed, the weighted sums of all image samples to be trained then summed, and the resulting sum used as the model loss. In still other embodiments, the model loss may also be determined by combining the text region loss and the fingertip localization loss of each image sample to be trained in other ways.
若基于上述计算得到的模型损失确定满足模型训练结束条件,则将最后一次训练的待训练指尖回归与文本检测综合模型,作为训练得到的指尖回归与文本检测综合模型。一些实施例中,在确定满足模型训练结束条件时,也可以是在最后一次训练的待训练指尖回归与文本检测综合模型的基础上,通过去掉指尖定位的输出部分,以得到最终的训练得到的指尖回归与文本检测综合模型,使得在最终进行使用时,可以只需要输出文本区域即可。If it is determined that the model training end condition is satisfied based on the model loss obtained by the above calculation, the last training to be trained fingertip regression and text detection comprehensive model is used as the fingertip regression and text detection comprehensive model obtained by training. In some embodiments, when it is determined that the end condition of model training is met, the final training can be obtained by removing the output part of fingertip positioning on the basis of the comprehensive model of fingertip regression and text detection to be trained in the last training. The obtained comprehensive model of fingertip regression and text detection makes it possible to output only the text area in the final use.
模型训练结束条件可以结合实际技术需要进行设定，在一些实施例中，可以是在所述模型损失小于或者等于预设损失量且达到预设模型迭代次数时，确定达到模型训练结束条件。在其他实施例中，也可以采用其他的方式来确定模型训练结束条件。The model training end condition may be set according to actual technical requirements. In some embodiments, it may be determined that the model training end condition is reached when the model loss is less than or equal to a preset loss amount and a preset number of model iterations has been reached. In other embodiments, other manners may also be used to determine the model training end condition.
若基于上述计算得到的模型损失确定不满足模型训练结束条件,则进入下述步骤S2046。If it is determined based on the model loss obtained by the above calculation that the model training end condition is not met, the following step S2046 is entered.
步骤S2046：基于所述模型损失调整所述待训练指尖回归与文本检测综合模型，返回通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测的步骤，直至达到模型训练结束条件。Step S2046: Adjust the fingertip regression and text detection comprehensive model to be trained based on the model loss, and return to the step of performing fingertip positioning and text detection on the image samples to be trained through the model to be trained, until the model training end condition is reached.
在对待训练指尖回归与文本检测综合模型进行调整时,可以调整与指尖定位相关的模型参数,也可以是调整与文本区域检测相关的模型参数。一些实施例中,可以结合文本区域损失和指尖定位损失来确定如何对模型参数进行调整。在其他实施例中,也可以采用其他方式来确定对模型参数进行调整的调整策略。When adjusting the comprehensive model of fingertip regression and text detection to be trained, model parameters related to fingertip positioning can be adjusted, and model parameters related to text area detection can also be adjusted. In some embodiments, the text area loss and the fingertip localization loss may be combined to determine how to adjust the model parameters. In other embodiments, other methods may also be used to determine the adjustment strategy for adjusting the model parameters.
步骤S205:识别所述文本区域中的文本内容。Step S205: Identify the text content in the text area.
在获得文本区域后，则可以针对文本区域进行文本内容识别，获得文本区域中的文本内容。可以采用各种可能的方式识别出文本区域中的文本内容，例如采用OCR(Optical Character Recognition,光学字符识别)识别出文本内容。在其他实施例中，也可以采用其他的方式识别出文本区域中的文本内容，本申请实施例不做具体限定。After the text area is obtained, text content recognition may be performed on the text area to obtain the text content in the text area. Various possible ways can be used to recognize the text content in the text area, for example, OCR (Optical Character Recognition) is used to recognize the text content. In other embodiments, other methods may also be used to recognize the text content in the text area, which is not specifically limited in the embodiments of the present application.
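The recognition step can be sketched as cropping the detected text area out of the frame and handing it to whatever OCR engine is deployed. In the following hedged Python illustration, `ocr_engine` is a stand-in assumption for the actual recognizer, and the frame is represented as nested lists of pixels:

```python
# Minimal sketch of the recognition step described above. The `ocr_engine`
# callable is a hypothetical stand-in for any OCR recognizer; this sketch
# only performs the region crop before delegating to it.

def recognize_text(frame, text_box, ocr_engine):
    """frame: pixel rows as nested lists; text_box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = text_box
    crop = [row[x1:x2] for row in frame[y1:y2]]  # cut the text area out
    return ocr_engine(crop)
```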
基于如上所述的本申请实施例中的方法，其通过检测采集的视频帧图像中的手指指尖获得指尖位置后，以检测的指尖位置为基准截取候选区域后，针对该候选区域进行进一步地指尖回归定位，以此为基础对候选区域进行文本检测，获得文本区域后，识别文本区域中的文本内容，以此实现指尖定位的文本内容识别过程，在这个过程中，结合两次指尖定位来最终实现文本区域的检测，初次通过高速指尖定位的方式来截取候选区域，再通过进一步的指尖回归定位来检测出候选区域中的文本区域，在提高文本内容检测的效率的基础上，提高了检测的文本内容的准确度。Based on the method in the embodiments of the present application described above, after the fingertip position is obtained by detecting the fingertip in the collected video frame image, a candidate area is intercepted with the detected fingertip position as a reference, further fingertip regression positioning is performed on the candidate area, and on this basis text detection is performed on the candidate area; after the text area is obtained, the text content in the text area is recognized, thereby realizing a fingertip-positioned text content recognition process. In this process, two rounds of fingertip positioning are combined to finally realize the detection of the text area: the candidate area is first intercepted by means of high-speed fingertip positioning, and the text area in the candidate area is then detected through further fingertip regression positioning, which improves the accuracy of the detected text content on the basis of improving the efficiency of text content detection.
基于如上所述的实施例中的方法,以下结合其中一个应用示例进行详细举例说明。Based on the method in the above-mentioned embodiment, a detailed description is given below in conjunction with one of the application examples.
参考图4所示,在一个具体示例中,本申请实施例在进行文本内容识别时,首先通过摄像装置进行拍摄得到桌面视频流,该桌面视频流会包含多帧的视频帧图像。Referring to FIG. 4 , in a specific example, when text content recognition is performed in the embodiment of the present application, a desktop video stream is first obtained by shooting with a camera, and the desktop video stream may include multiple frames of video frame images.
然后,对桌面视频流进行高速指尖定位,即快速、轻量地对视频帧图像中的指尖位置进行识别,从而通过高速指尖检测定位可以直接帮助后面的稳定性判断以及候选区域提取。Then, high-speed fingertip positioning is performed on the desktop video stream, that is, the position of the fingertip in the video frame image is quickly and lightly identified, so that the high-speed fingertip detection and positioning can directly help the subsequent stability judgment and candidate region extraction.
其中，用以进行高速指尖定位的指尖识别模型，一些实施例中可以采用CNN模型，其特征提取backbone可以根据业务场景的特点进行选择（如ResNet、MobileNet……），后面回归模块使用dense层，损失loss使用均方误差（Mean Square Error，MSE）等。Among them, for the fingertip recognition model used for high-speed fingertip positioning, a CNN model may be adopted in some embodiments; its feature-extraction backbone may be selected according to the characteristics of the business scenario (such as ResNet, MobileNet, ...), the subsequent regression module uses dense layers, and the loss uses mean square error (Mean Square Error, MSE), and so on.
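The MSE regression loss mentioned above can be written out explicitly. The following is a framework-free Python sketch for illustration only; a real system would use a deep learning library's built-in MSE loss alongside the CNN backbone:

```python
# Mean square error between a predicted and a target fingertip coordinate,
# as used to train the regression module described above. Pure-Python
# illustration; not tied to any particular deep learning framework.

def mse_loss(pred, target):
    """pred/target: coordinate tuples such as (x, y)."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```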
本申请实施例方案中，通过指尖识别模型对桌面视频流进行高速指尖定位和检测，指尖识别模型对指尖位置进行识别，在桌面视频流中未识别到手指指尖时，指尖识别模型输出的回归目标为预设回归位置坐标，该预设回归位置坐标可以设置为未在摄像装置的拍摄范围的坐标，例如(-1,-1)。如果在桌面视频流中识别到手指指尖，则指尖识别模型输出的回归目标为其真实坐标，即该指尖识别模型识别的指尖位置坐标（例如（123,56）），即获得手指指尖的指尖位置。In the solution of the embodiments of the present application, high-speed fingertip positioning and detection are performed on the desktop video stream through the fingertip recognition model. The fingertip recognition model recognizes the fingertip position; when no fingertip is recognized in the desktop video stream, the regression target output by the fingertip recognition model is a preset regression position coordinate, which may be set to a coordinate outside the shooting range of the camera device, such as (-1, -1). If a fingertip is recognized in the desktop video stream, the regression target output by the fingertip recognition model is its real coordinates, that is, the fingertip position coordinates recognized by the model (for example (123, 56)), thereby obtaining the fingertip position of the fingertip.
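The target encoding described above can be sketched as follows. This is a hedged illustration: Python, the helper names, and the in-frame check are assumptions of the sketch, while (-1, -1) and (123, 56) are the example values from the text:

```python
# Encoding of the regression target: a detected fingertip maps to its real
# pixel coordinates, while "no fingertip" maps to a preset position outside
# the camera's shooting range, (-1, -1) in the example above.

NO_FINGERTIP = (-1, -1)  # preset coordinate outside the image coordinate range

def regression_target(detected_xy):
    """detected_xy: (x, y) if a fingertip was detected, else None."""
    return detected_xy if detected_xy is not None else NO_FINGERTIP

def fingertip_present(xy, width, height):
    """True only when the output coordinate lies inside the frame."""
    x, y = xy
    return 0 <= x < width and 0 <= y < height
```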
在获得上述指尖位置之后，因为用户的指尖可能还没定下来，因此通过进一步判断指尖的稳定性，并在指尖稳定的情况下，才会进行下一步的处理过程。在一些实施例中，可以通过计算视频相邻帧的指尖位置的坐标变化来判断，例如如果某一帧的指尖位置相比于前预设数目的相邻视频帧（例如前3帧以上的相邻视频帧）的坐标偏移量小于预设偏移量，比如10个像素值（应当理解，这个值可以结合摄像装置的分辨率调整或者其他因素的考虑进行设定），则认为指尖稳定。一些具体示例中，预设偏移量的设定，可以结合对稳定的精度要求进行设定。若对稳定性要求相对较低，则预设偏移量可以设置的相对较大，例如15个像素值，即手指的指尖位置在前预设数目的相邻视频帧的坐标偏移量都在15个像素值的范围内，则认为指尖稳定，这种方式下，对于用户来说，用户可以在相对较大的范围内活动，有助于提高用户体验。若对稳定性要求相对较高，则预设偏移量可以设置的相对较小，例如5个像素值，即手指的指尖位置在前预设数目的相邻视频帧的坐标偏移量都在5个像素值的范围内，则认为指尖稳定，这种方式下，对于用户来说，需要用户只能在相对较小的范围内活动。在实际技术场景中，可以结合摄像装置的分辨率调整、用户体验、稳定性精度等综合考虑。After the above fingertip position is obtained, because the user's fingertip may not yet have settled, the stability of the fingertip is further judged, and the next processing step is performed only when the fingertip is stable. In some embodiments, this can be judged by calculating the coordinate change of the fingertip position across adjacent video frames: for example, if the coordinate offset of the fingertip position in a certain frame relative to a preset number of preceding adjacent video frames (for example, the preceding 3 or more adjacent video frames) is less than a preset offset, such as 10 pixel values (it should be understood that this value can be set in combination with the resolution of the camera device or other considerations), the fingertip is considered stable. In some specific examples, the preset offset can be set in combination with the required stability precision. If the stability requirement is relatively low, the preset offset can be set relatively large, for example 15 pixel values; that is, as long as the coordinate offsets of the fingertip position relative to the preceding preset number of adjacent video frames are all within 15 pixel values, the fingertip is considered stable. In this way, the user can move within a relatively large range, which helps to improve the user experience. If the stability requirement is relatively high, the preset offset can be set relatively small, for example 5 pixel values; that is, the fingertip is considered stable only when the coordinate offsets relative to the preceding preset number of adjacent video frames are all within 5 pixel values. In this way, the user can only move within a relatively small range. In actual technical scenarios, the resolution of the camera device, user experience, stability precision and so on can be comprehensively considered.
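The stability judgment described above can be sketched in Python as follows. Using a per-axis (Chebyshev-style) comparison is an assumption of this illustration, since the text only speaks of a coordinate offset; `n = 3` and `max_offset = 10` follow the examples in the text:

```python
# Hedged sketch of the fingertip stability judgment: the fingertip is
# considered stable when its position in the current frame deviates from
# each of the previous n frames by less than max_offset pixels per axis.

def fingertip_stable(current, history, n=3, max_offset=10):
    """current: (x, y); history: earlier (x, y) positions, most recent last."""
    if len(history) < n:
        return False  # not enough preceding frames yet to judge stability
    cx, cy = current
    return all(abs(cx - px) < max_offset and abs(cy - py) < max_offset
               for px, py in history[-n:])
```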
在指尖稳定之后，围绕上述高速指尖定位确定的指尖位置，截取一定的区域作为候选区域，例如以指尖位置为中心，向第一坐标轴和第二坐标轴的两侧外分别扩展第一数目像素和第二数目像素（例如100像素，具体可以根据设备拍摄的像素进行确定）。一个实施例中，如图4所示，基于指尖位置51确定截取区域52后，指尖位置51位于截取区域52的中心。当扩展后的位置位于视频帧图像之外时，则直接基于边界来确定截取区域。一个实施例中，如图5所示，基于指尖位置51往外扩展的区域如图5中的虚线框所示，此时，则基于向下扩展后的扩展边界61、向左扩展后的边界62、向上扩展后的边界63以及视频帧图像的边界60共同确定出截取的候选区域。在其他实施例中，也可以采用其他的方式确定出截取区域。通过截取候选区域可以有效提高二次指尖定位精度，同时可以使文本检测的准确度有所提升。After the fingertip is stable, a certain area around the fingertip position determined by the above high-speed fingertip positioning is intercepted as the candidate area, for example, taking the fingertip position as the center and extending a first number of pixels and a second number of pixels (for example, 100 pixels, which can be determined according to the pixels captured by the device) to the two sides of the first coordinate axis and the second coordinate axis respectively. In one embodiment, as shown in FIG. 4, after the interception area 52 is determined based on the fingertip position 51, the fingertip position 51 is located at the center of the interception area 52. When an extended position lies outside the video frame image, the interception area is determined directly based on the boundary. In one embodiment, as shown in FIG. 5, the area extended outward based on the fingertip position 51 is shown by the dotted box in FIG. 5; in this case, the intercepted candidate area is jointly determined by the downward-extended boundary 61, the leftward-extended boundary 62, the upward-extended boundary 63, and the boundary 60 of the video frame image. In other embodiments, the interception area may also be determined in other ways. Intercepting the candidate area in this way can effectively improve the accuracy of the secondary fingertip positioning, and at the same time improve the accuracy of text detection.
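The candidate-area interception with boundary handling (the FIG. 5 case) can be sketched as follows. This Python illustration assumes clamping against the frame edges, and `dx = dy = 100` follows the 100-pixel example in the text:

```python
# Sketch of candidate-area interception: extend dx pixels to both sides of
# the first axis and dy pixels to both sides of the second axis, centred on
# the fingertip, then clamp to the frame boundary when the expansion runs
# outside the video frame image.

def candidate_region(tip, frame_w, frame_h, dx=100, dy=100):
    """tip: (x, y) fingertip position. Returns (x1, y1, x2, y2)."""
    x, y = tip
    x1 = max(0, x - dx)
    y1 = max(0, y - dy)
    x2 = min(frame_w, x + dx)
    y2 = min(frame_h, y + dy)
    return (x1, y1, x2, y2)
```

When the fingertip is far from the edges, the fingertip lies at the center of the returned area, as in FIG. 4.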
在获得上一步提取的候选区域后，进一步采用指尖回归与文本检测综合模型确定出文本区域。其中，文本内容的提取与指尖定位和文本检测都有关，而指尖的坐标和文本内容的坐标是高度相关的，因此如图6所示，本实施例方案中使用多重损失的方案来训练得到指尖回归与文本检测综合模型，以此进行指尖定位和文本检测，并输出识别出的指尖位置对应的文本区域。After the candidate area extracted in the previous step is obtained, the fingertip regression and text detection comprehensive model is further used to determine the text area. The extraction of text content is related to both fingertip positioning and text detection, and the coordinates of the fingertip and the coordinates of the text content are highly correlated. Therefore, as shown in FIG. 6, the solution of this embodiment uses a multiple-loss scheme to train the fingertip regression and text detection comprehensive model, with which fingertip positioning and text detection are performed, and the text area corresponding to the recognized fingertip position is output.
然后,将识别出的文本区域输入到文本识别模型中进行识别,从而识别出具体的文本内容。Then, the recognized text area is input into the text recognition model for recognition, so as to recognize the specific text content.
在一些实施例中，以查词应用为例，在识别出文本内容后，可以进一步查询词典获得识别出的文本内容的释义，并将获得的释义在计算机设备上的显示屏上显示，如图1所示。In some embodiments, taking a word-lookup application as an example, after the text content is recognized, a dictionary may be further queried to obtain the definition of the recognized text content, and the obtained definition may be displayed on the display screen of the computer device, as shown in FIG. 1.
如上所述的本实施例的方案，充分考虑桌面学习设备的应用场景，先进行快速的指尖定位，据此确定截取区域，再针对截取区域进行更精细化的指尖回归定位和文本区域的检测，然后再进行文本内容的识别，其由粗到细的方式既能降低整体计算量，也能较好地提升回归坐标的精度；另外，将指尖定位和文本检测合并到一个模型里面，可以进一步提升定位精度。The solution of this embodiment described above fully considers the application scenario of desktop learning devices: fast fingertip positioning is performed first, the interception area is determined accordingly, and then more refined fingertip regression positioning and text area detection are performed on the interception area, followed by recognition of the text content. This coarse-to-fine approach can both reduce the overall amount of computation and better improve the precision of the regression coordinates; in addition, merging fingertip positioning and text detection into one model can further improve the positioning precision.
应该理解的是，虽然如上所述的各实施例涉及的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，这些流程图中的至少一部分步骤可以包括多个步骤或者多个阶段，这些步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily executed and completed at the same moment but may be executed at different moments; their execution order is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
在一个实施例中,如图7所示,提供了一种文本内容识别装置,所述装置包括:In one embodiment, as shown in FIG. 7, a text content recognition apparatus is provided, and the apparatus includes:
图像采集模块701,设置为获取当前采集的视频帧图像;The image acquisition module 701 is configured to acquire the currently acquired video frame image;
指尖位置检测模块702,设置为检测所述视频帧图像中的手指指尖,并获得所述手指指尖的指尖位置;A fingertip position detection module 702, configured to detect the fingertip in the video frame image, and obtain the fingertip position of the fingertip;
候选区域截取模块703,设置为以所述指尖位置为基准,截取所述视频帧图像中的候选区域;The candidate area interception module 703 is set to intercept the candidate area in the video frame image based on the position of the fingertip;
指尖回归和文本区域检测模块704,设置为通过对所述候选区域进行指尖回归定位,对所述候选区域进行文本检测,获得所述候选区域中的文本区域;The fingertip regression and text area detection module 704 is configured to perform text detection on the candidate area by performing fingertip regression positioning on the candidate area to obtain the text area in the candidate area;
内容识别模块705,设置为识别所述文本区域中的文本内容。The content recognition module 705 is configured to recognize the text content in the text area.
一些实施例中,所述装置还包括:指尖稳定态确定模块,设置为基于所述指尖位置,确定所述手指指尖是否处于指尖稳定状态;In some embodiments, the device further includes: a fingertip stability determination module configured to determine whether the fingertip is in a fingertip stable state based on the fingertip position;
所述候选区域截取模块703，设置为在指尖稳定态确定模块的输出结果为手指指尖处于指尖稳定状态时，以所述指尖位置为基准，截取所述视频帧图像中的候选区域。The candidate area interception module 703 is configured to intercept the candidate area in the video frame image based on the fingertip position when the output result of the fingertip stable-state determination module is that the fingertip is in the fingertip stable state.
一些实施例中，指尖稳定态确定模块，在所述指尖位置与前预设数目的相邻视频帧图像中的手指指尖位置的偏移量小于预设偏移量时，确定所述手指指尖处于指尖稳定状态。In some embodiments, the fingertip stable-state determination module determines that the fingertip is in the fingertip stable state when the offset between the fingertip position and the fingertip positions in a preceding preset number of adjacent video frame images is less than a preset offset.
一些实施例中，所述指尖位置检测模块702，通过训练获得的指尖识别模型检测所述视频帧图像中的手指指尖；在所述指尖识别模型检测到手指指尖时，输出检测到的手指指尖的位置的坐标，获得所述手指指尖的指尖位置；在所述指尖识别模型未检测到手指指尖时，输出预设回归位置坐标，所述预设回归位置坐标为不属于所述视频帧图像的坐标范围的坐标。In some embodiments, the fingertip position detection module 702 detects the fingertip in the video frame image through a fingertip recognition model obtained by training; when the fingertip recognition model detects a fingertip, it outputs the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip; when the fingertip recognition model does not detect a fingertip, it outputs preset regression position coordinates, where the preset regression position coordinates are coordinates that do not belong to the coordinate range of the video frame image.
一些实施例中，所述候选区域截取模块703，以所述指尖位置为中心点，向第一坐标轴的两侧分别扩展第一数目像素，向第二坐标轴的两侧分别扩展第二数目像素；根据扩展后的各指尖位置形成的区域确定所述视频帧图像中的候选区域。In some embodiments, the candidate area interception module 703 takes the fingertip position as the center point, extends a first number of pixels to both sides of the first coordinate axis, and extends a second number of pixels to both sides of the second coordinate axis; the candidate area in the video frame image is determined according to the area formed by the extended fingertip positions.
一些实施例中，所述候选区域截取模块703，设置为当扩展后的指尖位置位于所述视频帧图像的边界外时，将扩展后位于所述视频帧图像的边界外的扩展后的指尖位置对应的边界、与其他扩展后的指尖位置形成的区域确定为所述视频帧图像中的候选区域。In some embodiments, the candidate area interception module 703 is configured to, when an extended fingertip position lies outside the boundary of the video frame image, determine, as the candidate area in the video frame image, the area formed by the boundary corresponding to the extended fingertip position lying outside the boundary of the video frame image together with the other extended fingertip positions.
一些实施例中，指尖回归和文本区域检测模块704，设置为通过训练获得的指尖回归与文本检测综合模型对所述候选区域进行指尖定位和文本检测，获得所述候选区域中的文本区域。In some embodiments, the fingertip regression and text area detection module 704 is configured to perform fingertip positioning and text detection on the candidate area through a trained fingertip regression and text detection comprehensive model, to obtain the text area in the candidate area.
一些实施例中，还包括：指尖回归与文本检测综合模型训练模块，设置为获取待训练图像样本，以及与所述待训练图像样本对应的目标文本区域以及目标指尖位置；通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测，获得检测到的样本文本区域以及样本指尖位置；基于所述待训练图像样本对应的样本文本区域以及目标文本区域，计算确定文本区域损失；基于所述待训练图像样本对应的样本指尖位置以及目标指尖位置，计算确定指尖定位损失；结合所述文本区域损失和指尖定位损失，确定模型损失；基于所述模型损失调整所述待训练指尖回归与文本检测综合模型，返回通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测的步骤，直至达到模型训练结束条件。In some embodiments, the apparatus further includes: a fingertip regression and text detection comprehensive model training module, configured to obtain image samples to be trained, as well as target text areas and target fingertip positions corresponding to the image samples to be trained; perform fingertip positioning and text detection on the image samples to be trained through the fingertip regression and text detection comprehensive model to be trained, to obtain detected sample text areas and sample fingertip positions; calculate the text area loss based on the sample text area and the target text area corresponding to each image sample to be trained; calculate the fingertip localization loss based on the sample fingertip position and the target fingertip position corresponding to each image sample to be trained; determine the model loss by combining the text area loss and the fingertip localization loss; and adjust the fingertip regression and text detection comprehensive model to be trained based on the model loss, returning to the step of performing fingertip positioning and text detection on the image samples to be trained through the model to be trained, until the model training end condition is reached.
关于文本内容识别装置的具体限定可以参见上文中对于文本内容识别方法的限定,在此不再赘述。上述文本内容识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。For the specific limitation of the text content recognition device, reference may be made to the above limitation on the text content recognition method, which will not be repeated here. Each module in the above-mentioned text content recognition apparatus may be implemented in whole or in part by software, hardware and combinations thereof. The above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
在一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图8所示。该计算机设备包括通过系统总线连接的处理器、存储器、通信接口、显示屏和输入装置。其中，该计算机设备的处理器设置为提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的通信接口设置为与外部的终端进行有线或无线方式的通信，无线方式可通过WIFI、运营商网络、近场通信（Near Field Communication，NFC）或其他技术实现。该计算机程序被处理器执行时以实现一种文本内容识别方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 8. The computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the running of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is configured to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The computer program, when executed by the processor, implements a text content recognition method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, or a button, a trackball or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse, etc.
本领域技术人员可以理解，图8中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
在一个实施例中,提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机程序,该处理器执行计算机程序时实现如上所述的任一实施例中的方法。In one embodiment, there is provided a computer device comprising a memory and a processor, a computer program is stored in the memory, and the processor implements the method in any of the above-described embodiments when the processor executes the computer program.
在一个实施例中,提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现如上所述的任一实施例中的方法。In one embodiment, there is provided a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method in any of the above-described embodiments.
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。In one embodiment, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用，均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器（Read-Only Memory，ROM）、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器（Random Access Memory，RAM）或外部高速缓冲存储器。作为说明而非局限，RAM可以是多种形式，比如静态随机存取存储器（Static Random Access Memory，SRAM）或动态随机存取存储器（Dynamic Random Access Memory，DRAM）等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or another medium used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM).
以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope described in this specification.

Claims (11)

  1. 一种文本内容识别方法,包括:A text content recognition method, comprising:
    获取当前采集的视频帧图像;Get the currently captured video frame image;
    检测所述视频帧图像中的手指指尖,并获得所述手指指尖的指尖位置;Detecting the fingertip in the video frame image, and obtaining the fingertip position of the fingertip;
    以所述指尖位置为基准,截取所述视频帧图像中的候选区域;Taking the position of the fingertip as a benchmark, intercepting the candidate area in the video frame image;
    通过对所述候选区域进行指尖回归定位,对所述候选区域进行文本检测,获得所述候选区域中的文本区域;By performing fingertip regression positioning on the candidate region, and performing text detection on the candidate region, the text region in the candidate region is obtained;
    识别所述文本区域中的文本内容。Text content in the text area is identified.
  2. 根据权利要求1所述的方法，在获得所述手指指尖的指尖位置之后，以所述指尖位置为基准，截取所述视频帧图像中的候选区域之前，还包括：The method according to claim 1, after obtaining the fingertip position of the fingertip and before intercepting, with the fingertip position as a reference, the candidate region in the video frame image, further comprising:
    基于所述指尖位置,确定所述手指指尖是否处于指尖稳定状态;determining, based on the fingertip position, whether the fingertip is in a fingertip stable state;
    在所述手指指尖处于指尖稳定状态时,进入以所述指尖位置为基准,截取所述视频帧图像中的候选区域的步骤;When the fingertip is in a stable state of the fingertip, enter the step of intercepting the candidate region in the video frame image based on the fingertip position;
    在所述手指指尖不是处于指尖稳定状态时,返回检测所述视频帧图像中的手指指尖的步骤。When the fingertip is not in the steady state of the fingertip, return to the step of detecting the fingertip in the video frame image.
  3. 根据权利要求2所述的方法,其中,The method of claim 2, wherein,
    在所述指尖位置与前预设数目的相邻视频帧图像中的手指指尖位置的偏移量小于预设偏移量时,确定所述手指指尖处于指尖稳定状态。When the offset between the position of the fingertip and the position of the fingertip in the previous preset number of adjacent video frame images is less than a preset offset, it is determined that the fingertip is in a stable fingertip state.
  4. 根据权利要求1所述的方法,其中,检测所述视频帧图像中的手指指尖,获得所述手指指尖的指尖位置,包括:The method according to claim 1, wherein detecting the fingertip in the video frame image to obtain the fingertip position of the finger, comprising:
    通过训练获得的指尖识别模型检测所述视频帧图像中的手指指尖;The fingertip recognition model obtained by training detects the fingertips in the video frame images;
    在所述指尖识别模型检测到手指指尖时,输出检测到的手指指尖的位置的坐标,获得所述手指指尖的指尖位置;When the fingertip recognition model detects the fingertip, output the coordinates of the position of the detected fingertip to obtain the fingertip position of the fingertip;
    在所述指尖识别模型未检测到手指指尖时,输出预设回归位置坐标,所述预设回归位置坐标为不属于所述视频帧图像的坐标范围的坐标。When the fingertip recognition model does not detect the fingertip, output preset return position coordinates, where the preset return position coordinates are coordinates that do not belong to the coordinate range of the video frame image.
  5. 根据权利要求1所述的方法,其中,以所述指尖位置为基准,截取所述视频帧图像中的候选区域,包括:The method according to claim 1, wherein, taking the fingertip position as a reference, intercepting a candidate region in the video frame image, comprising:
    以所述指尖位置为中心点,向第一坐标轴的两侧分别扩展第一数目像素,向第二坐标轴的两侧分别扩展第二数目像素;Taking the position of the fingertip as the center point, the first number of pixels are respectively extended to both sides of the first coordinate axis, and the second number of pixels are respectively extended to both sides of the second coordinate axis;
    根据扩展后的各指尖位置形成的区域确定所述视频帧图像中的候选区域。The candidate regions in the video frame image are determined according to the regions formed by the expanded positions of the fingertips.
  6. 根据权利要求5所述的方法,其中,根据扩展后的各指尖位置形成的区域确定所述视频帧图像中的候选区域,包括:The method according to claim 5, wherein determining the candidate region in the video frame image according to the region formed by the expanded positions of the fingertips comprises:
    当扩展后的指尖位置位于所述视频帧图像的边界外时，将扩展后位于所述视频帧图像的边界外的扩展后的指尖位置对应的边界、与其他扩展后的指尖位置形成的区域确定为所述视频帧图像中的候选区域。when an extended fingertip position lies outside the boundary of the video frame image, determining, as the candidate region in the video frame image, the region formed by the boundary corresponding to the extended fingertip position lying outside the boundary of the video frame image together with the other extended fingertip positions.
  7. 根据权利要求1所述的方法,其中,通过对所述候选区域进行指尖回归定位,对所述候选区域进行文本检测,获得所述候选区域中的文本区域,包括:The method according to claim 1, wherein, performing text detection on the candidate region by performing fingertip regression positioning on the candidate region to obtain the text region in the candidate region, comprising:
    通过训练获得的指尖回归与文本检测综合模型对所述候选区域进行指尖定位和文本检测,获得所述候选区域中的文本区域。Fingertip localization and text detection are performed on the candidate region by the comprehensive model of fingertip regression and text detection obtained by training, and the text region in the candidate region is obtained.
  8. 根据权利要求7所述的方法,其中,训练获得所述指尖回归与文本检测综合模型的方式包括:The method according to claim 7, wherein the manner of obtaining the comprehensive model of fingertip regression and text detection by training comprises:
    获取待训练图像样本,以及与所述待训练图像样本对应的目标文本区域以及目标指尖位置;Obtaining image samples to be trained, as well as the target text area and target fingertip positions corresponding to the image samples to be trained;
    通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测,获得检测到的样本文本区域以及样本指尖位置;Perform fingertip positioning and text detection on the to-be-trained image sample by using the to-be-trained fingertip regression and text detection integrated model to obtain the detected sample text area and the sample fingertip position;
    基于所述待训练图像样本对应的样本文本区域以及目标文本区域,计算确定文本区域损失;Calculate and determine the loss of the text region based on the sample text region and the target text region corresponding to the image sample to be trained;
    基于所述待训练图像样本对应的样本指尖位置以及目标指尖位置,计算确定指尖定位损失;Calculate and determine the fingertip positioning loss based on the sample fingertip position corresponding to the image sample to be trained and the target fingertip position;
    结合所述文本区域损失和指尖定位损失,确定模型损失;Combine the text region loss and fingertip localization loss to determine the model loss;
    基于所述模型损失调整所述待训练指尖回归与文本检测综合模型，返回通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测的步骤，直至达到模型训练结束条件。adjusting the fingertip regression and text detection comprehensive model to be trained based on the model loss, and returning to the step of performing fingertip positioning and text detection on the image samples to be trained through the fingertip regression and text detection comprehensive model to be trained, until the model training end condition is reached.
  9. A text content recognition apparatus, comprising:
    an image acquisition module, configured to obtain a currently captured video frame image;
    a fingertip position detection module, configured to detect a fingertip in the video frame image and obtain the fingertip position of the fingertip;
    a candidate region cropping module, configured to crop a candidate region from the video frame image with the fingertip position as a reference;
    a fingertip regression and text region detection module, configured to perform fingertip regression localization on the candidate region and text detection on the candidate region, to obtain a text region in the candidate region;
    a content recognition module, configured to recognize the text content in the text region.
  10. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the method of any one of claims 1 to 8.
  11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
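The apparatus of claim 9 decomposes into a simple pipeline: detect the fingertip in a frame, crop a candidate region around it, then detect and recognize the text. A hedged sketch with stub detectors; the brightest-pixel fingertip heuristic and the fixed crop half-size are placeholder assumptions standing in for the detection networks the claims describe:

```python
import numpy as np

def detect_fingertip(frame):
    """Stub fingertip detector: takes the brightest pixel as the fingertip.
    A real system would use the fingertip-detection model of the claims."""
    r, c = np.unravel_index(np.argmax(frame), frame.shape)
    return int(r), int(c)

def crop_candidate(frame, tip, half=2):
    """Crop a candidate region around the fingertip, clamped to the frame."""
    r, c = tip
    r0, r1 = max(r - half, 0), min(r + half + 1, frame.shape[0])
    c0, c1 = max(c - half, 0), min(c + half + 1, frame.shape[1])
    return frame[r0:r1, c0:c1]

def recognize_pipeline(frame):
    """Pipeline of claim 9: acquire -> locate fingertip -> crop candidate
    region; fingertip regression, text detection, and OCR would follow."""
    tip = detect_fingertip(frame)
    region = crop_candidate(frame, tip)
    return tip, region

# Example: a 10x10 grayscale frame with one bright "fingertip" pixel.
frame = np.zeros((10, 10))
frame[3, 7] = 1.0
tip, region = recognize_pipeline(frame)
```

Cropping around the fingertip before running the heavier joint model is what keeps the per-frame cost low: the detector only sees a small window instead of the full video frame.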
PCT/CN2022/082690 2021-03-29 2022-03-24 Method and apparatus for text content recognition, computer device, and storage medium WO2022206534A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110336251.7A CN115131693A (en) 2021-03-29 2021-03-29 Text content identification method and device, computer equipment and storage medium
CN202110336251.7 2021-03-29

Publications (1)

Publication Number Publication Date
WO2022206534A1 true WO2022206534A1 (en) 2022-10-06

Family

ID=83375572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/082690 WO2022206534A1 (en) 2021-03-29 2022-03-24 Method and apparatus for text content recognition, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN115131693A (en)
WO (1) WO2022206534A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070211071A1 (en) * 2005-12-20 2007-09-13 Benjamin Slotznick Method and apparatus for interacting with a visually displayed document on a screen reader
CN109325464A (en) * 2018-10-16 2019-02-12 上海翎腾智能科技有限公司 A kind of finger point reading character recognition method and interpretation method based on artificial intelligence
CN111078083A (en) * 2019-06-09 2020-04-28 广东小天才科技有限公司 Method for determining click-to-read content and electronic equipment
CN110443231A (en) * 2019-09-05 2019-11-12 湖南神通智能股份有限公司 A kind of fingers of single hand point reading character recognition method and system based on artificial intelligence
CN111242109A (en) * 2020-04-26 2020-06-05 北京金山数字娱乐科技有限公司 Method and device for manually fetching words

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909342A (en) * 2023-01-03 2023-04-04 湖北瑞云智联科技有限公司 Image mark recognition system and method based on contact point motion track
CN116939292A (en) * 2023-09-15 2023-10-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment
CN116939292B (en) * 2023-09-15 2023-11-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment

Also Published As

Publication number Publication date
CN115131693A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN110232311B (en) Method and device for segmenting hand image and computer equipment
US9697416B2 (en) Object detection using cascaded convolutional neural networks
US10943106B2 (en) Recognizing text in image data
WO2022206534A1 (en) Method and apparatus for text content recognition, computer device, and storage medium
CN109685055B (en) Method and device for detecting text area in image
US9052755B2 (en) Overlapped handwriting input method
CN110136198B (en) Image processing method, apparatus, device and storage medium thereof
WO2019041519A1 (en) Target tracking device and method, and computer-readable storage medium
US10847073B2 (en) Image display optimization method and apparatus
JP2021524951A (en) Methods, devices, devices and computer readable storage media for identifying aerial handwriting
US8917957B2 (en) Apparatus for adding data to editing target data and displaying data
CN110598559B (en) Method and device for detecting motion direction, computer equipment and storage medium
CN109033935B (en) Head-up line detection method and device
CN111160288A (en) Gesture key point detection method and device, computer equipment and storage medium
WO2022105569A1 (en) Page direction recognition method and apparatus, and device and computer-readable storage medium
CN111309618A (en) Page element positioning method, page testing method and related device
JP7429307B2 (en) Character string recognition method, device, equipment and medium based on computer vision
EP4030749A1 (en) Image photographing method and apparatus
US10067926B2 (en) Image processing system and methods for identifying table captions for an electronic fillable form
KR102440198B1 (en) VIDEO SEARCH METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
CN110431563B (en) Method and device for correcting image
CN110245570B (en) Scanned text segmentation method and device, computer equipment and storage medium
CN109190615A (en) Nearly word form identification decision method, apparatus, computer equipment and storage medium
CN111160265B (en) File conversion method and device, storage medium and electronic equipment
US11024305B2 (en) Systems and methods for using image searching with voice recognition commands

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778717

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22778717

Country of ref document: EP

Kind code of ref document: A1