WO2022206534A1 - Method and apparatus for text content recognition, computer device, and storage medium - Google Patents


Info

Publication number: WO2022206534A1
Application number: PCT/CN2022/082690
Authority: WIPO (PCT)
Prior art keywords: fingertip, text, video frame, frame image, area
Other languages: French (fr), Chinese (zh)
Inventor: 林建民
Original assignees: 广州视源电子科技股份有限公司; 广州视源人工智能创新研究院有限公司
Application filed by 广州视源电子科技股份有限公司 and 广州视源人工智能创新研究院有限公司
Publication of WO2022206534A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G06F40/279: Recognition of textual entities
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/19: Recognition using electronic means

Definitions

  • the present application relates to the field of computer technology, for example, to a text content recognition method, apparatus, computer device and storage medium.
  • Recently, fingertip word-lookup technology based on desktop learning devices has emerged. Specifically, the user points a finger at an unknown target word, and the desktop learning device captures, through its camera, an image of the finger pointing at the target word. The device recognizes the text content the finger points to, performs a word query on the recognized text to obtain the annotation content of the target word in a dictionary, and then displays the queried target word and its annotation content on the display screen of the desktop learning device. The user therefore only needs to point at the target word to obtain its annotation content. In this process, the accuracy of target-word recognition is an important factor affecting the word query result, but with traditional approaches the accuracy of target-text recognition is not high.
  • The present application provides a text content recognition method, apparatus, computer device and storage medium capable of improving the accuracy of text recognition for fingertip word lookup.
  • In a first aspect, the present application provides a text content recognition method, the method comprising:
  • acquiring a currently collected video frame image;
  • detecting a fingertip in the video frame image, and obtaining the fingertip position of the fingertip;
  • intercepting a candidate region in the video frame image based on the fingertip position;
  • performing fingertip regression positioning on the candidate region and performing text detection on the candidate region to obtain a text region in the candidate region; and
  • identifying the text content in the text region.
  • In a second aspect, the present application further provides a text content recognition apparatus, the apparatus comprising:
  • an image acquisition module configured to acquire the currently collected video frame image;
  • a fingertip position detection module configured to detect the fingertip in the video frame image and obtain the fingertip position of the fingertip;
  • a candidate region interception module configured to intercept a candidate region in the video frame image based on the fingertip position;
  • a fingertip regression and text area detection module configured to perform fingertip regression positioning on the candidate area and perform text detection on the candidate area to obtain a text area in the candidate area; and
  • a content recognition module configured to recognize the text content in the text area.
  • the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect when executing the computer program.
  • the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the method described in the first aspect is implemented.
  • FIG. 1 is an application environment diagram of the text content recognition method in one embodiment;
  • FIG. 2 is a schematic flowchart of a text content recognition method in one embodiment
  • FIG. 3 is a schematic flowchart of a text content recognition method in an application example
  • FIG. 4 is a schematic diagram of an interception candidate region in an application example
  • FIG. 5 is a schematic diagram of an interception candidate region in another application example
  • FIG. 6 is a schematic diagram of fingertip regression and text detection in an application example
  • FIG. 7 is a structural block diagram of a text content recognition device in one embodiment
  • FIG. 8 is an internal structural diagram of a computer device in one embodiment.
  • the text content recognition method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the computer device 20 is placed on the desktop 10 , and the computer device 20 carries or is externally connected with a camera device 30 .
  • The user places the text material 40 to be finger-read on the desktop 10 and points a finger at the text content to be recognized or translated; the camera device 30 captures this scene to obtain a video frame image, and the computer device 20 acquires the video frame image and processes it to identify the text content the user's finger points to.
  • The computer device 20, carrying its own camera device 30 or externally connected to one, can also be fixed or placed in other positions.
  • In use, the user holds the text material 40 with one hand, or fixes it with another device, within the shooting range of the camera device 30, and points a finger at the text content to be recognized or translated; the camera device 30 captures the scene to obtain a video frame image, and the computer device 20 acquires and processes the video frame image to identify the text content the user's finger points to.
  • the computer device 20 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
  • a text content recognition method is provided, which is described by taking the method applied to the computer device 20 in FIG. 1 as an example, including the following steps S201 to S205 .
  • Step S201 Acquire a currently collected video frame image.
  • the video frame image can be obtained by capturing video with a camera device carried by the computer device or a camera device externally connected to the computer device.
  • Step S202 Detect the fingertip in the video frame image, and obtain the position of the fingertip of the finger.
  • the fingertip position may be the position of the detected fingertip in the video frame image.
  • In some embodiments, the fingertip position may be obtained by detecting the video frame image with a pre-trained fingertip recognition model, which may specifically include:
  • detecting the fingertip in the video frame image through the trained fingertip recognition model;
  • when the fingertip recognition model detects the fingertip, outputting the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip;
  • when the fingertip recognition model does not detect a fingertip, outputting preset regression position coordinates.
  • When a fingertip is present in the video frame image, the fingertip recognition model can directly output the coordinates of the detected fingertip position, obtaining the fingertip position of the finger. When no fingertip is present, the model outputs the preset regression position coordinates instead, which avoids the logical error of the model having no output when no fingertip is recognized, thereby ensuring both the accuracy and the availability of the fingertip recognition model's recognition results.
  • In some embodiments, the fingertip recognition model may be trained in the following manner.
  • Fingertip sample images are acquired. Some fingertip sample images contain a fingertip, and the actual fingertip position is recorded as the target fingertip position of that sample image. Other fingertip sample images contain no fingertip, and their target fingertip position is set to the preset regression position coordinates.
  • The fingertip sample images are input into the fingertip recognition model to be trained, which processes each input sample image to obtain a sample fingertip position as the training result.
  • The sample fingertip position of each fingertip sample image is compared with the target fingertip position of that sample image, and the model training error is calculated according to the comparison results.
  • The function for calculating the model training error may be any possible error function, such as mean square error, which is not specifically limited in the embodiments of the present application.
  • When the model training error meets the error requirement and the number of training iterations reaches the preset number, the training end condition is met, and the fingertip recognition model from the last training iteration is taken as the trained fingertip recognition model. Otherwise, the training end condition is not met; the model parameters of the fingertip recognition model to be trained are adjusted, and the process returns to the step of inputting the fingertip sample images into the model, until the training end condition is reached.
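The target construction and training error just described can be sketched as follows. `make_target` and `mse` are hypothetical helper names introduced only for illustration; the sentinel coordinate (-1, -1) follows the preset regression position coordinate given later in this document.

```python
SENTINEL = (-1.0, -1.0)  # preset regression position: outside the camera's range

def make_target(fingertip_xy):
    """Regression target for one fingertip sample image.

    Samples containing a fingertip use its recorded position; samples
    with no fingertip use the preset sentinel coordinates, so the model
    always has a well-defined target.
    """
    return fingertip_xy if fingertip_xy is not None else SENTINEL

def mse(pred, target):
    """Mean square error between a predicted and a target position."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```

For example, a sample with no fingertip whose prediction matches the sentinel contributes zero error, while a fingertip recorded at (3, 4) but predicted at (0, 0) contributes an error of 12.5.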
  • In some embodiments, after the fingertip position is obtained, the method may further include: determining, based on the fingertip position, whether the fingertip is in a stable state, and entering the subsequent processing flow only when it is. This avoids the resource consumption of entering the subsequent flow when a fingertip is falsely detected and the recognition accuracy would be low.
  • The preset offset may be set in combination with actual technical needs. In some embodiments, the preset offset may be set based on the resolution of the video capture device (i.e., the above-mentioned camera device) that captures the video frame image; for example, the preset offset is positively correlated with that resolution.
  • In some embodiments, the preset offset may be set to 10 pixels.
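The stability judgment described above can be sketched as a per-axis offset check against the preceding frames. The function name and the choice of a per-axis (rather than Euclidean) offset are assumptions for illustration.

```python
def is_fingertip_stable(history, current, max_offset=10, window=3):
    """Return True when the fingertip has barely moved.

    `history` holds (x, y) fingertip positions from preceding frames.
    The fingertip counts as stable when its offset relative to each of
    the last `window` frames stays within `max_offset` pixels on both
    axes (10 pixels in the example above).
    """
    recent = history[-window:]
    if len(recent) < window:        # not enough frames observed yet
        return False
    cx, cy = current
    return all(abs(cx - x) <= max_offset and abs(cy - y) <= max_offset
               for x, y in recent)
```

A larger `max_offset` (e.g. 15) tolerates more hand movement; a smaller one (e.g. 5) demands a steadier finger, matching the trade-off discussed later in this document.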
  • Step S203 Using the fingertip position as a reference, intercept a candidate region in the video frame image.
  • the fingertip position obtained above is used as a reference, and various possible ways can be used to cut out the candidate region in the video frame image.
  • In some embodiments, intercepting a candidate region in the video frame image with the fingertip position as a reference includes:
  • taking the fingertip position as the center point, extending a first number of pixels to both sides along a first coordinate axis, and extending a second number of pixels to both sides along a second coordinate axis; and
  • determining the region formed by the expanded fingertip position as the candidate region in the video frame image.
  • The first coordinate axis may be the axis along one side of the video frame image, and the second coordinate axis may be the axis along another side of the video frame image.
  • The first number of pixels and the second number of pixels may be set to be the same or to be different.
  • The specific values of the first number of pixels and the second number of pixels can be determined in combination with the placement of the computer device 20 shown in FIG. 1 and the resolution and mounting height of the camera device 30. The computer device 20 has some common usage scenarios, such as being placed on a study desk or fixed at a specific position, and it usually recognizes text content in books, textbooks, or other printed material with conventional print sizes. The first and second numbers of pixels can therefore be determined from these common usage scenarios together with the resolution of the camera device 30 (carried by or externally connected to the computer device 20) and its vertical distance from the text content the finger points to. In other embodiments, the first number of pixels and the second number of pixels input by the user may also be obtained.
  • In some embodiments, determining the region formed by the expanded fingertip position as the candidate region in the video frame image includes: when the expanded fingertip position lies outside the boundary of the video frame image, determining the region formed by the boundary of the video frame image and the other expanded fingertip positions as the candidate region in the video frame image.
  • In this way, when an expanded boundary falls outside the image, the image boundary is used in its place as the basis for determining the candidate region, improving the accuracy of the determined candidate region.
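The interception rule above, including the border-clamping case, can be sketched as follows; the function name and the (left, top, right, bottom) return convention are illustrative assumptions.

```python
def intercept_candidate_region(frame_w, frame_h, tip, n1, n2):
    """Candidate region around the fingertip.

    Expands `n1` pixels to both sides along the first (x) axis and
    `n2` pixels along the second (y) axis, substituting the image
    border for any expanded edge that would fall outside the frame.
    Returns (left, top, right, bottom).
    """
    x, y = tip
    left   = max(0, x - n1)
    right  = min(frame_w, x + n1)
    top    = max(0, y - n2)
    bottom = min(frame_h, y + n2)
    return left, top, right, bottom
```

A fingertip near the center yields a full box centered on it; a fingertip near an edge yields a box clipped to the frame, matching the FIG. 5 case discussed later.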
  • Step S204 By performing fingertip regression positioning on the candidate region, and performing text detection on the candidate region, a text region in the candidate region is obtained.
  • In this step, the candidate region is further refined by fingertip regression positioning to obtain a more accurate fingertip position, and the text region in the candidate region is determined accordingly, improving the accuracy of the obtained text region.
  • the text region in the candidate region is obtained, including:
  • Fingertip localization and text detection are performed on the candidate region by the comprehensive model of fingertip regression and text detection obtained by training, and the text region in the candidate region is obtained.
  • The comprehensive model of fingertip regression and text detection may be a single model structure that performs fingertip positioning and text region detection at the same time, with both tasks adjusted jointly during training.
  • the method for obtaining the comprehensive model of fingertip regression and text detection by training includes the following steps S2041 to S2046.
  • Step S2041 Acquire the image samples to be trained, the target text area and the target fingertip position corresponding to the image samples to be trained.
  • The image sample to be trained may be an image containing a fingertip and the text content the fingertip points to. An image whose fingertip position and text region have already been identified may be used directly as an image sample to be trained, or an image sample may be used after its fingertip position and text region have been manually annotated.
  • the to-be-trained image samples may also be obtained in other ways, as long as the to-be-trained image samples have clearly corresponding target text regions and target fingertip positions.
  • Step S2042 Perform fingertip positioning and text detection on the to-be-trained image sample by using the to-be-trained fingertip regression and text detection integrated model to obtain the detected sample text area and the sample fingertip position.
  • Step S2043 Calculate and determine the loss of the text region based on the sample text region corresponding to the image sample to be trained and the target text region.
  • the text area loss is a parameter used to characterize the difference between the sample text area of the image sample to be trained and the target text area, and is used to reflect the loss of the text area of the image sample to be trained during the detection process.
  • The text area loss can be calculated in various possible ways. For example, in some embodiments, the difference between the target text area and the sample text area can be used as the text area loss; in other embodiments, other methods based on the target and sample text areas, such as mean square error, can also be used.
  • Step S2044 Calculate and determine the fingertip positioning loss based on the sample fingertip positions corresponding to the image samples to be trained and the target fingertip positions.
  • Fingertip positioning loss is a parameter used to characterize the difference between the sample fingertip position of the image sample to be trained and the target fingertip position, and is used to reflect the loss of fingertip positioning of the image sample to be trained during the detection process.
  • The fingertip localization loss can be calculated in various possible ways. For example, in some embodiments, the difference between the target fingertip position and the sample fingertip position can be used as the fingertip localization loss; in other embodiments, other methods based on the target and sample fingertip positions, such as mean square error, can also be used.
  • Step S2045 Determine the model loss by combining the text area loss and the fingertip localization loss.
  • the model loss can be comprehensively determined by combining the text area loss and the fingertip localization loss.
  • the sum of the text area loss and the fingertip localization loss of each image sample to be trained may be used as the model loss.
  • In other embodiments, the text area loss and fingertip localization loss of each image sample to be trained may be weighted and summed, and the resulting weighted sum taken as the model loss.
  • the model loss may also be determined by combining the text area loss and fingertip location loss of each image sample to be trained in other ways.
  • When the model training end condition is reached, the comprehensive model of fingertip regression and text detection from the last training iteration is taken as the trained comprehensive model of fingertip regression and text detection.
  • In some embodiments, the final model can be obtained by removing the fingertip-positioning output from the comprehensive model of the last training iteration, so that only the text region is output in actual use.
  • the model training end condition may be set according to actual technical requirements. In some embodiments, it may be determined that the model training end condition is reached when the model loss is less than or equal to a preset loss amount and reaches a preset number of model iterations. In other embodiments, other manners may also be used to determine the model training end condition.
  • When the model training end condition is not reached, step S2046 is entered.
  • Step S2046 Adjust the comprehensive model of fingertip regression and text detection to be trained based on the model loss, and return to the step of performing fingertip positioning and text detection on the image sample to be trained by using the comprehensive model of fingertip regression and text detection to be trained , until the model training end condition is reached.
  • model parameters related to fingertip positioning can be adjusted, and model parameters related to text area detection can also be adjusted.
  • the text area loss and the fingertip localization loss may be combined to determine how to adjust the model parameters.
  • other methods may also be used to determine the adjustment strategy for adjusting the model parameters.
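One concrete way to realise the combined objective of steps S2043 to S2045 is a (possibly weighted) sum of the two losses. Squared error stands in here for the unspecified per-task loss functions, and all names are illustrative.

```python
def squared_error(pred, target):
    """Squared error between two coordinate tuples (box or point)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def model_loss(sample_boxes, target_boxes, sample_tips, target_tips,
               w_text=1.0, w_tip=1.0):
    """Multiple-loss objective: weighted sum of the text-region loss
    and the fingertip-localization loss over a batch of samples."""
    text_loss = sum(squared_error(s, t)
                    for s, t in zip(sample_boxes, target_boxes))
    tip_loss = sum(squared_error(s, t)
                   for s, t in zip(sample_tips, target_tips))
    return w_text * text_loss + w_tip * tip_loss
```

Setting `w_text` and `w_tip` to 1.0 gives the plain sum variant; other weights give the weighted-sum variant mentioned above.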
  • Step S205 Identify the text content in the text area.
  • the text content recognition may be performed on the text area to obtain the text content in the text area.
  • Various possible methods can be used to recognize the text content in the text area; for example, OCR (Optical Character Recognition) can be used.
  • In some embodiments, other methods may also be used to recognize the text content in the text area, which is not specifically limited in the embodiments of the present application.
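A minimal sketch of step S205: crop the detected text area out of the frame and hand it to an OCR backend. The `ocr` callable is an assumption (in a real system it could be, for instance, `pytesseract.image_to_string`); a toy frame of character rows is used here only to keep the example self-contained.

```python
def recognize_text(frame, text_box, ocr):
    """Crop `text_box` = (left, top, right, bottom) from `frame`
    (a row-major image, here a list of strings) and run OCR on it."""
    left, top, right, bottom = text_box
    crop = [row[left:right] for row in frame[top:bottom]]
    return ocr(crop)

# Toy OCR backend: join the cropped rows into a string.
frame = ["..........",
         "..hello...",
         ".........."]
word = recognize_text(frame, (2, 1, 7, 2), lambda rows: "".join(rows))
# word == "hello"
```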
  • According to the above text content recognition method, the candidate area is intercepted based on the detected fingertip position, further fingertip regression positioning is performed on the candidate area, and text detection is performed on that basis to obtain the text area; the text content in the text area is then recognized, completing the fingertip-positioned text content recognition process.
  • In this process, two rounds of fingertip positioning are combined to finally detect the text area: the candidate area is intercepted through high-speed fingertip positioning, and the text area within it is then detected through further fingertip regression positioning. This improves the accuracy of the detected text content while maintaining the efficiency of text content detection.
  • a desktop video stream is first obtained by shooting with a camera, and the desktop video stream may include multiple frames of video frame images.
  • Then, high-speed fingertip positioning is performed on the desktop video stream, i.e., the fingertip position in each video frame image is identified quickly and with a lightweight model, so that the high-speed fingertip detection and positioning can directly support the subsequent stability judgment and candidate region extraction.
  • For the fingertip recognition model, a CNN model can be used in some embodiments: the feature extraction backbone can be selected according to the characteristics of the business scenario (such as ResNet, MobileNet, etc.), the subsequent regression module uses a dense layer, and the loss uses mean square error (MSE).
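The regression module described above (backbone features into a dense layer, trained with MSE) can be sketched as follows. The backbone itself (ResNet, MobileNet, etc.) is abstracted into a feature vector, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def dense_regression_head(features, W, b):
    """Dense layer mapping backbone features to (x, y) fingertip
    coordinates, as in the regression module described above."""
    return features @ W + b

def mse_loss(pred, target):
    """Mean square error, the loss named in the application example."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```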
  • For samples in which no fingertip appears, the regression target output by the fingertip recognition model is a preset regression position coordinate, which can be set to a coordinate outside the shooting range of the camera device, such as (-1, -1).
  • For samples in which a fingertip appears, the regression target is its real coordinates, i.e., the position coordinates of the fingertip recognized by the fingertip recognition model (for example (123, 56)), which gives the fingertip position of the finger.
  • After the fingertip position is obtained, the stability of the fingertip is further judged, and the next step is performed only when the fingertip is stable.
  • Whether the fingertip is stable can be determined by calculating the coordinate change of the fingertip position across adjacent video frames. For example, if the coordinate offset of the fingertip position in a given frame relative to each of a preceding preset number of adjacent video frames (for example, the previous 3 or more frames) is less than a preset offset, such as 10 pixels (this value can be adjusted in combination with the resolution of the camera device or other factors), the fingertip is considered stable.
  • The preset offset may be set in combination with the required stability accuracy. If the stability requirement is relatively low, the preset offset can be set relatively large, such as 15 pixels: as long as the coordinate offsets relative to the preceding preset number of adjacent video frames all fall within 15 pixels, the fingertip is considered stable. For the user, this allows movement within a relatively large range, which helps improve the user experience.
  • If the stability requirement is relatively high, the preset offset can be set relatively small, such as 5 pixels: only when the coordinate offsets all fall within 5 pixels is the fingertip considered stable, so the user must keep the finger within a relatively small range. In an actual technical scenario, the camera resolution, user experience, and stability accuracy can be weighed together.
  • With the fingertip in a stable state, the candidate region is intercepted by extending the first number of pixels and the second number of pixels (for example, 100 pixels, which can be determined according to the resolution of the capture device) around the fingertip position. In one embodiment, as shown in FIG. 4, the fingertip position 51 is located at the center of the interception area 52.
  • When the region expanded outward from the fingertip position 51 exceeds the boundary of the video frame image, the interception area is determined directly based on the image boundary. As shown by the dotted box in FIG. 5, the downward-expanded boundary 61, the left-expanded boundary 62, the upward-expanded boundary 63, and the boundary 60 of the video frame image jointly determine the candidate region for interception.
  • In other embodiments, the interception area may also be determined in other manners.
  • the integrated model of fingertip regression and text detection is further used to determine the text regions.
  • The extraction of text content is related to both fingertip positioning and text detection, and the coordinates of the fingertip and the coordinates of the text content are highly correlated. Therefore, as shown in FIG. 6, the scheme of this embodiment uses a multiple-loss scheme to train a comprehensive model of fingertip regression and text detection, which performs fingertip positioning and text detection and outputs the text region corresponding to the identified fingertip position.
  • the recognized text area is input into the text recognition model for recognition, so as to recognize the specific text content.
  • Taking word lookup as an example, after the text content is identified, a dictionary can be further queried to obtain the definition of the identified text content, and the obtained definition can be displayed on the display screen of the computer device, as shown in FIG. 1.
  • As described above, the solution of this embodiment fully considers the application scenario of desktop learning devices: it first performs fast fingertip positioning, determines the interception area accordingly, then performs more refined fingertip regression positioning and text region detection on the interception area, and finally recognizes the text content. This coarse-to-fine approach both reduces the overall computation and improves the accuracy of the regressed coordinates; in addition, combining fingertip positioning and text detection into one model further improves the positioning accuracy.
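The coarse-to-fine flow summarised above can be glued together as below. Every callable is a stand-in for one of the components described in this document (fast fingertip model, candidate-region interception, joint regression and text-detection model, recognizer, dictionary); all names are illustrative.

```python
def fingertip_word_lookup(frame, detect_tip, intercept, detect_text,
                          recognize, lookup):
    """Coarse-to-fine pipeline: fast fingertip positioning, candidate
    region interception, refined text detection, text recognition,
    and dictionary lookup."""
    tip = detect_tip(frame)
    if tip is None:                 # no (stable) fingertip: do nothing
        return None
    region = intercept(frame, tip)
    box = detect_text(region)
    word = recognize(region, box)
    return word, lookup(word)
```

In a real system `detect_tip` would also fold in the stability judgment, returning None until the fingertip has been steady for the preset number of frames.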
  • a text content recognition apparatus includes:
  • the image acquisition module 701 is configured to acquire the currently collected video frame image;
  • the fingertip position detection module 702 is configured to detect the fingertip in the video frame image, and obtain the fingertip position of the fingertip;
  • the candidate area interception module 703 is configured to intercept the candidate area in the video frame image based on the position of the fingertip;
  • the fingertip regression and text area detection module 704 is configured to perform text detection on the candidate area by performing fingertip regression positioning on the candidate area to obtain the text area in the candidate area;
  • the content recognition module 705 is configured to recognize the text content in the text area.
  • In some embodiments, the device further includes a fingertip stability determination module configured to determine, based on the fingertip position, whether the fingertip is in a stable state.
  • The candidate area interception module 703 is configured to intercept the candidate area in the video frame image based on the fingertip position when the fingertip stability determination module determines that the fingertip is in the stable state.
  • The fingertip stability determination module is configured to determine that the fingertip is in the stable state when the offset between the fingertip position and the fingertip positions in the preceding preset number of adjacent video frame images is less than the preset offset.
  • The fingertip position detection module 702 is configured to detect the fingertip in the video frame image through the trained fingertip recognition model; when the fingertip recognition model detects the fingertip, output the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip; and when the fingertip recognition model does not detect a fingertip, output the preset regression position coordinates, where the preset regression position coordinates do not belong to the coordinate range of the video frame image.
  • The candidate area interception module 703 is configured to, taking the fingertip position as the center point, expand the first number of pixels to both sides of the first coordinate axis and the second number of pixels to both sides of the second coordinate axis, and determine the candidate region in the video frame image according to the region formed by the expanded fingertip position.
  • The candidate area interception module 703 is configured to, when the expanded fingertip position is located outside the boundary of the video frame image, determine the region formed by the boundary of the video frame image and the other expanded fingertip positions as the candidate region in the video frame image.
  • the fingertip regression and text area detection module 704 is configured to perform fingertip positioning and text detection on the candidate area through the comprehensive model of fingertip regression and text detection obtained by training, and obtain the text in the candidate area. area.
  • In some embodiments, the apparatus further includes a training module for the comprehensive model of fingertip regression and text detection, configured to: acquire image samples to be trained, together with the target text area and target fingertip position corresponding to each image sample; perform fingertip positioning and text detection on the image samples with the comprehensive model of fingertip regression and text detection to be trained, obtaining the detected sample text area and sample fingertip position; calculate the text area loss based on the sample text area and target text area corresponding to each image sample; calculate the fingertip localization loss based on the sample fingertip position and target fingertip position; combine the text area loss and the fingertip localization loss to determine the model loss; and adjust the comprehensive model to be trained based on the model loss, returning to the step of performing fingertip positioning and text detection on the image samples, until the model training end condition is reached.
  • Each module in the above text content recognition apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
  • the above modules may be embedded in, or independent of, the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
  • a computer device is provided; the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus.
  • the processor of the computer device is configured to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is configured to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through Wi-Fi, an operator network, Near Field Communication (NFC), or other technologies.
  • the computer program, when executed by the processor, implements a text content recognition method.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, or a button, a trackball, or a touchpad set on the housing of the computer device, or an external keyboard, trackpad, or mouse.
  • FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device is provided, comprising a memory and a processor, where a computer program is stored in the memory, and the processor implements the method in any of the above-described embodiments when executing the computer program.
  • a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method in any of the above-described embodiments.
  • a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Abstract

A method and apparatus for text content recognition, a computer device, and a storage medium, comprising: acquiring a video frame image currently collected (S201); detecting a fingertip in the video frame image and acquiring the tip position of the fingertip (S202); with the tip position serving as a reference, capturing a candidate area in the video frame image (S203); by means of tip regression positioning performed with respect to the candidate area and text detection performed with respect to the candidate area, acquiring a text area in the candidate area (S204); and recognizing a text content in the text area (S205).

Description

Text content recognition method, apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 202110336251.7, filed with the China Patent Office on March 29, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and relates, for example, to a text content recognition method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, fingertip word-lookup technology based on desktop learning devices has emerged. Specifically, a user points a finger at an unfamiliar target word; the desktop learning device captures, through a camera, a picture of the finger pointing at the target word, recognizes the text content the finger points to, performs a dictionary query on the recognized text content to obtain the annotation of the target word in the dictionary, and then displays the target word obtained by the query and its annotation on the display screen of the desktop learning device. The user thus only needs to point at the target word to obtain its annotation. In this process, the accuracy of recognizing the target word is an important factor affecting the word query result, but in the traditional approach, the accuracy of recognizing the target text is not high.
Summary
The present application provides a text content recognition method, apparatus, computer device, and storage medium capable of improving the accuracy of text recognition for fingertip word lookup.
In a first aspect, the present application provides a text content recognition method, the method comprising:
acquiring a currently collected video frame image;
detecting a fingertip in the video frame image, and obtaining a fingertip position of the fingertip;
taking the fingertip position as a reference, intercepting a candidate region in the video frame image;
performing fingertip regression positioning on the candidate region and performing text detection on the candidate region, to obtain a text region in the candidate region;
recognizing text content in the text region.
In a second aspect, the present application further provides a text content recognition apparatus, the apparatus comprising:
an image acquisition module, configured to acquire a currently collected video frame image;
a fingertip position detection module, configured to detect a fingertip in the video frame image and obtain a fingertip position of the fingertip;
a candidate region intercepting module, configured to intercept a candidate region in the video frame image taking the fingertip position as a reference;
a fingertip regression and text region detection module, configured to perform fingertip regression positioning on the candidate region and perform text detection on the candidate region, to obtain a text region in the candidate region;
a content recognition module, configured to recognize text content in the text region.
In a third aspect, the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method of the first aspect.
Brief Description of the Drawings
FIG. 1 is an application environment diagram of a text content recognition method in one embodiment;
FIG. 2 is a schematic flowchart of a text content recognition method in one embodiment;
FIG. 3 is a schematic flowchart of a text content recognition method in an application example;
FIG. 4 is a schematic diagram of intercepting a candidate region in an application example;
FIG. 5 is a schematic diagram of intercepting a candidate region in another application example;
FIG. 6 is a schematic diagram of fingertip regression and text detection in an application example;
FIG. 7 is a structural block diagram of a text content recognition apparatus in one embodiment;
FIG. 8 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application is described in detail below with reference to the accompanying drawings and embodiments.
The text content recognition method provided by the present application can be applied in the application environment shown in FIG. 1. A computer device 20 is placed on a desktop 10, and the computer device 20 carries its own camera 30 or is externally connected to one. In use, the user places text material 40 to be finger-read on the desktop 10 and points a finger at the text content to be recognized or translated; the camera 30 captures this scene to obtain a video frame image, and the computer device 20 obtains the video frame image and processes it to recognize the text content the user's finger points to. In some embodiments, the computer device 20 carrying or externally connected to the camera 30 may also be fixed or placed at another position; in use, the user holds the text material 40 with one hand or another device within the shooting range of the camera 30 and points a finger at the text content to be recognized or translated, the camera 30 captures this scene to obtain a video frame image, and the computer device 20 obtains the video frame image and processes it to recognize the text content the user's finger points to.
After the text content the user's finger points to is recognized, in a word-lookup application scenario, a dictionary query is performed on the recognized text content to obtain the annotation of the target word in the dictionary, and the target word obtained by the query and its annotation are then displayed on the display screen of the computer device 20, as shown in FIG. 1. In a retrieval application scenario, the recognized text content is retrieved, and the retrieval result is displayed on the display screen of the computer device 20. In other application scenarios, other further processing may also be performed after the text content is recognized. The computer device 20 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices.
In one embodiment, as shown in FIG. 2, a text content recognition method is provided, which is described taking application of the method to the computer device 20 in FIG. 1 as an example, and includes the following steps S201 to S205.
Step S201: Acquire a currently collected video frame image.
The video frame image may be obtained by video shooting with a camera carried by the computer device itself or a camera externally connected to the computer device.
Step S202: Detect a fingertip in the video frame image, and obtain a fingertip position of the fingertip.
When detecting the fingertip in the video frame image, various possible manners may be used to detect the fingertip position of the fingertip in the video frame image. The fingertip position may specifically be the detected position of the fingertip in the video frame image.
In some embodiments of the present application, the fingertip position may be obtained by detecting the video frame image with a pre-trained fingertip recognition model, which may specifically include:
detecting the fingertip in the video frame image with the trained fingertip recognition model;
when the fingertip recognition model detects a fingertip, outputting the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip;
when the fingertip recognition model does not detect a fingertip, outputting preset return position coordinates, where the preset return position coordinates are coordinates that do not belong to the coordinate range of the video frame image.
Thus, during detection with the fingertip recognition model, when a fingertip is present in the video frame image, the coordinates of the detected fingertip position can be output directly to obtain the fingertip position; when no fingertip is detected in the video frame image, the fingertip recognition model outputs the preset return position coordinates. This avoids the logical error of the model being unable to output a result when no fingertip is detected, ensuring the accuracy and availability of the recognition results of the fingertip recognition model.
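The sentinel-output convention described above can be sketched as follows. The functions, the `(-1.0, -1.0)` sentinel value, and the candidate-list interface are illustrative assumptions, not details fixed by the application; the application only requires that the no-detection output lie outside the image's coordinate range.

```python
# Sentinel convention: when no fingertip is found, emit coordinates that
# cannot belong to the video frame image, so callers always receive a
# coordinate pair. The (-1.0, -1.0) value is an assumed choice.
SENTINEL = (-1.0, -1.0)

def fingertip_position(detections, width, height):
    """Return the first in-range fingertip candidate, or the sentinel.

    `detections` is a list of (x, y) candidates produced by some
    fingertip recognition model (not specified here).
    """
    for x, y in detections:
        if 0 <= x < width and 0 <= y < height:
            return (x, y)
    return SENTINEL

def is_valid(position, width, height):
    """True when `position` lies inside the image coordinate range."""
    x, y = position
    return 0 <= x < width and 0 <= y < height
```

Downstream steps can then call `is_valid` on every model output instead of special-casing a missing detection.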
In some embodiments, the fingertip recognition model may be trained in the following manner.
First, fingertip sample images are acquired. Some fingertip sample images contain a fingertip, and the corresponding specific fingertip position is recorded; this position is the target fingertip position of that sample image. Some fingertip sample images contain no fingertip, and their corresponding fingertip position may be set to the preset return position coordinates, which then serve as the target fingertip position of that sample image.
Then, the fingertip sample images are input into the fingertip recognition model for processing, and the model processes the input sample images to obtain training results of sample fingertip positions.
The sample fingertip position of each fingertip sample image is compared with the target fingertip position of that image, and the model training error is calculated according to the comparison result. The function for calculating the model training error may be any possible error function, such as mean square error, which is not specifically limited in the embodiments of the present application.
If the model training error meets the error requirement and the number of training iterations reaches the preset number of iterations, it is determined that the training end condition is met, and the fingertip recognition model from the last training round is determined as the trained fingertip recognition model. Otherwise, it is determined that the training end condition is not met; after the model parameters of the fingertip recognition model to be trained are adjusted, the process returns to the step of inputting the fingertip sample images into the fingertip recognition model for processing, until the model training end condition is reached.
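The labeling convention and error computation above can be sketched numerically: samples with a fingertip get the annotated coordinates as their target, samples without one get the preset return coordinates, and a mean-square error is computed over the batch. The sentinel value and the exact error function are illustrative assumptions.

```python
# Assumed sentinel target for sample images containing no fingertip.
SENTINEL = (-1.0, -1.0)

def make_target(fingertip_xy):
    """Target fingertip position for one sample image.

    `fingertip_xy` is the annotated coordinate, or None when the sample
    image contains no fingertip.
    """
    return fingertip_xy if fingertip_xy is not None else SENTINEL

def mean_square_error(predicted, targets):
    """Mean squared coordinate error over a batch of sample images."""
    total = 0.0
    for (px, py), (tx, ty) in zip(predicted, targets):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(predicted)
```

Training would repeat: predict, compute this error, adjust parameters, until the error requirement and iteration count are both met.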
In some embodiments, after the fingertip position of the fingertip is obtained, the method may further include the steps of:
determining, based on the fingertip position, whether the fingertip is in a fingertip-stable state;
when the fingertip is in the fingertip-stable state, entering the step of intercepting the candidate region in the video frame image taking the fingertip position as a reference;
when the fingertip is not in the fingertip-stable state, returning to the step of detecting the fingertip in the video frame image.
Thus, the subsequent processing flow is entered only after the fingertip position is detected and the fingertip is in a stable state, avoiding the resource consumption and low recognition accuracy caused by entering the subsequent flow upon a false fingertip detection.
In some embodiments, the fingertip may be determined to be in the fingertip-stable state when the offset between the fingertip position and the fingertip positions in a preset number of preceding adjacent video frame images is less than a preset offset. The preset offset may be set according to actual technical needs; in some embodiments of the present application, the preset offset may be set based on the resolution of the video capture device (i.e., the above-mentioned camera) that captures the video frame image, for example, the preset offset is positively correlated with the resolution of the video capture device. In some specific examples, the preset offset may be set to 10 pixel values.
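The stability check above can be sketched as follows. The application mentions a 10-pixel threshold as one example; the history length of 3 preceding frames and the per-axis comparison are illustrative assumptions.

```python
from collections import deque

# The fingertip is treated as stable when its position in the current
# frame deviates from each of the preceding HISTORY_SIZE frames by less
# than PRESET_OFFSET pixels on both axes.
PRESET_OFFSET = 10.0   # pixels; 10 is the example value from the text
HISTORY_SIZE = 3       # assumed number of preceding frames compared

history = deque(maxlen=HISTORY_SIZE)

def is_fingertip_stable(position):
    """Record `position` and report whether it was stable relative to
    the positions seen in the preceding frames."""
    stable = len(history) == HISTORY_SIZE and all(
        abs(position[0] - x) < PRESET_OFFSET
        and abs(position[1] - y) < PRESET_OFFSET
        for x, y in history
    )
    history.append(position)
    return stable
```

Until the history fills, the check reports unstable, which matches the intent of returning to detection when stability is not yet established.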
Step S203: Taking the fingertip position as a reference, intercept a candidate region in the video frame image.
Taking the fingertip position obtained above as a reference, various possible manners may be used to intercept the candidate region from the video frame image.
In some embodiments, intercepting the candidate region in the video frame image taking the fingertip position as a reference includes:
taking the fingertip position as the center point, expanding by a first number of pixels to both sides along a first coordinate axis, and expanding by a second number of pixels to both sides along a second coordinate axis;
determining the region formed by the expanded positions as the candidate region in the video frame image.
The first coordinate axis may be the axis along one side of the video frame image, and the second coordinate axis may be the axis along another side of the video frame image. The first number of pixels and the second number of pixels may be set to be the same or different.
The specific values of the first number of pixels and the second number of pixels may be determined in combination with the specific placement position of the computer device 20 shown in FIG. 1 and the resolution and height of the camera 30. Usually, when a user uses the computer device 20, there are some common usage scenarios, for example, the device is placed on a study desk or fixed at a specific position, and when recognizing text content, it usually recognizes text content in books, textbooks, or other printed texts of conventional printing size. Therefore, the first number of pixels and the second number of pixels may be determined in combination with, under these common usage scenarios, the resolution of the camera 30 carried by or externally connected to the computer device 20, the vertical distance of the camera 30 relative to the text content the finger points to, and the like. In other embodiments, the first number of pixels and the second number of pixels may also be obtained from user input.
In some embodiments, determining the region formed by the expanded positions as the candidate region in the video frame image includes:
when an expanded position lies outside the boundary of the video frame image, determining the region formed by the image boundary corresponding to that expanded position and the other expanded positions as the candidate region in the video frame image.
Thus, when an expanded position lies outside the boundary of the video frame image, the boundary corresponding to that expanded position can serve as the basis for determining the candidate region, improving the accuracy of the determined candidate region.
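The expansion and boundary handling above can be sketched as follows: expand from the fingertip position along each axis, then clamp any side that falls outside the frame to the frame boundary. The expansion amounts `dx` and `dy` stand in for the "first number" and "second number" of pixels; the values below are illustrative, not from the application.

```python
def candidate_region(tip_x, tip_y, width, height, dx=160, dy=80):
    """Return (left, top, right, bottom) of the candidate region.

    dx / dy play the role of the first and second number of pixels.
    A side that would fall outside the video frame image is replaced
    by the corresponding image boundary.
    """
    left = max(tip_x - dx, 0)
    right = min(tip_x + dx, width)
    top = max(tip_y - dy, 0)
    bottom = min(tip_y + dy, height)
    return (left, top, right, bottom)
```

A fingertip near the center yields a full 2·dx by 2·dy region, while a fingertip near a corner yields a region clipped to the image boundary, matching the boundary case described above.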
Step S204: Perform fingertip regression positioning on the candidate region and perform text detection on the candidate region, to obtain a text region in the candidate region.
When fingertip regression positioning and text detection are performed on the candidate region to obtain the text region in it, finer-grained fingertip localization is further performed on the candidate region to obtain a more precise fingertip position, and the text region in the candidate region is determined accordingly, thereby improving the accuracy of the obtained text region.
In some embodiments, performing fingertip regression positioning on the candidate region and performing text detection on the candidate region to obtain the text region in the candidate region includes:
performing fingertip localization and text detection on the candidate region with a trained comprehensive model of fingertip regression and text detection, to obtain the text region in the candidate region.
The comprehensive model of fingertip regression and text detection may be a model structure that performs fingertip localization and text region detection simultaneously, and it is adjusted using multiple errors: the fingertip localization error of fingertip localization and the text region error of text region detection.
In some embodiments, the manner of training to obtain the comprehensive model of fingertip regression and text detection includes the following steps S2041 to S2046.
Step S2041: Acquire image samples to be trained, as well as target text regions and target fingertip positions corresponding to the image samples to be trained.
An image sample to be trained may be an image containing a fingertip with the fingertip pointing at text content. An image in which the fingertip position and text region have already been identified may be used as an image sample to be trained; alternatively, after an image sample is obtained, the fingertip position and text region may be manually identified, and the image sample then used as an image sample to be trained. In other embodiments, image samples to be trained may also be obtained in other ways, as long as each image sample to be trained has a clearly corresponding target text region and target fingertip position.
Step S2042: Perform fingertip localization and text detection on the image samples to be trained with the to-be-trained comprehensive model of fingertip regression and text detection, to obtain detected sample text regions and sample fingertip positions.
The image sample to be trained is input into the to-be-trained comprehensive model of fingertip regression and text detection, which processes the image sample to obtain the sample fingertip position detected from the image sample and the sample text region corresponding to that sample fingertip position.
Step S2043: Calculate and determine a text region loss based on the sample text region and the target text region corresponding to the image sample to be trained.
The text region loss is a parameter characterizing the difference between the sample text region and the target text region of the image sample to be trained, reflecting the loss of the text region of the image sample during detection. The text region loss may be calculated and determined in various possible ways; for example, in some embodiments, the difference between the target text region and the sample text region may be used as the text region loss, while in other embodiments, other ways, such as mean square error, may also be used to calculate the text region loss based on the target text region and the sample text region.
Step S2044: Calculate and determine a fingertip localization loss based on the sample fingertip position and the target fingertip position corresponding to the image sample to be trained.
The fingertip localization loss is a parameter characterizing the difference between the sample fingertip position and the target fingertip position of the image sample to be trained, reflecting the loss of fingertip localization of the image sample during detection. The fingertip localization loss may be calculated and determined in various possible ways; for example, in some embodiments, the difference between the target fingertip position and the sample fingertip position may be used as the fingertip localization loss, while in other embodiments, other ways, such as mean square error, may also be used to calculate the fingertip localization loss based on the target fingertip position and the sample fingertip position.
Step S2045: Combine the text region loss and the fingertip localization loss to determine a model loss.
The model loss may be comprehensively determined by combining the text region loss and the fingertip localization loss. In some embodiments, the sum of the text region loss and the fingertip localization loss of each image sample to be trained may be used as the model loss. In other embodiments, the text region loss and the fingertip localization loss of each image sample to be trained may be weighted and summed, the weighted sums of all image samples to be trained then summed, and the resulting sum used as the model loss. In still other embodiments, the model loss may also be determined by combining the text region loss and the fingertip localization loss of each image sample to be trained in other ways.
若基于上述计算得到的模型损失确定满足模型训练结束条件,则将最后一次训练的待训练指尖回归与文本检测综合模型,作为训练得到的指尖回归与文本检测综合模型。一些实施例中,在确定满足模型训练结束条件时,也可以是在最后一次训练的待训练指尖回归与文本检测综合模型的基础上,通过去掉指尖定位的输出部分,以得到最终的训练得到的指尖回归与文本检测综合模型,使得在最终进行使用时,可以只需要输出文本区域即可。If it is determined that the model training end condition is satisfied based on the model loss obtained by the above calculation, the last training to be trained fingertip regression and text detection comprehensive model is used as the fingertip regression and text detection comprehensive model obtained by training. In some embodiments, when it is determined that the end condition of model training is met, the final training can be obtained by removing the output part of fingertip positioning on the basis of the comprehensive model of fingertip regression and text detection to be trained in the last training. The obtained comprehensive model of fingertip regression and text detection makes it possible to output only the text area in the final use.
模型训练结束条件可以结合实际技术需要进行设定，在一些实施例中，可以是在所述模型损失小于或者等于预设损失量且达到预设模型迭代次数时，确定达到模型训练结束条件。在其他实施例中，也可以采用其他的方式来确定模型训练结束条件。The model training end condition may be set according to actual technical requirements. In some embodiments, it may be determined that the model training end condition is reached when the model loss is less than or equal to a preset loss amount and a preset number of model iterations has been reached. In other embodiments, other manners may also be used to determine the model training end condition.
若基于上述计算得到的模型损失确定不满足模型训练结束条件,则进入下述步骤S2046。If it is determined based on the model loss obtained by the above calculation that the model training end condition is not met, the following step S2046 is entered.
步骤S2046：基于所述模型损失调整所述待训练指尖回归与文本检测综合模型，返回通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测的步骤，直至达到模型训练结束条件。Step S2046: Adjust the fingertip regression and text detection comprehensive model to be trained based on the model loss, and return to the step of performing fingertip positioning and text detection on the image samples to be trained through the model to be trained, until the model training end condition is reached.
在对待训练指尖回归与文本检测综合模型进行调整时,可以调整与指尖定位相关的模型参数,也可以是调整与文本区域检测相关的模型参数。一些实施例中,可以结合文本区域损失和指尖定位损失来确定如何对模型参数进行调整。在其他实施例中,也可以采用其他方式来确定对模型参数进行调整的调整策略。When adjusting the comprehensive model of fingertip regression and text detection to be trained, model parameters related to fingertip positioning can be adjusted, and model parameters related to text area detection can also be adjusted. In some embodiments, the text area loss and the fingertip localization loss may be combined to determine how to adjust the model parameters. In other embodiments, other methods may also be used to determine the adjustment strategy for adjusting the model parameters.
步骤S205:识别所述文本区域中的文本内容。Step S205: Identify the text content in the text area.
在获得文本区域后，则可以针对文本区域进行文本内容识别，获得文本区域中的文本内容。可以采用各种可能的方式识别出文本区域中的文本内容，例如采用OCR(Optical Character Recognition,光学字符识别)识别出文本内容。在其他实施例中，也可以采用其他的方式识别出文本区域中的文本内容，本申请实施例不做具体限定。After the text area is obtained, text content recognition may be performed on the text area to obtain the text content in the text area. Various possible ways can be used to recognize the text content in the text area, for example, OCR (Optical Character Recognition) is used to recognize the text content. In other embodiments, other methods may also be used to recognize the text content in the text area, which is not specifically limited in the embodiments of the present application.
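The recognition step can be sketched as cropping the detected text area out of the frame and handing it to whatever OCR engine is deployed. In the following hedged Python illustration, `ocr_engine` is a stand-in assumption for the actual recognizer, and the frame is represented as nested lists of pixels:

```python
# Minimal sketch of the recognition step described above. The `ocr_engine`
# callable is a hypothetical stand-in for any OCR recognizer; this sketch
# only performs the region crop before delegating to it.

def recognize_text(frame, text_box, ocr_engine):
    """frame: pixel rows as nested lists; text_box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = text_box
    crop = [row[x1:x2] for row in frame[y1:y2]]  # cut the text area out
    return ocr_engine(crop)
```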
基于如上所述的本申请实施例中的方法，其通过检测采集的视频帧图像中的手指指尖获得指尖位置后，以检测的指尖位置为基准截取候选区域后，针对该候选区域进行进一步地指尖回归定位，以此为基础对候选区域进行文本检测，获得文本区域后，识别文本区域中的文本内容，以此实现指尖定位的文本内容识别过程，在这个过程中，结合两次指尖定位来最终实现文本区域的检测，初次通过高速指尖定位的方式来截取候选区域，再通过进一步的指尖回归定位来检测出候选区域中的文本区域，在提高文本内容检测的效率的基础上，提高了检测的文本内容的准确度。Based on the method in the embodiments of the present application described above, after the fingertip position is obtained by detecting the fingertip in the collected video frame image, a candidate area is intercepted with the detected fingertip position as a reference, further fingertip regression positioning is performed on the candidate area, and on this basis text detection is performed on the candidate area; after the text area is obtained, the text content in the text area is recognized, thereby realizing a fingertip-positioned text content recognition process. In this process, two rounds of fingertip positioning are combined to finally realize the detection of the text area: the candidate area is first intercepted by means of high-speed fingertip positioning, and the text area in the candidate area is then detected through further fingertip regression positioning, which improves the accuracy of the detected text content on the basis of improving the efficiency of text content detection.
基于如上所述的实施例中的方法,以下结合其中一个应用示例进行详细举例说明。Based on the method in the above-mentioned embodiment, a detailed description is given below in conjunction with one of the application examples.
参考图4所示,在一个具体示例中,本申请实施例在进行文本内容识别时,首先通过摄像装置进行拍摄得到桌面视频流,该桌面视频流会包含多帧的视频帧图像。Referring to FIG. 4 , in a specific example, when text content recognition is performed in the embodiment of the present application, a desktop video stream is first obtained by shooting with a camera, and the desktop video stream may include multiple frames of video frame images.
然后,对桌面视频流进行高速指尖定位,即快速、轻量地对视频帧图像中的指尖位置进行识别,从而通过高速指尖检测定位可以直接帮助后面的稳定性判断以及候选区域提取。Then, high-speed fingertip positioning is performed on the desktop video stream, that is, the position of the fingertip in the video frame image is quickly and lightly identified, so that the high-speed fingertip detection and positioning can directly help the subsequent stability judgment and candidate region extraction.
其中，用以进行高速指尖定位的指尖识别模型，一些实施例中可以采用CNN模型，其特征提取backbone可以根据业务场景的特点进行选择（如ResNet、MobileNet……），后面回归模块使用dense层，损失loss使用均方误差（Mean Square Error，MSE）等。Among them, for the fingertip recognition model used for high-speed fingertip positioning, a CNN model may be adopted in some embodiments; its feature-extraction backbone may be selected according to the characteristics of the business scenario (such as ResNet, MobileNet, ...), the subsequent regression module uses dense layers, and the loss uses mean square error (Mean Square Error, MSE), and so on.
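The MSE regression loss mentioned above can be written out explicitly. The following is a framework-free Python sketch for illustration only; a real system would use a deep learning library's built-in MSE loss alongside the CNN backbone:

```python
# Mean square error between a predicted and a target fingertip coordinate,
# as used to train the regression module described above. Pure-Python
# illustration; not tied to any particular deep learning framework.

def mse_loss(pred, target):
    """pred/target: coordinate tuples such as (x, y)."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```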
本申请实施例方案中，通过指尖识别模型对桌面视频流进行高速指尖定位和检测，指尖识别模型对指尖位置进行识别，在桌面视频流中未识别到手指指尖时，指尖识别模型输出的回归目标为预设回归位置坐标，该预设回归位置坐标可以设置为未在摄像装置的拍摄范围的坐标，例如(-1,-1)。如果在桌面视频流中识别到手指指尖，则指尖识别模型输出的回归目标为其真实坐标，即该指尖识别模型识别的指尖位置坐标（例如（123,56）），即获得手指指尖的指尖位置。In the solution of the embodiments of the present application, high-speed fingertip positioning and detection are performed on the desktop video stream through the fingertip recognition model. The fingertip recognition model recognizes the fingertip position; when no fingertip is recognized in the desktop video stream, the regression target output by the fingertip recognition model is a preset regression position coordinate, which may be set to a coordinate outside the shooting range of the camera device, such as (-1, -1). If a fingertip is recognized in the desktop video stream, the regression target output by the fingertip recognition model is its real coordinates, that is, the fingertip position coordinates recognized by the model (for example (123, 56)), thereby obtaining the fingertip position of the fingertip.
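The target encoding described above can be sketched as follows. This is a hedged illustration: Python, the helper names, and the in-frame check are assumptions of the sketch, while (-1, -1) and (123, 56) are the example values from the text:

```python
# Encoding of the regression target: a detected fingertip maps to its real
# pixel coordinates, while "no fingertip" maps to a preset position outside
# the camera's shooting range, (-1, -1) in the example above.

NO_FINGERTIP = (-1, -1)  # preset coordinate outside the image coordinate range

def regression_target(detected_xy):
    """detected_xy: (x, y) if a fingertip was detected, else None."""
    return detected_xy if detected_xy is not None else NO_FINGERTIP

def fingertip_present(xy, width, height):
    """True only when the output coordinate lies inside the frame."""
    x, y = xy
    return 0 <= x < width and 0 <= y < height
```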
在获得上述指尖位置之后，因为用户的指尖可能还没定下来，因此通过进一步判断指尖的稳定性，并在指尖稳定的情况下，才会进行下一步的处理过程。在一些实施例中，可以通过计算视频相邻帧的指尖位置的坐标变化来判断，例如如果某一帧的指尖位置相比于前预设数目的相邻视频帧（例如前3帧以上的相邻视频帧）的坐标偏移量小于预设偏移量，比如10个像素值（应当理解，这个值可以结合摄像装置的分辨率调整或者其他因素的考虑进行设定），则认为指尖稳定。一些具体示例中，预设偏移量的设定，可以结合对稳定的精度要求进行设定。若对稳定性要求相对较低，则预设偏移量可以设置的相对较大，例如15个像素值，即手指的指尖位置在前预设数目的相邻视频帧的坐标偏移量都在15个像素值的范围内，则认为指尖稳定，这种方式下，对于用户来说，用户可以在相对较大的范围内活动，有助于提高用户体验。若对稳定性要求相对较高，则预设偏移量可以设置的相对较小，例如5个像素值，即手指的指尖位置在前预设数目的相邻视频帧的坐标偏移量都在5个像素值的范围内，则认为指尖稳定，这种方式下，对于用户来说，需要用户只能在相对较小的范围内活动。在实际技术场景中，可以结合摄像装置的分辨率调整、用户体验、稳定性精度等综合考虑。After the above fingertip position is obtained, because the user's fingertip may not yet have settled, the stability of the fingertip is further judged, and the next processing step is performed only when the fingertip is stable. In some embodiments, this can be judged by calculating the coordinate change of the fingertip position across adjacent video frames: for example, if the coordinate offset of the fingertip position in a certain frame relative to a preset number of preceding adjacent video frames (for example, the preceding 3 or more adjacent video frames) is less than a preset offset, such as 10 pixel values (it should be understood that this value can be set in combination with the resolution of the camera device or other considerations), the fingertip is considered stable. In some specific examples, the preset offset can be set in combination with the required stability precision. If the stability requirement is relatively low, the preset offset can be set relatively large, for example 15 pixel values; that is, as long as the coordinate offsets of the fingertip position relative to the preceding preset number of adjacent video frames are all within 15 pixel values, the fingertip is considered stable. In this way, the user can move within a relatively large range, which helps to improve the user experience. If the stability requirement is relatively high, the preset offset can be set relatively small, for example 5 pixel values; that is, the fingertip is considered stable only when the coordinate offsets relative to the preceding preset number of adjacent video frames are all within 5 pixel values. In this way, the user can only move within a relatively small range. In actual technical scenarios, the resolution of the camera device, user experience, stability precision and so on can be comprehensively considered.
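The stability judgment described above can be sketched in Python as follows. Using a per-axis (Chebyshev-style) comparison is an assumption of this illustration, since the text only speaks of a coordinate offset; `n = 3` and `max_offset = 10` follow the examples in the text:

```python
# Hedged sketch of the fingertip stability judgment: the fingertip is
# considered stable when its position in the current frame deviates from
# each of the previous n frames by less than max_offset pixels per axis.

def fingertip_stable(current, history, n=3, max_offset=10):
    """current: (x, y); history: earlier (x, y) positions, most recent last."""
    if len(history) < n:
        return False  # not enough preceding frames yet to judge stability
    cx, cy = current
    return all(abs(cx - px) < max_offset and abs(cy - py) < max_offset
               for px, py in history[-n:])
```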
在指尖稳定之后，围绕上述高速指尖定位确定的指尖位置，截取一定的区域作为候选区域，例如以指尖位置为中心，向第一坐标轴和第二坐标轴的两侧外分别扩展第一数目像素和第二数目像素（例如100像素，具体可以根据设备拍摄的像素进行确定）。一个实施例中，如图4所示，基于指尖位置51确定截取区域52后，指尖位置51位于截取区域52的中心。当扩展后的位置位于视频帧图像之外时，则直接基于边界来确定截取区域。一个实施例中，如图5所示，基于指尖位置51往外扩展的区域如图5中的虚线框所示，此时，则基于向下扩展后的扩展边界61、向左扩展后的边界62、向上扩展后的边界63以及视频帧图像的边界60共同确定出截取的候选区域。在其他实施例中，也可以采用其他的方式确定出截取区域。通过截取候选区域可以有效提高二次指尖定位精度，同时可以使文本检测的准确度有所提升。After the fingertip is stable, a certain area around the fingertip position determined by the above high-speed fingertip positioning is intercepted as the candidate area, for example, taking the fingertip position as the center and extending a first number of pixels and a second number of pixels (for example, 100 pixels, which can be determined according to the pixels captured by the device) to the two sides of the first coordinate axis and the second coordinate axis respectively. In one embodiment, as shown in FIG. 4, after the interception area 52 is determined based on the fingertip position 51, the fingertip position 51 is located at the center of the interception area 52. When an extended position lies outside the video frame image, the interception area is determined directly based on the boundary. In one embodiment, as shown in FIG. 5, the area extended outward based on the fingertip position 51 is shown by the dotted box in FIG. 5; in this case, the intercepted candidate area is jointly determined by the downward-extended boundary 61, the leftward-extended boundary 62, the upward-extended boundary 63, and the boundary 60 of the video frame image. In other embodiments, the interception area may also be determined in other ways. Intercepting the candidate area in this way can effectively improve the accuracy of the secondary fingertip positioning, and at the same time improve the accuracy of text detection.
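The candidate-area interception with boundary handling (the FIG. 5 case) can be sketched as follows. This Python illustration assumes clamping against the frame edges, and `dx = dy = 100` follows the 100-pixel example in the text:

```python
# Sketch of candidate-area interception: extend dx pixels to both sides of
# the first axis and dy pixels to both sides of the second axis, centred on
# the fingertip, then clamp to the frame boundary when the expansion runs
# outside the video frame image.

def candidate_region(tip, frame_w, frame_h, dx=100, dy=100):
    """tip: (x, y) fingertip position. Returns (x1, y1, x2, y2)."""
    x, y = tip
    x1 = max(0, x - dx)
    y1 = max(0, y - dy)
    x2 = min(frame_w, x + dx)
    y2 = min(frame_h, y + dy)
    return (x1, y1, x2, y2)
```

When the fingertip is far from the edges, the fingertip lies at the center of the returned area, as in FIG. 4.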
在获得上一步提取的候选区域后，进一步采用指尖回归与文本检测综合模型确定出文本区域。其中，文本内容的提取与指尖定位和文本检测都有关，而指尖的坐标和文本内容的坐标是高度相关的，因此如图6所示，本实施例方案中使用多重损失的方案来训练得到指尖回归与文本检测综合模型，以此进行指尖定位和文本检测，并输出识别出的指尖位置对应的文本区域。After the candidate area extracted in the previous step is obtained, the fingertip regression and text detection comprehensive model is further used to determine the text area. The extraction of text content is related to both fingertip positioning and text detection, and the coordinates of the fingertip and the coordinates of the text content are highly correlated. Therefore, as shown in FIG. 6, the solution of this embodiment uses a multiple-loss scheme to train the fingertip regression and text detection comprehensive model, with which fingertip positioning and text detection are performed, and the text area corresponding to the recognized fingertip position is output.
然后,将识别出的文本区域输入到文本识别模型中进行识别,从而识别出具体的文本内容。Then, the recognized text area is input into the text recognition model for recognition, so as to recognize the specific text content.
在一些实施例中，以查词应用为例，在识别出文本内容后，可以进一步查询词典获得识别出的文本内容的释义，并将获得的释义在计算机设备上的显示屏上显示，如图1所示。In some embodiments, taking a word-lookup application as an example, after the text content is recognized, a dictionary may be further queried to obtain the definition of the recognized text content, and the obtained definition may be displayed on the display screen of the computer device, as shown in FIG. 1.
如上所述的本实施例的方案，充分考虑桌面学习设备的应用场景，先进行快速的指尖定位，据此确定截取区域，再针对截取区域进行更精细化的指尖回归定位和文本区域的检测，然后再进行文本内容的识别，其由粗到细的方式既能降低整体计算量，也能较好地提升回归坐标的精度；另外，将指尖定位和文本检测合并到一个模型里面，可以进一步提升定位精度。The solution of this embodiment described above fully considers the application scenario of desktop learning devices: fast fingertip positioning is performed first, the interception area is determined accordingly, and then more refined fingertip regression positioning and text area detection are performed on the interception area, followed by recognition of the text content. This coarse-to-fine approach can both reduce the overall amount of computation and better improve the precision of the regression coordinates; in addition, merging fingertip positioning and text detection into one model can further improve the positioning precision.
应该理解的是，虽然如上所述的各实施例涉及的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，这些流程图中的至少一部分步骤可以包括多个步骤或者多个阶段，这些步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily executed and completed at the same moment but may be executed at different moments; their execution order is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
在一个实施例中,如图7所示,提供了一种文本内容识别装置,所述装置包括:In one embodiment, as shown in FIG. 7, a text content recognition apparatus is provided, and the apparatus includes:
图像采集模块701,设置为获取当前采集的视频帧图像;The image acquisition module 701 is configured to acquire the currently acquired video frame image;
指尖位置检测模块702,设置为检测所述视频帧图像中的手指指尖,并获得所述手指指尖的指尖位置;A fingertip position detection module 702, configured to detect the fingertip in the video frame image, and obtain the fingertip position of the fingertip;
候选区域截取模块703,设置为以所述指尖位置为基准,截取所述视频帧图像中的候选区域;The candidate area interception module 703 is set to intercept the candidate area in the video frame image based on the position of the fingertip;
指尖回归和文本区域检测模块704,设置为通过对所述候选区域进行指尖回归定位,对所述候选区域进行文本检测,获得所述候选区域中的文本区域;The fingertip regression and text area detection module 704 is configured to perform text detection on the candidate area by performing fingertip regression positioning on the candidate area to obtain the text area in the candidate area;
内容识别模块705,设置为识别所述文本区域中的文本内容。The content recognition module 705 is configured to recognize the text content in the text area.
一些实施例中,所述装置还包括:指尖稳定态确定模块,设置为基于所述指尖位置,确定所述手指指尖是否处于指尖稳定状态;In some embodiments, the device further includes: a fingertip stability determination module configured to determine whether the fingertip is in a fingertip stable state based on the fingertip position;
所述候选区域截取模块703，设置为在指尖稳定态确定模块的输出结果为手指指尖处于指尖稳定状态时，以所述指尖位置为基准，截取所述视频帧图像中的候选区域。The candidate area interception module 703 is configured to intercept the candidate area in the video frame image based on the fingertip position when the output result of the fingertip stable-state determination module is that the fingertip is in the fingertip stable state.
一些实施例中，指尖稳定态确定模块，在所述指尖位置与前预设数目的相邻视频帧图像中的手指指尖位置的偏移量小于预设偏移量时，确定所述手指指尖处于指尖稳定状态。In some embodiments, the fingertip stable-state determination module determines that the fingertip is in the fingertip stable state when the offset between the fingertip position and the fingertip positions in a preceding preset number of adjacent video frame images is less than a preset offset.
一些实施例中，所述指尖位置检测模块702，通过训练获得的指尖识别模型检测所述视频帧图像中的手指指尖；在所述指尖识别模型检测到手指指尖时，输出检测到的手指指尖的位置的坐标，获得所述手指指尖的指尖位置；在所述指尖识别模型未检测到手指指尖时，输出预设回归位置坐标，所述预设回归位置坐标为不属于所述视频帧图像的坐标范围的坐标。In some embodiments, the fingertip position detection module 702 detects the fingertip in the video frame image through a fingertip recognition model obtained by training; when the fingertip recognition model detects a fingertip, it outputs the coordinates of the detected fingertip position to obtain the fingertip position of the fingertip; when the fingertip recognition model does not detect a fingertip, it outputs preset regression position coordinates, where the preset regression position coordinates are coordinates that do not belong to the coordinate range of the video frame image.
一些实施例中，所述候选区域截取模块703，以所述指尖位置为中心点，向第一坐标轴的两侧分别扩展第一数目像素，向第二坐标轴的两侧分别扩展第二数目像素；根据扩展后的各指尖位置形成的区域确定所述视频帧图像中的候选区域。In some embodiments, the candidate area interception module 703 takes the fingertip position as the center point, extends a first number of pixels to both sides of the first coordinate axis, and extends a second number of pixels to both sides of the second coordinate axis; the candidate area in the video frame image is determined according to the area formed by the extended fingertip positions.
一些实施例中，所述候选区域截取模块703，设置为当扩展后的指尖位置位于所述视频帧图像的边界外时，将扩展后位于所述视频帧图像的边界外的扩展后的指尖位置对应的边界、与其他扩展后的指尖位置形成的区域确定为所述视频帧图像中的候选区域。In some embodiments, the candidate area interception module 703 is configured to, when an extended fingertip position lies outside the boundary of the video frame image, determine, as the candidate area in the video frame image, the area formed by the boundary corresponding to the extended fingertip position lying outside the boundary of the video frame image together with the other extended fingertip positions.
一些实施例中，指尖回归和文本区域检测模块704，设置为通过训练获得的指尖回归与文本检测综合模型对所述候选区域进行指尖定位和文本检测，获得所述候选区域中的文本区域。In some embodiments, the fingertip regression and text area detection module 704 is configured to perform fingertip positioning and text detection on the candidate area through a trained fingertip regression and text detection comprehensive model, to obtain the text area in the candidate area.
一些实施例中，还包括：指尖回归与文本检测综合模型训练模块，设置为获取待训练图像样本，以及与所述待训练图像样本对应的目标文本区域以及目标指尖位置；通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测，获得检测到的样本文本区域以及样本指尖位置；基于所述待训练图像样本对应的样本文本区域以及目标文本区域，计算确定文本区域损失；基于所述待训练图像样本对应的样本指尖位置以及目标指尖位置，计算确定指尖定位损失；结合所述文本区域损失和指尖定位损失，确定模型损失；基于所述模型损失调整所述待训练指尖回归与文本检测综合模型，返回通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测的步骤，直至达到模型训练结束条件。In some embodiments, the apparatus further includes: a fingertip regression and text detection comprehensive model training module, configured to obtain image samples to be trained, as well as target text areas and target fingertip positions corresponding to the image samples to be trained; perform fingertip positioning and text detection on the image samples to be trained through the fingertip regression and text detection comprehensive model to be trained, to obtain detected sample text areas and sample fingertip positions; calculate the text area loss based on the sample text area and the target text area corresponding to each image sample to be trained; calculate the fingertip localization loss based on the sample fingertip position and the target fingertip position corresponding to each image sample to be trained; determine the model loss by combining the text area loss and the fingertip localization loss; and adjust the fingertip regression and text detection comprehensive model to be trained based on the model loss, returning to the step of performing fingertip positioning and text detection on the image samples to be trained through the model to be trained, until the model training end condition is reached.
关于文本内容识别装置的具体限定可以参见上文中对于文本内容识别方法的限定,在此不再赘述。上述文本内容识别装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。For the specific limitation of the text content recognition device, reference may be made to the above limitation on the text content recognition method, which will not be repeated here. Each module in the above-mentioned text content recognition apparatus may be implemented in whole or in part by software, hardware and combinations thereof. The above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
在一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图8所示。该计算机设备包括通过系统总线连接的处理器、存储器、通信接口、显示屏和输入装置。其中，该计算机设备的处理器设置为提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的通信接口设置为与外部的终端进行有线或无线方式的通信，无线方式可通过WIFI、运营商网络、近场通信（Near Field Communication，NFC）或其他技术实现。该计算机程序被处理器执行时以实现一种文本内容识别方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 8. The computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the running of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is configured to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be implemented through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The computer program, when executed by the processor, implements a text content recognition method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, or a button, a trackball or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse, etc.
本领域技术人员可以理解，图8中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
在一个实施例中,提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机程序,该处理器执行计算机程序时实现如上所述的任一实施例中的方法。In one embodiment, there is provided a computer device comprising a memory and a processor, a computer program is stored in the memory, and the processor implements the method in any of the above-described embodiments when the processor executes the computer program.
在一个实施例中,提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现如上所述的任一实施例中的方法。In one embodiment, there is provided a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method in any of the above-described embodiments.
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。In one embodiment, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用，均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器（Read-Only Memory，ROM）、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器（Random Access Memory，RAM）或外部高速缓冲存储器。作为说明而非局限，RAM可以是多种形式，比如静态随机存取存储器（Static Random Access Memory，SRAM）或动态随机存取存储器（Dynamic Random Access Memory，DRAM）等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or another medium used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM).
以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope described in this specification.

Claims (11)

  1. 一种文本内容识别方法,包括:A text content recognition method, comprising:
    获取当前采集的视频帧图像;Get the currently captured video frame image;
    检测所述视频帧图像中的手指指尖,并获得所述手指指尖的指尖位置;Detecting the fingertip in the video frame image, and obtaining the fingertip position of the fingertip;
    以所述指尖位置为基准,截取所述视频帧图像中的候选区域;Taking the position of the fingertip as a benchmark, intercepting the candidate area in the video frame image;
    通过对所述候选区域进行指尖回归定位,对所述候选区域进行文本检测,获得所述候选区域中的文本区域;By performing fingertip regression positioning on the candidate region, and performing text detection on the candidate region, the text region in the candidate region is obtained;
    识别所述文本区域中的文本内容。Text content in the text area is identified.
  2. 根据权利要求1所述的方法，在获得所述手指指尖的指尖位置之后，以所述指尖位置为基准，截取所述视频帧图像中的候选区域之前，还包括：The method according to claim 1, after obtaining the fingertip position of the fingertip and before intercepting, with the fingertip position as a reference, the candidate region in the video frame image, further comprising:
    基于所述指尖位置,确定所述手指指尖是否处于指尖稳定状态;determining, based on the fingertip position, whether the fingertip is in a fingertip stable state;
    在所述手指指尖处于指尖稳定状态时,进入以所述指尖位置为基准,截取所述视频帧图像中的候选区域的步骤;When the fingertip is in a stable state of the fingertip, enter the step of intercepting the candidate region in the video frame image based on the fingertip position;
    在所述手指指尖不是处于指尖稳定状态时,返回检测所述视频帧图像中的手指指尖的步骤。When the fingertip is not in the steady state of the fingertip, return to the step of detecting the fingertip in the video frame image.
  3. 根据权利要求2所述的方法,其中,The method of claim 2, wherein,
    在所述指尖位置与前预设数目的相邻视频帧图像中的手指指尖位置的偏移量小于预设偏移量时,确定所述手指指尖处于指尖稳定状态。When the offset between the position of the fingertip and the position of the fingertip in the previous preset number of adjacent video frame images is less than a preset offset, it is determined that the fingertip is in a stable fingertip state.
  4. 根据权利要求1所述的方法,其中,检测所述视频帧图像中的手指指尖,获得所述手指指尖的指尖位置,包括:The method according to claim 1, wherein detecting the fingertip in the video frame image to obtain the fingertip position of the finger, comprising:
    通过训练获得的指尖识别模型检测所述视频帧图像中的手指指尖;The fingertip recognition model obtained by training detects the fingertips in the video frame images;
    在所述指尖识别模型检测到手指指尖时,输出检测到的手指指尖的位置的坐标,获得所述手指指尖的指尖位置;When the fingertip recognition model detects the fingertip, output the coordinates of the position of the detected fingertip to obtain the fingertip position of the fingertip;
    在所述指尖识别模型未检测到手指指尖时,输出预设回归位置坐标,所述预设回归位置坐标为不属于所述视频帧图像的坐标范围的坐标。When the fingertip recognition model does not detect the fingertip, output preset return position coordinates, where the preset return position coordinates are coordinates that do not belong to the coordinate range of the video frame image.
  5. 根据权利要求1所述的方法,其中,以所述指尖位置为基准,截取所述视频帧图像中的候选区域,包括:The method according to claim 1, wherein, taking the fingertip position as a reference, intercepting a candidate region in the video frame image, comprising:
    以所述指尖位置为中心点,向第一坐标轴的两侧分别扩展第一数目像素,向第二坐标轴的两侧分别扩展第二数目像素;Taking the position of the fingertip as the center point, the first number of pixels are respectively extended to both sides of the first coordinate axis, and the second number of pixels are respectively extended to both sides of the second coordinate axis;
    根据扩展后的各指尖位置形成的区域确定所述视频帧图像中的候选区域。The candidate regions in the video frame image are determined according to the regions formed by the expanded positions of the fingertips.
  6. 根据权利要求5所述的方法,其中,根据扩展后的各指尖位置形成的区域确定所述视频帧图像中的候选区域,包括:The method according to claim 5, wherein determining the candidate region in the video frame image according to the region formed by the expanded positions of the fingertips comprises:
    当扩展后的指尖位置位于所述视频帧图像的边界外时，将扩展后位于所述视频帧图像的边界外的扩展后的指尖位置对应的边界、与其他扩展后的指尖位置形成的区域确定为所述视频帧图像中的候选区域。when an extended fingertip position lies outside the boundary of the video frame image, determining, as the candidate region in the video frame image, the region formed by the boundary corresponding to the extended fingertip position lying outside the boundary of the video frame image together with the other extended fingertip positions.
  7. 根据权利要求1所述的方法,其中,通过对所述候选区域进行指尖回归定位,对所述候选区域进行文本检测,获得所述候选区域中的文本区域,包括:The method according to claim 1, wherein, performing text detection on the candidate region by performing fingertip regression positioning on the candidate region to obtain the text region in the candidate region, comprising:
    通过训练获得的指尖回归与文本检测综合模型对所述候选区域进行指尖定位和文本检测,获得所述候选区域中的文本区域。Fingertip localization and text detection are performed on the candidate region by the comprehensive model of fingertip regression and text detection obtained by training, and the text region in the candidate region is obtained.
  8. 根据权利要求7所述的方法,其中,训练获得所述指尖回归与文本检测综合模型的方式包括:The method according to claim 7, wherein the manner of obtaining the comprehensive model of fingertip regression and text detection by training comprises:
    获取待训练图像样本,以及与所述待训练图像样本对应的目标文本区域以及目标指尖位置;Obtaining image samples to be trained, as well as the target text area and target fingertip positions corresponding to the image samples to be trained;
    通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测,获得检测到的样本文本区域以及样本指尖位置;Perform fingertip positioning and text detection on the to-be-trained image sample by using the to-be-trained fingertip regression and text detection integrated model to obtain the detected sample text area and the sample fingertip position;
    基于所述待训练图像样本对应的样本文本区域以及目标文本区域,计算确定文本区域损失;Calculate and determine the loss of the text region based on the sample text region and the target text region corresponding to the image sample to be trained;
    基于所述待训练图像样本对应的样本指尖位置以及目标指尖位置,计算确定指尖定位损失;Calculate and determine the fingertip positioning loss based on the sample fingertip position corresponding to the image sample to be trained and the target fingertip position;
    结合所述文本区域损失和指尖定位损失,确定模型损失;Combine the text region loss and fingertip localization loss to determine the model loss;
    基于所述模型损失调整所述待训练指尖回归与文本检测综合模型，返回通过待训练指尖回归与文本检测综合模型对所述待训练图像样本进行指尖定位和文本检测的步骤，直至达到模型训练结束条件。adjusting the fingertip regression and text detection comprehensive model to be trained based on the model loss, and returning to the step of performing fingertip positioning and text detection on the image samples to be trained through the fingertip regression and text detection comprehensive model to be trained, until the model training end condition is reached.
  9. A text content recognition apparatus, comprising:
    an image acquisition module, configured to obtain a currently captured video frame image;
    a fingertip position detection module, configured to detect a fingertip in the video frame image and obtain the fingertip position of the fingertip;
    a candidate region cropping module, configured to crop a candidate region from the video frame image with the fingertip position as a reference;
    a fingertip regression and text region detection module, configured to perform fingertip regression localization on the candidate region and text detection on the candidate region, to obtain a text region in the candidate region;
    a content recognition module, configured to recognize the text content in the text region.
  10. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the method of any one of claims 1 to 8.
  11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
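The apparatus of claim 9 decomposes into a simple pipeline: detect the fingertip in a frame, crop a candidate region around it, then detect and recognize the text. A hedged sketch with stub detectors; the brightest-pixel fingertip heuristic and the fixed crop half-size are placeholder assumptions standing in for the detection networks the claims describe:

```python
import numpy as np

def detect_fingertip(frame):
    """Stub fingertip detector: takes the brightest pixel as the fingertip.
    A real system would use the fingertip-detection model of the claims."""
    r, c = np.unravel_index(np.argmax(frame), frame.shape)
    return int(r), int(c)

def crop_candidate(frame, tip, half=2):
    """Crop a candidate region around the fingertip, clamped to the frame."""
    r, c = tip
    r0, r1 = max(r - half, 0), min(r + half + 1, frame.shape[0])
    c0, c1 = max(c - half, 0), min(c + half + 1, frame.shape[1])
    return frame[r0:r1, c0:c1]

def recognize_pipeline(frame):
    """Pipeline of claim 9: acquire -> locate fingertip -> crop candidate
    region; fingertip regression, text detection, and OCR would follow."""
    tip = detect_fingertip(frame)
    region = crop_candidate(frame, tip)
    return tip, region

# Example: a 10x10 grayscale frame with one bright "fingertip" pixel.
frame = np.zeros((10, 10))
frame[3, 7] = 1.0
tip, region = recognize_pipeline(frame)
```

Cropping around the fingertip before running the heavier joint model is what keeps the per-frame cost low: the detector only sees a small window instead of the full video frame.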
PCT/CN2022/082690 2021-03-29 2022-03-24 Method and apparatus for text content recognition, computer device, and storage medium WO2022206534A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110336251.7A CN115131693A (en) 2021-03-29 2021-03-29 Text content identification method and device, computer equipment and storage medium
CN202110336251.7 2021-03-29

Publications (1)

Publication Number Publication Date
WO2022206534A1 true WO2022206534A1 (en) 2022-10-06

Family

ID=83375572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/082690 WO2022206534A1 (en) 2021-03-29 2022-03-24 Method and apparatus for text content recognition, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN115131693A (en)
WO (1) WO2022206534A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070211071A1 (en) * 2005-12-20 2007-09-13 Benjamin Slotznick Method and apparatus for interacting with a visually displayed document on a screen reader
CN109325464A (en) * 2018-10-16 2019-02-12 上海翎腾智能科技有限公司 A kind of finger point reading character recognition method and interpretation method based on artificial intelligence
CN111078083A (en) * 2019-06-09 2020-04-28 广东小天才科技有限公司 Method for determining click-to-read content and electronic equipment
CN110443231A (en) * 2019-09-05 2019-11-12 湖南神通智能股份有限公司 A kind of fingers of single hand point reading character recognition method and system based on artificial intelligence
CN111242109A (en) * 2020-04-26 2020-06-05 北京金山数字娱乐科技有限公司 Method and device for manually fetching words

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909342A (en) * 2023-01-03 2023-04-04 湖北瑞云智联科技有限公司 Image mark recognition system and method based on contact point motion track
CN116939292A (en) * 2023-09-15 2023-10-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment
CN116939292B (en) * 2023-09-15 2023-11-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment

Also Published As

Publication number Publication date
CN115131693A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN110232311B (en) Method and device for segmenting hand image and computer equipment
US9697416B2 (en) Object detection using cascaded convolutional neural networks
US10943106B2 (en) Recognizing text in image data
WO2022206534A1 (en) Method and apparatus for text content recognition, computer device, and storage medium
CN109685055B (en) Method and device for detecting text area in image
US9052755B2 (en) Overlapped handwriting input method
CN110136198B (en) Image processing method, apparatus, device and storage medium thereof
WO2019041519A1 (en) Target tracking device and method, and computer-readable storage medium
US10847073B2 (en) Image display optimization method and apparatus
JP2021524951A (en) Methods, devices, devices and computer readable storage media for identifying aerial handwriting
US8917957B2 (en) Apparatus for adding data to editing target data and displaying data
CN110598559B (en) Method and device for detecting motion direction, computer equipment and storage medium
CN109033935B (en) Head-up line detection method and device
CN111160288A (en) Gesture key point detection method and device, computer equipment and storage medium
WO2022105569A1 (en) Page direction recognition method and apparatus, and device and computer-readable storage medium
CN111309618A (en) Page element positioning method, page testing method and related device
JP7429307B2 (en) Character string recognition method, device, equipment and medium based on computer vision
EP4030749A1 (en) Image photographing method and apparatus
US10067926B2 (en) Image processing system and methods for identifying table captions for an electronic fillable form
KR102440198B1 (en) VIDEO SEARCH METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
CN110431563B (en) Method and device for correcting image
CN110245570B (en) Scanned text segmentation method and device, computer equipment and storage medium
CN109190615A (en) Nearly word form identification decision method, apparatus, computer equipment and storage medium
CN111160265B (en) File conversion method and device, storage medium and electronic equipment
US11024305B2 (en) Systems and methods for using image searching with voice recognition commands

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778717

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22778717

Country of ref document: EP

Kind code of ref document: A1