CN113762045A - Click-to-read position identification method and device, click-to-read equipment and storage medium - Google Patents

Click-to-read position identification method and device, click-to-read equipment and storage medium

Info

Publication number
CN113762045A
CN113762045A (application CN202110488678.9A)
Authority
CN
China
Prior art keywords
preset target
image
target
preset
specific position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110488678.9A
Other languages
Chinese (zh)
Inventor
项小明
王禹
刘睿哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202110488678.9A
Publication of CN113762045A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a click-to-read position identification method and apparatus, a click-to-read device, and a storage medium. The method comprises: acquiring an image to be recognized in a click-to-read area; performing target detection on the image to be recognized to obtain a target detection result; if it is determined from the target detection result that the image to be recognized contains either a first preset target or a second preset target, outputting specific position information in that preset target, where the specific position represents a designated target position within the preset target; and if it is determined from the target detection result that the image contains both the first preset target and the second preset target, outputting the specific position information in the preset target with the higher priority, according to the preset priorities of the two targets. The method determines the position the user is pointing at by performing target detection on the image, enables click-to-read without a dedicated click-to-read pen, and supports click-to-read positions indicated by either of two targets, thereby enriching the supported click-to-read modes.

Description

Click-to-read position identification method and device, click-to-read equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying a click-to-read position, a click-to-read device, and a storage medium.
Background
Reading books is an important part of young people's study, and more and more parents choose paper books to better protect their children's eyes. When a young reader encounters an unknown word or content that is hard to understand while reading a paper book, an answer needs to be obtained. With the development of technology, products such as click-to-read machines have appeared. A click-to-read machine can identify the indicated position, recognize the corresponding book content, and give a corresponding response; for example, when the reader points at an unknown word, the machine provides its pronunciation and explanation, and when the reader points at an inequality, the machine provides a solution method.
In the related art, some click-to-read machines must use matched hardware to perform click-to-read: for example, magnetic devices are embedded at predetermined click-to-read positions in customized books, and a matched click-to-read pen is used to tap the corresponding position to trigger the corresponding content. This approach is highly restrictive.
Disclosure of Invention
In view of the above, it is necessary to provide a click-to-read position identification method and apparatus, a click-to-read device, and a storage medium that can support rich click-to-read modes, in order to solve the above technical problem.
A click-to-read position identification method, the method comprising:
acquiring an image to be recognized in a click-to-read area;
performing target detection on the image to be recognized to obtain a target detection result;
if it is determined from the target detection result that the image to be recognized contains a first preset target or a second preset target, outputting specific position information in the first preset target or the second preset target, wherein the specific position represents a designated target position within the preset target;
and if it is determined from the target detection result that the image to be recognized contains both the first preset target and the second preset target, outputting the specific position information in the preset target with the higher priority, according to the preset priorities of the first preset target and the second preset target.
A click-to-read position identification apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image to be recognized in a click-to-read area;
a target detection module, configured to perform target detection on the image to be recognized to obtain a target detection result;
a position information output module, configured to: if it is determined from the target detection result that the image to be recognized contains a first preset target or a second preset target, output specific position information in the first preset target or the second preset target, wherein the specific position represents a designated target position within the preset target; and if it is determined from the target detection result that the image to be recognized contains both the first preset target and the second preset target, output the specific position information in the preset target with the higher priority, according to the preset priorities of the first preset target and the second preset target.
A click-to-read device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
acquiring an image to be recognized in a click-to-read area;
performing target detection on the image to be recognized to obtain a target detection result;
if it is determined from the target detection result that the image to be recognized contains a first preset target or a second preset target, outputting specific position information in the first preset target or the second preset target, wherein the specific position represents a designated target position within the preset target;
and if it is determined from the target detection result that the image to be recognized contains both the first preset target and the second preset target, outputting the specific position information in the preset target with the higher priority, according to the preset priorities of the first preset target and the second preset target.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
acquiring an image to be recognized in a click-to-read area;
performing target detection on the image to be recognized to obtain a target detection result;
if it is determined from the target detection result that the image to be recognized contains a first preset target or a second preset target, outputting specific position information in the first preset target or the second preset target, wherein the specific position represents a designated target position within the preset target;
and if it is determined from the target detection result that the image to be recognized contains both the first preset target and the second preset target, reading the priorities of the first preset target and the second preset target and outputting the specific position information in the preset target with the higher priority.
The click-to-read position identification method and apparatus, click-to-read device, and storage medium acquire an image to be recognized in a click-to-read area and perform target detection on it, where the target detection covers a first preset target and a second preset target. If the target detection result contains only one preset target, the specific position within that preset target is output; if it contains both preset targets, the specific position information in the preset target with the higher priority is output and determined as the click-to-read position. The method determines the position the user is pointing at by performing target detection on the image, enables click-to-read without a dedicated click-to-read pen, and supports click-to-read positions indicated by either of two targets, such as a fingertip or a pen tip, thereby enriching the supported click-to-read modes.
Drawings
FIG. 1 is a flow chart of a click-to-read position identification method in one embodiment;
FIG. 2 is a flow chart of inputting image features into two or more attribute prediction branches and obtaining the attribute prediction result output by each branch, in one embodiment;
FIG. 3 is a flow chart of determining, after target detection, whether the image to be recognized contains the preset targets, in one embodiment;
FIG. 4 is a diagram illustrating a network architecture of a click-to-read location identification model in an exemplary embodiment;
FIG. 5(1) is a schematic diagram of the center position of the preset target (hand) in one embodiment;
FIG. 5(2) is a thermodynamic diagram corresponding to a predetermined target center position in an exemplary embodiment;
FIG. 6(1) is a diagram illustrating a specific location (fingertip) of a predetermined target in one embodiment;
FIG. 6(2) is a thermodynamic diagram corresponding to the position of a fingertip or tip in one embodiment;
FIG. 7(1) is a schematic diagram of the corresponding text content recognized and output by the click-to-read device according to the coordinate position of the click-to-read target in one embodiment;
FIG. 7(2) is a schematic diagram of the corresponding text content recognized and output by the click-to-read device according to the coordinate position of the click-to-read target in another embodiment;
FIG. 8 is a block diagram of the structure of a click-to-read position identification apparatus in one embodiment;
FIG. 9 is an internal structure diagram of the click-to-read device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In some embodiments, the click-to-read position identification method provided by this application can be applied to a click-to-read device. The device acquires an image to be recognized in the click-to-read area and performs target detection on it, where the target detection covers a first preset target and a second preset target. If the target detection result contains only one preset target, the position of the specific position within that preset target is output; if it contains both preset targets, the specific position information in the preset target with the higher priority is output and determined as the click-to-read position. The text content at that position is then recognized and output, achieving the click-to-read purpose.
In other embodiments, the click-to-read position identification method may be applied to a system comprising a click-to-read device and a server, which communicate over a network. The server acquires the image to be recognized in the click-to-read area from the click-to-read device and performs target detection on it, covering a first preset target and a second preset target. If the target detection result contains only one preset target, the position of the specific position within that preset target is output; if it contains both preset targets, the specific position information in the preset target with the higher priority is output and determined as the click-to-read position. Finally, the determined click-to-read position is fed back to the click-to-read device, so that the device can recognize and output the text content at that position, achieving the click-to-read purpose. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The click-to-read device may be, but is not limited to, a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, or smart watch with an image capture function. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
Cloud computing refers to a delivery and usage mode of IT infrastructure: obtaining the required resources over a network in an on-demand, easily scalable manner. In a broader sense, cloud computing refers to a delivery and usage mode of services: obtaining the required services over a network in an on-demand, easily scalable manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computing and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing.
With the diversification of the internet, real-time data streams, and connected devices, and driven by demands such as search services, social networks, mobile commerce, and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing will conceptually drive revolutionary change in the entire internet model and enterprise management model.
Some embodiments of this application use computer vision to identify whether a specific target is present in a captured image. Computer Vision (CV) is a science that studies how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and further performing image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision research attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Some embodiments of this application also use a neural network for image feature extraction, attribute prediction, and so on, which belongs to machine learning. Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
In one embodiment, as shown in fig. 1, a click-to-read position identification method is provided, which includes steps S110 to S140.
Step S110: acquire the image to be recognized in the click-to-read area.
The click-to-read area is the image capture area of the click-to-read device, generally the range covered by the device's image capture module. For example, in one embodiment, a movable click-to-read device is placed on a desktop, and the coverage area its image capture module is aimed at is the click-to-read area. In this embodiment, the image captured in the click-to-read area is denoted the image to be recognized.
In one embodiment, when the method is applied to a system of a server and a click-to-read device, acquiring the image to be recognized means the server obtains, from the click-to-read device, the image captured by the device's image capture module; when the method is applied to the click-to-read device itself, it means the device obtains the image captured by its own image capture module.
Further, in one embodiment, while the click-to-read device is running, an image of the click-to-read area is captured at preset time intervals, where the preset interval may be set to any duration according to the actual situation.
Step S120: perform target detection on the image to be recognized to obtain a target detection result.
Target detection, also called target extraction, is image segmentation based on the geometric and statistical features of a target. In this embodiment, after the image to be recognized is acquired, target detection is performed on it; specifically, the target detection includes detecting whether the image contains a first preset target and/or a second preset target.
In one embodiment, target detection of the first and second preset targets on the image to be recognized can be implemented in any of several ways. For example, in one embodiment, two target detection models are trained, one for detecting the first preset target and the other for the second. In another embodiment, a single model comprising a feature extraction part and an attribute prediction part is trained, where the attribute prediction branch that detects whether a preset target is present has two channels, one for each preset target. Other implementations are possible in other embodiments.
Step S130: if it is determined from the target detection result that the image to be recognized contains the first preset target or the second preset target, output the specific position information in the first preset target or the second preset target, where the specific position represents a designated target position within the preset target.
Step S140: if it is determined from the target detection result that the image to be recognized contains both the first preset target and the second preset target, output the specific position information in the preset target with the higher priority, according to the preset priorities of the two targets.
Whether the image to be recognized contains the first or second preset target is determined from the target detection result. If only one preset target is detected, the specific position information in that preset target is output: if the image contains only the first preset target, the specific position information in the first preset target is output, and if it contains only the second preset target, the specific position information in the second preset target is output.
If the target detection result shows that the image contains both the first and second preset targets, the specific position information in the preset target with the higher priority is output, according to the preset priorities of the two targets. The priorities of the first and second preset targets can be set in advance according to the actual situation.
The specific position in a preset target is a position set in the preset target in advance. In practice, when users face text they cannot understand, they may point a fingertip, a pen tip, or the like at the position to be read, i.e., the position in contact with the text to be recognized; when a click-to-read device identifies the click-to-read position, it therefore needs to identify the position of the fingertip, pen tip, or the like. In one embodiment, the specific positions of the first and second preset targets may be set according to the actual situation; for example, the first preset target is a palm and its specific position is the fingertip, and the second preset target is a pen and its specific position is the pen tip.
In another embodiment, if it is determined from the target detection result that neither the first nor the second preset target is detected in the image to be recognized, the user is probably not performing a click-to-read operation at the moment, and no position information is output.
In one embodiment, the first preset target represents a user's hand (such as the arm or palm portion), the second preset target represents a pen (such as a marker pen, a capacitive stylus, or any other pen), and the priority of the second preset target is higher than that of the first. If only the user's hand is detected in the image to be recognized, the specific position information in the hand is output; if only a pen is detected, the position information of the pen is output; and if both the hand and the pen are detected, the specific position information in the pen (the second preset target) is output according to the priorities. In other embodiments, the preset targets may be other targets. Taking the hand as the first preset target and the pen as the second, a user writing with a pen can point at the text content of interest with either a fingertip or the pen tip. Since the method supports click-to-read with both the hand and the pen, the user does not need to deliberately choose one fixed mode but can switch freely between fingertip and pen tip, which reduces the tedium of a fixed click-to-read mode, lowers the difficulty of operation, and improves the user experience.
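The priority rule can be sketched in a few lines. The following Python fragment is illustrative only; the target names, the priority table, and the tuple layout are assumptions, not taken from the patent:

```python
# Illustrative sketch of the priority rule described above (not patent text).
# Each detection is (target_name, specific_position, confidence); the pen is
# assumed to outrank the hand, as in the embodiment above.

PRIORITY = {"pen": 1, "hand": 0}  # higher value = higher priority (assumed)

def select_read_position(detections):
    """Return the specific position of the highest-priority detected target,
    or None when no preset target was detected (no click-to-read operation)."""
    if not detections:
        return None
    best = max(detections, key=lambda d: PRIORITY[d[0]])
    return best[1]  # e.g. fingertip or pen-tip coordinates

# Example: both a hand and a pen are detected; the pen tip wins.
print(select_read_position([("hand", (120, 121), 0.93), ("pen", (88, 64), 0.91)]))
```

Swapping the priority values in this sketch would make the fingertip win instead, which is what "set in advance according to the actual situation" allows.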
Further, in one embodiment, when the position information of a preset target is output, the position information of the contact position between the preset target and the text content to be recognized is output. For example, in a specific embodiment, the user points at the text content with a fingertip, so the fingertip position is the position corresponding to the text content to be output, and the fingertip position information is output; likewise, when the user points at the text content with a pen tip, the pen tip position information is output.
In the click-to-read position identification method, an image to be recognized is acquired in the click-to-read area and target detection covering a first preset target and a second preset target is performed on it. If the target detection result contains only one preset target, the position of the specific position in that preset target is output; if it contains both, the specific position information in the preset target with the higher priority is output and determined as the click-to-read position. The method determines the position the user is pointing at by performing target detection on the image, enables click-to-read without a dedicated click-to-read pen, and supports click-to-read positions indicated by either of two targets, such as a fingertip or a pen tip, thereby enriching the supported click-to-read modes.
In one embodiment, performing target detection on the image to be recognized to obtain the target detection result includes: extracting image features of the image to be recognized; inputting the image features into two or more attribute prediction branches, and obtaining the attribute prediction result output by each branch; where the attribute prediction branches include a center position prediction branch for the preset targets and a specific position prediction branch for the preset targets.
The features of an image can be divided into two levels: low-level visual features and high-level semantic features. Low-level visual features include texture, color, and shape; semantic features describe relationships between things. Texture feature extraction algorithms include the gray-level co-occurrence matrix method and the Fourier power spectrum method; color feature extraction algorithms include the histogram method, the cumulative histogram method, and color clustering; shape feature extraction algorithms include spatial moment features and the like; semantic feature representations include semantic networks, mathematical logic, and frames. In other embodiments, image feature extraction may also be accomplished via a neural network.
In one embodiment, extracting the image features of the image to be recognized comprises extracting them through a feature extraction network, which consists of consecutive downsampling layers followed by consecutive upsampling layers; the number of downsampling layers is a first preset number, the number of upsampling layers is a second preset number, and the second preset number is less than or equal to the first.
In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps and, in some cases, improving interpretability. In this embodiment, extracting the image features of the image to be recognized may be implemented in any suitable manner.
Downsampling, also called subsampling, has two main purposes: 1. making the image fit the size of the display area; 2. generating a thumbnail of the corresponding image. The principle: for an image I of size M × N, downsampling by a factor of s yields an image of resolution (M/s) × (N/s), where s should be a common divisor of M and N. Viewing the image as a matrix, each s × s window of the original image becomes one pixel whose value is the average of all pixels in the window. In a specific embodiment, the image to be recognized is downsampled using consecutive downsampling layers, each with a downsampling factor of 2.
Upsampling, also called image interpolation, mainly aims to enlarge the original image so that it can be displayed on a higher-resolution device. The principle: image enlargement almost always uses interpolation, i.e., inserting new elements between the original pixels using a suitable interpolation algorithm. Common methods include traditional image interpolation, edge-based interpolation, and region-based interpolation. In one embodiment, the smallest downsampled image feature is upsampled using consecutive upsampling layers, each with a sampling factor of 2, i.e., the feature output by an upsampling layer is twice the size of its input.
Further, in one embodiment, extracting the image features through the feature extraction network includes: downsampling the image to be recognized layer by layer to obtain downsampled image features at different scales, the number of downsampling layers being a first preset number; and upsampling the downsampled features layer by layer through a feature pyramid network to obtain the image features of the image to be recognized, the number of upsampling layers being a second preset number, less than or equal to the first. In one embodiment, there are 5 downsampling layers and 3 upsampling layers, so that the size of the output image feature is 1/4 of the input image to be recognized.
In one embodiment, before the image to be recognized is input into the feature extraction network, its size is adjusted to a preset size, which may be set to any value according to the actual situation, for example 320 × 320. The resizing may be implemented in any of several ways.
Further, in one embodiment, upsampling the downsampled image features layer by layer through the feature pyramid network comprises: taking the smallest downsampled image feature as the initial upsampled feature and inputting it into the first upsampling layer to obtain the first upsampled image feature; concatenating the first upsampled feature with the downsampled feature of the same size and inputting the result into the second upsampling layer to obtain the second upsampled image feature; and concatenating the second upsampled feature with the downsampled feature of the same size and inputting the result into the third upsampling layer to obtain the third upsampled image feature, which is the image feature of the image to be recognized.
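As an illustration of this backbone, here is a minimal PyTorch sketch with five stride-2 downsampling layers and three 2x upsampling layers whose outputs are fused with the same-size downsampled features. The channel widths, activations, and exact fusion order are assumptions; the description only fixes the layer counts and the 1/4 output size:

```python
# A hypothetical sketch of the described feature extractor: five stride-2
# downsampling layers, then three 2x upsampling steps, each upsampled feature
# concatenated with the same-size downsampled (skip) feature before a conv.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFeatureExtractor(nn.Module):
    def __init__(self, n_channels=64):
        super().__init__()
        chans = [3, 16, 32, 64, 96, 128]  # assumed widths
        self.down = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1)
            for i in range(5))
        # Up layers fuse (upsampled + skip) channels back to n_channels.
        self.up = nn.ModuleList([
            nn.Conv2d(128 + 96, n_channels, 3, padding=1),
            nn.Conv2d(n_channels + 64, n_channels, 3, padding=1),
            nn.Conv2d(n_channels + 32, n_channels, 3, padding=1),
        ])

    def forward(self, x):                 # x: (B, 3, 320, 320)
        skips = []
        for layer in self.down:
            x = F.relu(layer(x))
            skips.append(x)               # strides 2, 4, 8, 16, 32
        feat = skips[-1]                  # smallest (1/32) feature
        for i, layer in enumerate(self.up):
            feat = F.interpolate(feat, scale_factor=2, mode="nearest")
            feat = torch.cat([feat, skips[-2 - i]], dim=1)
            feat = F.relu(layer(feat))
        return feat                       # (B, n_channels, 80, 80) = 1/4 size

feat = FPNFeatureExtractor()(torch.randn(1, 3, 320, 320))
assert feat.shape[-2:] == (80, 80)
```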
In another embodiment, the feature extraction network has N feature channels; N can be adjusted according to the device's requirements on model performance to balance accuracy against running speed.
In this embodiment, image features are extracted by a trained neural network; through training, more accurate image features can be extracted for target detection.
In one embodiment, after the image features are extracted, they are used to detect whether a preset target is present. Attribute prediction is performed on the extracted image features through attribute prediction branches, which include at least a center position prediction branch for the preset targets and a specific position prediction branch for the preset targets. The center position prediction branch outputs the position information of the center positions of the preset targets, and the specific position prediction branch outputs the specific position information in the preset targets; that is, in this embodiment the target detection result includes both the center position information and the specific position information of the preset targets.
In one embodiment, the specific position prediction branch includes a specific position predictor sub-branch and a specific position drift predictor sub-branch. As shown in fig. 2, inputting the image features into the two or more attribute prediction branches and obtaining the attribute prediction result output by each branch comprises steps S210 to S230.
Step S210: input the image features into the center position prediction branch of the preset targets, obtaining a thermodynamic diagram corresponding to the center position of the first preset target and a thermodynamic diagram corresponding to the center position of the second preset target.
A thermodynamic diagram (heatmap) reflects the data in a two-dimensional matrix or table as color changes, visually representing the magnitude of values by a defined shade of color. In this embodiment, after the center position of a preset target is obtained, it is converted into heatmap form. In one embodiment, the heatmap is specifically a Gaussian heatmap: the two-dimensional coordinate is Gaussian-transformed to obtain the corresponding Gaussian heatmap. The conversion of a coordinate into a Gaussian heatmap can be implemented in any suitable manner.
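For illustration, a coordinate can be rendered as a Gaussian heatmap as in the following sketch; the heatmap size and the sigma value are assumptions:

```python
# A sketch of converting a 2-D coordinate into a Gaussian heatmap, the format
# assumed for the ground-truth center/keypoint maps (sigma is illustrative).
import numpy as np

def gaussian_heatmap(size, center, sigma=2.0):
    """Render an (H, W) heatmap with a Gaussian peak of 1.0 at center=(x, y)."""
    h, w = size
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap((80, 80), (30, 30))
print(hm[30, 30], hm[30, 33])  # 1.0 at the peak, decaying away from it
```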
The center position of a preset target is the center of the bounding box generated for the preset target when it is detected. The bounding box is a box, usually rectangular, that contains the preset target: if an arm is detected, the bounding box contains the arm; if a pen is detected, the bounding box contains the pen.
In one embodiment, the center position prediction branch has two channels, one outputting the center position thermodynamic diagram of the first preset target and the other that of the second preset target. In one embodiment, the branch consists of a 3 × 3 convolutional layer and a 1 × 1 convolutional layer: the 3 × 3 convolution performs attribute learning, and the 1 × 1 convolution adjusts the number of channels to the dimension of the target attribute.
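A hypothetical PyTorch sketch of this repeated head pattern (the hidden width is an assumption; only the 3 × 3 followed by 1 × 1 structure comes from the description):

```python
# A sketch of the repeated head structure: a 3x3 convolution for attribute
# learning, then a 1x1 convolution mapping to the attribute's channel count
# (two channels here: one center heatmap per preset target).
import torch.nn as nn

def make_head(in_channels, out_channels, hidden=64):
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, out_channels, kernel_size=1),
    )

center_head = make_head(64, 2)  # channel 0: hand center, channel 1: pen center
```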
Step S220: input the image features into the specific position predictor sub-branch, obtaining a thermodynamic diagram corresponding to the specific position in the first preset target and one corresponding to the specific position in the second preset target.
As with the center position prediction branch, in this step the specific position in each preset target is also converted into a thermodynamic diagram.
In one embodiment, the specific position predictor sub-branch has two channels, one outputting the specific position thermodynamic diagram of the first preset target and the other that of the second preset target. In one embodiment, the sub-branch consists of a 3 × 3 convolutional layer for attribute learning and a 1 × 1 convolutional layer that adjusts the number of channels to the dimension of the target attribute.
Step S230: input the image features into the specific position drift predictor sub-branch, obtaining the drift of the specific position introduced when the image to be recognized is converted into the thermodynamic diagram.
Since a thermodynamic diagram can only represent integer pixel coordinates, a conversion error arises when a coordinate position in the original image is converted into its thermodynamic diagram representation; in this embodiment, the specific position drift predictor sub-branch therefore predicts the float-to-int conversion error of the coordinate position. In a specific embodiment, the image feature input to the sub-branch is 1/4 of the size of the image to be recognized; if a pixel coordinate in the image to be recognized is (120, 121), the corresponding feature coordinate is (30, 30.25), which is represented in the thermodynamic diagram as the integer coordinate (30, 30), so the position drift in this example is (0, 0.25).
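The worked example can be reproduced directly; the stride of 4 follows from the 1/4 feature size stated above:

```python
# Reproducing the worked drift example: original pixel (120, 121), stride 4.
stride = 4
px, py = 120, 121
fx, fy = px / stride, py / stride   # feature coordinate (30.0, 30.25)
ix, iy = int(fx), int(fy)           # integer heatmap coordinate (30, 30)
drift = (fx - ix, fy - iy)          # learned by the drift sub-branch: (0.0, 0.25)
print((ix, iy), drift)
```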
In one embodiment, the specific position drift predictor sub-branch has four channels: when the image to be recognized contains both the first and second preset targets, the horizontal and vertical drift of the specific position of each target must be output, so the four channels output the position drift of the abscissa and ordinate of the specific positions of the two preset targets, respectively. In one embodiment, the sub-branch consists of a 3 × 3 convolutional layer for attribute learning and a 1 × 1 convolutional layer that adjusts the number of channels to the dimension of the target attribute.
In this embodiment, three attribute prediction branches respectively predict the center positions of the first and second preset targets, their specific positions, and the position drift of those specific positions. From the output of the center position prediction branch it is determined whether the first and/or second preset target is detected: if only the first preset target is detected, the first preset target is the one to output; if only the second is detected, the second is the one to output; and if both are detected, the preset target with the higher priority is the one to output. The specific position within the target to output is then determined from the outputs of the specific position predictor sub-branch and the specific position drift predictor sub-branch.
In one embodiment, as shown in fig. 3, after target detection is performed on the image to be recognized and a target detection result is obtained, steps S310 to S330 follow.
Step S310: traverse the thermodynamic diagrams output by the center position prediction branch and determine the predicted position confidence of the center position of the first preset target and that of the second preset target.
In one embodiment, the thermodynamic diagrams for the two preset target center positions are obtained by traversing each channel of the center position prediction branch. Any one channel may output several predicted positions, so the confidence corresponding to each predicted position is read, i.e., in this embodiment, the predicted position confidence of the center position of the first preset target and that of the second. Confidence, also called the confidence level or confidence coefficient, is the probability that an estimate lies within a given allowable error range of the true value; the higher the confidence, the more likely the value is close to the correct one.
Step S320: if the maximum predicted position confidence of the center position of the first preset target is greater than or equal to a preset threshold, determine that the image to be recognized contains the first preset target.
Step S330: if the maximum predicted position confidence of the center position of the second preset target is greater than or equal to the preset threshold, determine that the image to be recognized contains the second preset target.
In this embodiment, a threshold is set in advance and compared with the confidence; if the confidence of a center position is greater than or equal to the threshold, the corresponding preset target is detected in the image to be recognized. Conversely, if the maximum confidence for the first preset target's center position is below the threshold, the image is determined not to contain the first preset target, and likewise for the second. The threshold may be set according to the actual situation, for example to 80% or 90%.
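A minimal sketch of this presence test, assuming the heatmaps arrive as a (2, H, W) array and an illustrative threshold of 0.8:

```python
# A sketch of steps S310-S330: take the maximum confidence of each channel of
# the predicted center heatmap and compare it against the preset threshold.
import numpy as np

def detect_targets(center_heatmap, threshold=0.8):
    """center_heatmap: (2, H, W); channel 0 = first target, 1 = second."""
    return [center_heatmap[ch].max() >= threshold
            for ch in range(center_heatmap.shape[0])]

hm = np.zeros((2, 80, 80))
hm[0, 30, 30] = 0.95
print(detect_targets(hm))  # [True, False]: only the first preset target
```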
In this embodiment, comparing the confidences in the output of the center position prediction branch against the preset threshold to decide whether the image to be recognized contains the first and/or second preset target improves the accuracy of target detection.
In one embodiment, the method further comprises: if it is determined from the target detection result that the image to be recognized contains the first and/or second preset target, read the specific position from the corresponding specific position thermodynamic diagram, and restore it, using the specific position drift, to the original coordinate system of the image to be recognized, obtaining the specific position information of the preset target.
In this embodiment, once the image is determined to contain the first and/or second preset target, the coordinate information of the corresponding specific position must be output: the coordinate of the specific position in the thermodynamic diagram is obtained from the specific position prediction, the drift introduced by the conversion to the thermodynamic diagram is obtained from the drift prediction, and the thermodynamic diagram coordinate is then restored to coordinate information at the same size as the image to be recognized. This determines where the specific position lies in the image to be recognized, so that the text content at that position can subsequently be recognized, achieving the click-to-read purpose.
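Under the same assumptions as the drift example above (stride 4), the restoration step can be sketched as:

```python
# Restoring a specific position from heatmap coordinates to the original image
# coordinate system: add the predicted drift back, then undo the 1/4 scaling.
def restore_position(heatmap_xy, drift_xy, stride=4):
    hx, hy = heatmap_xy
    dx, dy = drift_xy
    return ((hx + dx) * stride, (hy + dy) * stride)

# The example from the drift discussion: heatmap (30, 30) with drift (0, 0.25)
# maps back to the original pixel (120.0, 121.0).
print(restore_position((30, 30), (0.0, 0.25)))
```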
In one embodiment, the attribute prediction branches further include a size prediction branch for the preset targets and a branch predicting the distance between a preset target's center position and its specific position; when the feature extraction network and the attribute prediction branches are trained, their parameters are adjusted based on the sample prediction results output by each branch.
In one embodiment, the size prediction branch outputs the size of a preset target's bounding box, specifically its width and height. The center-to-specific-position distance branch outputs the distance between a preset target's center position and its specific position, for example the distance between the fingertip position and the center of the arm bounding box, or between the pen tip position and the center of the pen bounding box.
Training the neural network may be accomplished in any of several ways. The feature extraction network and the attribute prediction branches are trained on sample data carrying annotations: the sample data is input into a preset neural network framework (the feature extraction network plus the attribute prediction branches), which outputs the sample prediction results of the five attribute prediction branches, namely the thermodynamic diagram of the preset target center positions, the thermodynamic diagram of the specific positions, the specific position drift, the bounding box sizes, and the distances between the specific positions and the center positions for the sample image. The parameters of the feature extraction network and each branch are then adjusted based on these sample prediction results, and training stops when a termination condition is reached, yielding the trained network comprising the feature extraction network and the attribute prediction branches.
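The patent does not name the loss functions, so the following training-step sketch assumes a CenterNet-style combination: pixel-wise heatmap supervision for the Center and Keypoints branches and L1 regression for the size, relationship, and offset branches. The branch weights are likewise assumptions:

```python
# A hedged sketch of the multi-branch training objective; all loss choices
# and weights are assumptions made for illustration, not patent text.
import torch.nn.functional as F

def training_loss(pred, target):
    """pred/target: dicts holding the five branch outputs described above."""
    loss_center = F.binary_cross_entropy_with_logits(
        pred["center"], target["center"])          # center heatmap supervision
    loss_kp = F.binary_cross_entropy_with_logits(
        pred["keypoints"], target["keypoints"])    # tip heatmap supervision
    loss_wh = F.l1_loss(pred["wh"], target["wh"])  # bounding-box size branch
    loss_rel = F.l1_loss(pred["relationship"], target["relationship"])
    loss_off = F.l1_loss(pred["offset"], target["offset"])
    # All five branches supervise the shared backbone during training, even
    # though only three are consulted at inference time.
    return loss_center + loss_kp + 0.1 * loss_wh + 0.1 * loss_rel + loss_off
```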
In one embodiment, the sample data for training the neural network comprises at least: sample images containing only the first preset target, sample images containing only the second preset target, and sample images containing both.
In one embodiment, the size prediction branch and the center-to-specific-position distance branch have similar structures, each consisting of a 3 × 3 convolutional layer for attribute learning and a 1 × 1 convolutional layer that adjusts the number of channels to the dimension of the target attribute.
In this embodiment, training combines the outputs of the size prediction branch and the distance prediction branch; these are effective supervision signals during the training stage and benefit the training of the whole task.
In one embodiment, feature extraction on the image to be recognized is performed by a feature extraction network determined in advance through training, and attribute prediction on the image features is performed by attribute prediction branch networks determined in advance through training. In one embodiment, the feature extraction network is MobileNetV2, and the attribute prediction branches use convolutional networks for the different attribute predictions. In other embodiments, other networks may be used for both.
This application also provides an application scenario that applies the click-to-read position identification method.
Specifically, the method is applied in this scenario as follows:
In this embodiment, the feature extraction network and the output attribute prediction branches are collectively referred to as the click-to-read position identification model. Fig. 4 shows the network structure of the model in one embodiment.
1. Acquire the image to be recognized in the click-to-read area and resize it to 320 × 320 as the model input.
2. The click-to-read position identification model consists of two parts: a) a backbone feature extraction part (Backbone Feature), and b) an attribute prediction part (Attribute Predict).
a) The backbone feature extraction part uses MobileNetV2 as the backbone, then upsamples the last three layers in FPN (feature pyramid network) fashion and fuses the features to obtain the image features. Their size is 1/4 of the 320 × 320 input (80 × 80), and the feature channel count is N, which can be adjusted according to the device's requirements on model performance to balance accuracy and running speed.
b) The attribute prediction part predicts the different attributes with convolutional networks, takes the two kinds of preset targets, hand and pen, as the main bodies, and simultaneously regresses the fingertip and pen tip coordinates. From the feature map extracted in a), each attribute branch performs attribute learning through a 3 × 3 convolution and adjusts the number of channels to the dimension of the target attribute through a 1 × 1 convolution. The attributes are as follows (a sketch assembling these heads follows the list):
Center (N = 2, i.e., 2 channels): predicts the thermodynamic diagram (heatmap) of the preset target center point positions. For the labeled bounding boxes of the hand and the pen, the center point coordinates are computed and converted into Gaussian heatmap form centered on those coordinates; fig. 5(1) shows the center position of the preset target (hand) in a specific embodiment, and fig. 5(2) the corresponding thermodynamic diagram.
W/H (N = 2, i.e., 2 channels): the width and height of the bounding box of the hand or pen; each point regresses the width and height of the preset target, if present.
Relationship (N = 4, i.e., 4 channels): the coordinate difference between the fingertip or pen tip and the center point, i.e., the coordinate of the fingertip or pen tip relative to the center; it is used to constrain the relative relationship between the target point and the center point.
Keypoints (N = 2, i.e., 2 channels): the coordinate thermodynamic diagram of the fingertip or pen tip (generated the same way as Center); fig. 6(1) shows the specific position of the preset target (fingertip) in one embodiment, and fig. 6(2) the corresponding thermodynamic diagram.
Offset (N = 4, i.e., 4 channels): the drift of the fingertip or pen tip coordinate point. Downsampling the original image coordinates by 1/4 yields floating-point values, but the thermodynamic diagram can only represent integer coordinates; Offset learns this float-to-int conversion error.
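As promised above, here is a hypothetical sketch assembling these five heads on the shared backbone feature map, with the channel counts listed (Center 2, W/H 2, Relationship 4, Keypoints 2, Offset 4); the hidden width and the backbone interface are assumptions:

```python
# A sketch of the five attribute heads on the shared backbone features.
import torch.nn as nn

def _head(cin, cout, hidden=64):
    # Same 3x3 -> 1x1 head pattern sketched earlier.
    return nn.Sequential(
        nn.Conv2d(cin, hidden, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, cout, 1),
    )

class ClickReadModel(nn.Module):
    def __init__(self, backbone, n_channels=64):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            "center": _head(n_channels, 2),        # hand / pen center heatmaps
            "wh": _head(n_channels, 2),            # bounding-box width and height
            "relationship": _head(n_channels, 4),  # tip-to-center coordinate difference
            "keypoints": _head(n_channels, 2),     # fingertip / pen-tip heatmaps
            "offset": _head(n_channels, 4),        # float-to-int drift, x and y per target
        })

    def forward(self, x):
        feat = self.backbone(x)  # e.g. (B, n_channels, 80, 80)
        return {name: head(feat) for name, head in self.heads.items()}
```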
For training the click-to-read position identification model, two kinds of data are collected: a) standalone fingertip gesture data containing no pen, and b) pen-holding data containing the pen, the pen-holding hand, and a non-pen-holding hand. When the data is organized for training, target detection uses both a) and b) simultaneously, and thermodynamic diagram training targets absent from a sample are supplemented with empty thermodynamic diagrams, strengthening the semantic distinction among the three different gestures: fingertip gesture, ordinary gesture, and pen-holding gesture.
3. After the model is trained and integrated into an SDK (software development kit), the click-to-read coordinates are solved as follows; only the three branches Center, Keypoints, and Offset are used at inference time. The steps are as follows (a code sketch follows the steps):
a) Traverse the two channels of Center, where channel 0 represents the hand target and channel 1 represents the pen target, and find the position of the maximum value of each channel. If that value is greater than a set threshold A, a valid hand or pen exists at that position; otherwise there is no preset target in the current frame.
b) Traverse the two channels of Keypoints, where channel 0 represents the fingertip position and channel 1 represents the pen-tip position, and find the position of the maximum value of each channel. If that value is greater than a set threshold B, a valid fingertip or pen tip exists at that position; otherwise there is no specific position of a preset target in the current frame.
c) Take the Offset value at the coordinate position found in b) and use it to restore that coordinate to the original image coordinates.
d) According to the results of a) and b): if only a pen tip exists, return the pen-tip position coordinates; if only a fingertip exists, return the fingertip position coordinates; if both exist, return the position coordinates of the pen tip, which has the higher priority.
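The following sketch decodes a point-reading coordinate from the three inference-time branches, following steps a) through d). The tensor layout, the threshold values A and B, the offset channel assignment, and the stride are assumptions; only the overall procedure comes from the text.

```python
# Minimal decode following steps a)-d) above; layouts and thresholds assumed.
import torch

def decode(center, keypoint, offset, thr_a=0.3, thr_b=0.3, stride=4):
    """center, keypoint: (2, H, W) heatmaps; offset: (4, H, W)."""
    result = {}
    for ch, name in [(0, "fingertip"), (1, "pen_tip")]:
        # a) / b): peak of each channel, kept only above its threshold.
        if center[ch].max() < thr_a or keypoint[ch].max() < thr_b:
            continue
        idx = keypoint[ch].flatten().argmax()
        y, x = divmod(idx.item(), keypoint.shape[2])
        # c): add the learned sub-pixel drift, then restore original coords.
        dx = offset[2 * ch, y, x].item()
        dy = offset[2 * ch + 1, y, x].item()
        result[name] = ((x + dx) * stride, (y + dy) * stride)
    # d): the pen tip has priority when both targets are present.
    if "pen_tip" in result:
        return "pen_tip", result["pen_tip"]
    if "fingertip" in result:
        return "fingertip", result["fingertip"]
    return None, None
```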
Further, after the coordinate position of the reading target is output as in the above embodiment, the corresponding text content is recognized according to that coordinate position, achieving the purpose of point reading. Fig. 7(1) and fig. 7(2) are schematic diagrams of the reading device recognizing and outputting the corresponding text content according to the coordinate position of the reading target.
In this click-to-read position identification method, a multi-task deep learning model detects the hand and the pen simultaneously, with target detection as the main body, while regressing the fingertip and pen-tip coordinates, so fingertip and pen-tip recognition are fused in a single model; this better resolves the semantic confusion between ordinary gestures and click-to-read gestures. Supporting both fingertip and pen-tip recognition, with the pen tip given the higher priority, means the pen-tip coordinates are returned preferentially when both are present. A seamless interaction thus becomes possible in the click-to-read scene: the user switches between writing with a pen and pointing to read without any interrupting operation, greatly improving learning efficiency and the immersive experience. Without modifying the camera-based reading device, the method provides a point-reading interaction mode that unifies pen tip and fingertip, realizing both fingertip reading and pen-tip reading and improving the user experience.
In one embodiment, the present application further provides a point reading method, including: acquiring an image to be recognized in the point-reading area, recognizing the corresponding text content according to the specific position information of a preset target in the image to be recognized, and displaying the text content on a display screen. Recognizing the text content corresponding to the specific position information of the preset target includes: performing target detection on the image to be recognized to obtain a target detection result; if the image to be recognized is determined to contain a first preset target or a second preset target according to the target detection result, outputting the specific position information in that preset target, where the specific position represents a specified target position within the preset target; and if the image to be recognized contains both the first preset target and the second preset target according to the target detection result, outputting the specific position information in the preset target with the higher priority, according to the preset priorities of the first and second preset targets.
For a specific embodiment of the above point reading method, reference may be made to the above embodiment of the point-reading position identification method, which is not repeated here.
It should be understood that, although the steps in the flowcharts involved in the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in each flowchart may comprise multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turns or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, a reading position recognition apparatus is provided, which may be implemented as a software module, a hardware module, or a combination of the two as part of a reading device. The apparatus specifically includes an image acquisition module 810, a target detection module 820, and a position information output module 830, wherein:
the image acquisition module 810 is configured to acquire an image to be identified in the click-to-read region;
the target detection module 820 is used for performing target detection on the image to be identified to obtain a target detection result;
the position information output module 830 is configured to output the specific position information in the first preset target or the second preset target if the image to be recognized is determined to contain the first or the second preset target according to the target detection result, where the specific position represents a specified target position in the preset target; and to output, if the image to be recognized contains both the first and the second preset target according to the target detection result, the specific position information in the preset target with the higher priority, according to the preset priorities of the first and second preset targets.
This click-to-read position recognition device acquires an image to be recognized in the click-to-read area and performs target detection on it, covering a first preset target and a second preset target. If the target detection result contains only one preset target, the specific position within that target is output; if it contains both, the specific position information within the higher-priority target is output and determined as the click-to-read position. By detecting targets in the image, the device determines the position the user is pointing at, enables point reading without a dedicated stylus, and supports point-reading positions indicated by either target, such as pointing with a finger or with a pen tip, enriching the supported point-reading modes.
In one embodiment, the target detection module 820 of the above apparatus comprises: a feature extraction unit for extracting the image features of the image to be recognized; and an attribute prediction unit for respectively inputting the image features into two or more attribute prediction branches and obtaining the attribute prediction result output by each branch, wherein the attribute prediction branches include a central position prediction branch of the preset target and a specific position prediction branch within the preset target.
In an embodiment, the feature extraction unit of the apparatus is further configured to: extracting image features of an image to be identified through a feature extraction network; the feature extraction network comprises a continuous down-sampling layer and a continuous up-sampling layer; the number of the down-sampling layers is a first preset number, the number of the up-sampling layers is a second preset number, and the second preset number is smaller than or equal to the first preset number.
In one embodiment, the specific position prediction branch includes a specific position prediction sub-branch and a specific position drift prediction sub-branch. In this embodiment, the attribute prediction unit of the apparatus includes: a central position heatmap prediction subunit for inputting the image features into the central position prediction branch of the preset target to obtain the heatmap corresponding to the center position of the first preset target and the heatmap corresponding to the center position of the second preset target; a specific position heatmap prediction subunit for inputting the image features into the specific position prediction sub-branch to obtain the heatmap corresponding to the specific position in the first preset target and the heatmap corresponding to the specific position in the second preset target; and a specific position drift prediction subunit for inputting the image features into the specific position drift prediction sub-branch to obtain the specific position drift arising when the image to be recognized is converted into a heatmap.
In one embodiment, the apparatus further includes a confidence reading module, configured to traverse thermodynamic diagrams corresponding to respective center positions output by the center position prediction branches of the preset target, and determine a predicted position confidence of a first preset target center position and a predicted position confidence of a second preset target center position; in this embodiment, the target detection module 820 is further configured to: if the maximum value of the confidence of the predicted position of the central position of the first preset target is greater than or equal to a preset threshold value, determining that the image to be recognized contains the first preset target; and if the maximum value of the confidence of the predicted position of the central position of the second preset target is greater than or equal to the preset threshold, determining that the image to be recognized contains the second preset target.
In one embodiment, the above apparatus further comprises a position restoring module, configured to, if the image to be recognized contains the first preset target and/or the second preset target according to the target detection result, read the specific position from the heatmap corresponding to the specific position in the preset target and restore it, according to the specific position drift, to the same original coordinates as the image to be recognized, obtaining the specific position information corresponding to the preset target.
In one embodiment, the attribute prediction branches further comprise: a size prediction branch of the preset target and a prediction branch for the distance between the central position and the specific position of the preset target. The apparatus further comprises a model training module configured to adjust, when training the feature extraction network and each attribute prediction branch, the parameters of the feature extraction network and each attribute prediction branch based on the sample prediction results output by each attribute prediction branch.
For a specific embodiment of the reading position identification apparatus, reference may be made to the above embodiment of the reading position identification method, which is not repeated here. Each module in the above reading position identification device may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the reading device in hardware form, or stored in a memory of the reading device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a point-reading device is provided, the internal structure of which may be as shown in fig. 9. The point-reading device comprises a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the point-reading device is configured to provide computing and control capabilities. The memory of the point-reading device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The communication interface of the point-reading device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a click-to-read position identification method. The display screen of the point-reading device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, trackball or touchpad on the device housing, an image acquisition device (camera) arranged on the housing, or an external keyboard, touchpad or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 9 is a block diagram of only the portion of the structure relevant to the present application and does not limit the point-reading device to which the present application is applied; a particular point-reading device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a point-reading device is further provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the reading device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the reading device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A reading position identification method is characterized by comprising the following steps:
acquiring an image to be identified in a point reading area;
carrying out target detection on the image to be identified to obtain a target detection result;
if the image to be recognized is determined to contain a first preset target or a second preset target according to the target detection result, outputting specific position information in the first preset target or the second preset target, wherein the specific position represents a specified target position in the preset targets;
and if the image to be recognized contains a first preset target and a second preset target according to the target detection result, outputting specific position information in the preset target with higher priority according to the preset priority of the first preset target and the second preset target.
2. The point-reading position identification method according to claim 1, wherein the performing target detection on the image to be identified to obtain a target detection result comprises:
extracting image features of the image to be recognized;
respectively inputting the image features into two or more attribute prediction branches, and obtaining the attribute prediction result output by each attribute prediction branch; wherein the attribute prediction branches comprise: a central position prediction branch of the preset target and a specific position prediction branch within the preset target.
3. The reading position recognition method according to claim 2, wherein the extracting the image feature of the image to be recognized includes:
extracting image features of the image to be identified through a feature extraction network; the feature extraction network comprises a continuous down-sampling layer and a continuous up-sampling layer; the number of the down-sampling layers is a first preset number, the number of the up-sampling layers is a second preset number, and the second preset number is smaller than or equal to the first preset number.
4. The click-to-read position identification method according to claim 2, wherein the specific position prediction branch comprises a specific position prediction subbranch and a specific position drift prediction subbranch;
respectively inputting the image characteristics into more than two attribute prediction branches, and obtaining an attribute prediction result output by each attribute prediction branch, wherein the attribute prediction results comprise:
inputting the image characteristics into a central position prediction branch of the preset target to obtain a thermodynamic diagram corresponding to a first preset target central position and a thermodynamic diagram corresponding to a second preset target central position;
inputting the image characteristics into the specific position predictor branch to obtain a thermodynamic diagram corresponding to a specific position in a first preset target and a thermodynamic diagram corresponding to a specific position in a second preset target;
and inputting the image features into the specific position drift prediction sub-branch to obtain the specific position drift when the image to be recognized is converted into a thermodynamic diagram.
5. The method for recognizing a click-to-read position according to claim 4, wherein after the target detection is performed on the image to be recognized to obtain a target detection result, the method further comprises:
traversing thermodynamic diagrams corresponding to the central positions output by the central position prediction branches of the preset target, and determining the confidence coefficient of the predicted position of the central position of the first preset target and the confidence coefficient of the predicted position of the central position of the second preset target;
if the maximum value of the confidence of the predicted position of the center position of the first preset target is greater than or equal to a preset threshold value, determining that the image to be recognized contains the first preset target; and
and if the maximum value of the confidence of the predicted position of the central position of the second preset target is greater than or equal to a preset threshold value, determining that the image to be recognized contains the second preset target.
6. The click-to-read position identification method according to claim 5, characterized by further comprising:
if the image to be recognized contains the first preset target and/or the second preset target according to the target detection result, reading the specific position from the thermodynamic diagram corresponding to the specific position in the preset target, and restoring, according to the specific position drift, the specific position in the preset target to the same original coordinates as the image to be recognized, to obtain the specific position information corresponding to the preset target.
7. The method according to any one of claims 3 to 6, wherein the attribute prediction branches further comprise: a size prediction branch of the preset target and a prediction branch for the distance between the central position of the preset target and the specific position;
and when the feature extraction network and each attribute prediction branch are trained, adjusting the parameters of the feature extraction network and each attribute prediction branch based on the sample prediction result output by each attribute prediction branch.
8. A click-to-read position identifying apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be identified in the point reading area;
the target detection module is used for carrying out target detection on the image to be identified to obtain a target detection result;
the position information output module is used for outputting specific position information in the first preset target or the second preset target if the image to be recognized is determined to contain the first preset target or the second preset target according to the target detection result, wherein the specific position represents a specified target position in the preset targets; and if the image to be recognized contains a first preset target and a second preset target according to the target detection result, outputting specific position information in the preset target with higher priority according to the preset priority of the first preset target and the second preset target.
9. A point-reading apparatus comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110488678.9A 2021-05-06 2021-05-06 Click-to-read position identification method and device, click-to-read equipment and storage medium Pending CN113762045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110488678.9A CN113762045A (en) 2021-05-06 2021-05-06 Click-to-read position identification method and device, click-to-read equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113762045A 2021-12-07

Family

ID=78786999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110488678.9A Pending CN113762045A (en) 2021-05-06 2021-05-06 Click-to-read position identification method and device, click-to-read equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113762045A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination