CN110287864A - Intelligent identification method for read-write elements in a read-write scene

Intelligent identification method for read-write elements in a read-write scene

Info

Publication number
CN110287864A
CN110287864A
Authority
CN
China
Prior art keywords
read
face
write
coordinate
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910547742.9A
Other languages
Chinese (zh)
Inventor
覃端峰
冯小娇
杜亚蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Firestone Information Technology Co Ltd
Original Assignee
Firestone Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Firestone Information Technology Co Ltd
Priority to CN201910547742.9A
Publication of CN110287864A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition

Abstract

The present invention provides an intelligent identification method for read-write elements in a read-write scene. The steps are as follows: S1, acquire the original image of the read-write scene and the image information of the desktop or books in it; S2, run detection algorithms on the original image of the read-write elements to judge whether a human body and a face are present; S3, obtain the three-dimensional spatial coordinates of the desktop, books and face through coordinate identification; S4, obtain the distance between the eyes and the books by matching the three-dimensional coordinates of the desktop, books and face; S5, judge from the matching of the three-dimensional coordinates of the desktop, books and face whether the posture meets the correct reading and writing standard. The method can effectively judge whether the user's posture and reading distance are correct, and so effectively prevent myopia.

Description

Intelligent identification method for read-write elements in a read-write scene
Technical field
The present invention relates to the field of image identification technology, and in particular to an intelligent identification method for read-write elements in a read-write scene.
Background art
The number of myopic people in China has exceeded 600 million, almost 50% of the country's total population, and the onset of myopia is trending earlier, progressing faster and going further. Survey reports show that poor student eyesight is a prominent problem: the detection rates of poor eyesight among grade-four and grade-eight students are 36.5% and 65.3% respectively. Among grade-four girls the rates of moderate and severe poor eyesight are 18.6% and 10.4% respectively, versus 16.4% and 9% for boys; among grade-eight girls the rates are 24.1% and 39.5%, versus 22.1% and 31.7% for boys.
In fact, the overall eyesight of Chinese teenagers gives no cause for optimism: the data show that the adolescent myopia rate ranks first in the world. Close-range eye use and the uninterrupted use of electronic products such as mobile phones and computers have a great deal to do with this. Judging from the trend of recent years, non-standard learning postures and premature exposure to electronic products are steadily increasing teenagers' risk of myopia, with children wearing glasses at ever younger ages.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method that can automatically identify the read-write elements of the current read-write scene, so as to judge whether the user's read-write posture and read-write distance comply with the standard.
To achieve the above object, the present invention adopts the following technical scheme:
An intelligent identification method for read-write elements in a read-write scene, characterized in that the steps include:
S1: obtain, through a device such as a camera, an infrared sensor or a radar, the original image of the read-write elements in the read-write scene and the image information of the desktop or books, the image information including the frame of the desktop or books, or two to four of its vertices;
S2: run detection algorithms on the original image of the read-write elements to judge whether a human body and a face are present;
S3: obtain the three-dimensional spatial coordinates of the desktop, books and face through coordinate identification;
S4: obtain the distance between the eyes and the books by matching the three-dimensional coordinates of the desktop, books and face;
S5: judge, from the matching of the three-dimensional coordinates of the desktop, books and face, whether the posture meets the correct reading and writing standard.
Further, the concrete step of detecting whether a human body is present with the algorithm is to feed the picture, prepared according to the prescribed requirements, into the trained object detection model, obtaining the frame position of the human body and the confidence that it is a human body.
Further, face detection runs after a person has been found in the picture and checks whether the picture contains a face: the picture is fed, according to fixed requirements, into the trained face detection model, which outputs the positions of five points (the left and right eyes, the nose and the two corners of the mouth) and the confidence that they form a face.
Further, before the three-dimensional spatial coordinates of the desktop, books and face are determined, the intrinsic parameters of the camera must first be obtained. The intrinsic parameters of the camera comprise basic parameters and a distortion coefficient vector: the basic parameters include the principal point of the image optical axis and the focal lengths in the X and Y directions, and the distortion coefficient vector includes the tangential and radial distortion coefficients.
Further, the three-dimensional coordinates of the object are calculated with the pinhole imaging model: a projection centre is set, the principal point is the image principal point through which the optical axis passes, (X, Y, Z) are the object coordinates in the spatial coordinate system, and (x, y, z) are the image pixel coordinates.
Further, using the camera intrinsic parameters, the camera extrinsic parameters and the custom 3D-space coordinate points (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.0, 1.5), i.e. the space origin and points on the space X, Y and Z axes, the one-to-one corresponding 2D image coordinate points are solved; the projectPoints function provided by OpenCV computes the image coordinates corresponding to the known spatial axis points, and the corresponding image coordinate points are finally connected together as the object's spatial coordinate system.
Further, the face detection training places face region detection and face key-point detection together via MTCNN; it is based on a cascaded framework and divides overall into the three-tier network architecture of PNet, RNet and ONet.
Further, the MTCNN feature descriptor mainly comprises three parts: a face/non-face classifier, bounding box regression, and landmark localisation.
The beneficial effects of the intelligent identification method for read-write elements in a read-write scene provided by the invention are: it can effectively obtain the user's posture and distance while reading and issue a reminder when they are incorrect; the double check of reading posture and reading distance achieves a better detection effect, correcting in time and reducing the probability of myopia; accurate three-dimensional pose Euler angles can be obtained with a monocular camera, and comparison of the Euler angles gives an effective judgement with high accuracy, while using only a monocular camera keeps the cost low and the applicability wide; the monocular camera is mounted at a certain angle, which effectively avoids the calculation error caused by slight bowing and shaking of the head.
Description of the drawings
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the calculation schematic diagram of the pinhole imaging model.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art without creative work belong to the protection scope of the present invention.
Embodiment: an intelligent identification method for read-write elements in a read-write scene.
An intelligent identification method for read-write elements in a read-write scene, characterized in that the steps include:
A monocular camera captures pictures and sends them to a processor for storage and analysis; the monocular camera is used to obtain the three-dimensional pose Euler angles relative to the plane of the object to be detected, i.e. the pitch angle, the yaw angle and the roll angle.
First, the intrinsic parameters of the camera are obtained through camera calibration. Then, using the known three-dimensional spatial coordinates of an object, the one-to-one corresponding image pixel coordinates and the camera intrinsic parameters, the extrinsic parameters of the camera relative to the known object, i.e. the rotation vector and the translation vector, are solved. Finally, the rotation vector is analysed to obtain the three-dimensional pose Euler angles of the camera relative to the known object's spatial coordinate system.
Camera calibration: here, the camera's imaging model is taken to be pinhole imaging, but because of the lens itself and the camera manufacturing process, the imaging model cannot output images strictly according to the pinhole model, and the output image inevitably contains distortion. Therefore the camera must be calibrated. The purpose of calibration is to solve for the camera's intrinsic parameters, which include the basic camera parameters (principal point of the image optical axis, focal lengths in the X and Y directions) and the distortion coefficient vector (tangential distortion coefficients, radial distortion coefficients).
The camera is calibrated by the chessboard calibration method. Its basic idea is to photograph the same chessboard calibration plate in a three-dimensional scene from different directions and positions, producing multiple chessboard pictures. Because the corners of each chessboard picture are equally spaced, the three-dimensional spatial coordinates of the chessboard corners are known (the three-dimensional coordinate system being relative to each chessboard object). The pixel coordinates of every chessboard image in the image plane are then computed, giving a one-to-one projection relation between the three-dimensional spatial coordinates of every chessboard and the two-dimensional pixel coordinates of the corresponding image pixel plane, from which the intrinsic parameters of the camera are found.
OpenCV provides the calibrateCamera() function for calibration. This function yields the camera's intrinsic parameters, including the camera fundamental matrix

$$M = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

and the distortion coefficient vector $D = (k_1, k_2, p_1, p_2[, k_3[, k_4, k_5, k_6]])$, where $k_1, k_2$ are radial distortion coefficients and $p_1, p_2$ are tangential distortion coefficients. For an ordinary camera the first four coefficients are generally sufficient, but for a heavily distorting camera such as a fisheye camera, 5 to 8 distortion coefficients should be used.
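A minimal sketch of this calibration flow with OpenCV's Python bindings follows; the board dimensions, square size and image folder are illustrative assumptions, not values given in the patent.

```python
import glob

import cv2
import numpy as np

# Assumed 9x6 inner-corner board with 25 mm squares (illustrative values).
PATTERN = (9, 6)
SQUARE_MM = 25.0

# 3D corner coordinates in the board's own frame (Z = 0 plane), equally spaced.
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):  # hypothetical folder of board photos
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        obj_points.append(objp)
        img_points.append(corners)

# Solve the 3D-2D projection relations for the intrinsic matrix M and
# the distortion vector D = (k1, k2, p1, p2[, k3]).
rms, M, D, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```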
The key algorithm of camera attitude estimation solves the N-point perspective projection problem, also called the PnP (Perspective-n-Point) problem. Here we adopt the pinhole imaging model: O is the projection centre and the principal point $(u_0, v_0)$ is the image principal point through which the optical axis passes. $(X, Y, Z)$ are the object coordinates in the spatial coordinate system, whose reference frame here is the projection centre of the camera, and $(x, y, z)$ are the image pixel coordinates, whose origin is the upper-left corner of the image.
According to the above pinhole projection relation, $x = f \cdot (X/Z)$ and $y = f \cdot (Y/Z)$. Expressing this in matrix form gives

$$s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$$

where $f_x$ is the focal length expressed in horizontal pixels and $f_y$ the focal length expressed in vertical pixels.
When the reference frame is not at the projection centre of the camera, the relation becomes, in homogeneous coordinates,

$$s\,x = M\,[R \mid t]\,X$$

According to this formula $x = M[R \mid t]X$, solving for the parameters $[R \mid t]$ requires knowing the camera fundamental matrix $M$, the known object's three-dimensional spatial coordinate points $X$, and the corresponding image pixel coordinate points $x$.
Euler angle resolution of the three-dimensional pose: introduce the rotation-translation matrix $[R \mid t]$, where $R$ is a 3×3 rotation matrix and $t$ a 3×1 translation vector:

$$[R \mid t] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}$$

With $x$ denoting the matrix of a point in the image pixel plane, $X$ the matrix of the same point in the world coordinate system, and $M$ the camera fundamental matrix, we have $x = M[R \mid t]X$.
The three coordinate-axis rotation matrices below can be expressed as a single rotation matrix:

$$R = R_z(\phi)\,R_y(\theta)\,R_x(\psi)$$

To find the rotation angles $\psi$, $\theta$, $\phi$, the rotation matrix $R$ yields

$$\psi = \operatorname{atan2}(r_{32}, r_{33}), \qquad \theta = \operatorname{atan2}\!\left(-r_{31}, \sqrt{r_{32}^2 + r_{33}^2}\right), \qquad \phi = \operatorname{atan2}(r_{21}, r_{11})$$

This calculation can be expressed with the simple code below.
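A sketch of this decomposition, assuming R is the 3×3 rotation matrix recovered from the rotation vector (zero-based NumPy indexing):

```python
import math

import numpy as np

def rotation_to_euler(R: np.ndarray):
    """Decompose R = Rz(phi) @ Ry(theta) @ Rx(psi) into the Euler angles
    (psi, theta, phi) in radians, following the formulas above."""
    psi = math.atan2(R[2, 1], R[2, 2])                          # about X
    theta = math.atan2(-R[2, 0], math.hypot(R[2, 1], R[2, 2]))  # about Y
    phi = math.atan2(R[1, 0], R[0, 0])                          # about Z
    return psi, theta, phi
```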
Here atan2(y, x) behaves like atan(y/x) when |x| is larger than |y|, and is otherwise computed from atan(x/y) with the appropriate quadrant correction, which guarantees numerical stability; atan denotes the inverse tangent function.
The rotation matrix $R$ is composed of the rotation matrices of the three coordinate axes.
A rotation by $\psi$ about the X axis:
$$R_x(\psi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{bmatrix}$$
A rotation by $\theta$ about the Y axis:
$$R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}$$
A rotation by $\phi$ about the Z axis:
$$R_z(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Euler angles of the camera's three-dimensional pose: the yaw angle Yaw is the angle $\phi$ of rotation about the Z axis, the roll angle Roll is the angle $\theta$ of rotation about the Y axis, and the pitch angle Pitch is the angle $\psi$ of rotation about the X axis.
Using the known camera intrinsic parameters (obtained above by camera calibration), known 2D image coordinate points (the corner coordinates obtained by extracting the square corner-point features on the object) and the corresponding custom 3D space points (whose order must agree with the order of the 2D corner points), the extrinsic parameters of the camera (i.e. the rotation vector and translation vector) are solved. In the present embodiment, the solvePnP function provided by OpenCV solves the extrinsic parameters. Finally, using the camera intrinsic parameters, the camera extrinsic parameters and the custom 3D-space coordinate points (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.0, 1.5), i.e. the space origin and points on the space axes, the one-to-one corresponding 2D image coordinate points are solved; the projectPoints function provided by OpenCV computes the image coordinates corresponding to the known spatial axis points, and the corresponding image coordinate points are finally connected together as the object's spatial coordinate system.
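A condensed sketch of this extrinsic solve and re-projection is shown below; the intrinsic values, the added Y-axis reference point (0, 1.5, 0) and the detected 2D corners are illustrative assumptions (solvePnP needs at least four point pairs), not values given in the patent.

```python
import cv2
import numpy as np

# Intrinsics from calibrateCamera (illustrative placeholder values).
M = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.zeros(4)  # assume negligible distortion for this sketch

# The patent's custom reference points plus an assumed Y-axis point.
object_pts = np.array([[0.0, 0.0, 0.0],
                       [1.5, 0.0, 0.0],
                       [0.0, 1.5, 0.0],
                       [0.0, 0.0, 1.5]], dtype=np.float32)
# Detected 2D corners in the same order (illustrative values).
image_pts = np.array([[320.0, 240.0],
                      [420.0, 238.0],
                      [318.0, 150.0],
                      [322.0, 260.0]], dtype=np.float32)

# Extrinsics: rotation vector rvec and translation vector tvec.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, M, D)

# Rodrigues converts rvec into the 3x3 rotation matrix used above
# for the Euler angle decomposition.
R, _ = cv2.Rodrigues(rvec)

# Re-project the reference points to 2D to draw the object's frame.
axes_2d, _ = cv2.projectPoints(object_pts, rvec, tvec, M, D)
```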
The image is processed by the object detection algorithm to judge whether a human body is present: the picture, prepared according to the prescribed requirements, is fed into the trained object detection model, which outputs the frame position of the human body and the confidence that it is a human body. According to the model training results, a human-body confidence greater than or equal to 0.4 is judged as a person present; anything else is judged as no person. The four vertices of the desktop or books are obtained at the same time, for the later distance calculation.
Human body detection trains a MobileNet-SSD model on the calibrated dataset; protobuf is compiled because the Object Detection API uses protobuf to train the model and configure its parameters.
VGG16 is used as the base model: the fully connected layers fc6 and fc7 of VGG16 are converted into a 3*3 convolutional layer conv6 with dilation rate 6 and a 1*1 convolutional layer conv7, the dropout layers and the fc8 layer are removed, and new convolutional layers are added to obtain more feature maps for predicting offsets and confidences.
The input of the algorithm is 300*300*3. The outputs of conv4_3 (feature map size 38*38), conv7 (19*19), conv8_2 (10*10), conv9_2 (5*5), conv10_2 (3*3) and conv11_2 (1*1) are used, extracting 6 feature maps in total to predict locations and confidences; altogether 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732 bounding boxes (default boxes) can be predicted.
The default boxes take the different aspect ratios [1, 2, 3, 1/2, 1/3]. The scale, width, height and centre of the default boxes are computed as

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \qquad w_k^a = s_k\sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}, \qquad \left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right)$$

where $a_r$ is the aspect ratio, $m$ is the number of feature maps, and $f_k$ is the side length (length or width) of the k-th feature map.
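A sketch of the default-box generation these formulas imply, assuming the usual SSD scale range and ignoring the per-layer ratio subsets (some layers use only 4 of the 6 boxes) and the extra box for ratio 1:

```python
import itertools
import math

def default_boxes(fmap_sizes=(38, 19, 10, 5, 3, 1),
                  s_min=0.2, s_max=0.9,
                  ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """Generate SSD default boxes as (cx, cy, w, h) in [0, 1] coordinates."""
    m = len(fmap_sizes)
    boxes = []
    for k, f in enumerate(fmap_sizes):
        s_k = s_min + (s_max - s_min) * k / (m - 1)   # scale of level k
        for i, j in itertools.product(range(f), repeat=2):
            cx, cy = (j + 0.5) / f, (i + 0.5) / f     # cell centre
            for ar in ratios:
                boxes.append((cx, cy, s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
    return boxes
```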
The data read by the human-body detection model are normalised data, i.e. the marked coordinates and the length and width of the detected object must be divided by the length and width of the original image.
The loss function is divided into two parts, localisation loss and classification loss; for negative (counter-example) prediction boxes the localisation loss is zero:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

Here the matching indicator is $x_{ij}^k \in \{0, 1\}$, where $k$ is the object category, $i$ is the default box index and $j$ is the ground truth box index. Since each default box corresponds to at most one ground truth box, once $i$ is fixed, $j$ and $k$ are fixed; and once $j$ is fixed, the object category of the ground truth box numbered $j$ is determined.
Depthwise separable convolution: a depthwise separable convolution splits a standard convolution into a depthwise convolution plus a 1*1 (pointwise) convolution. The depthwise convolution applies a single filter to each single input channel, yielding a stack with the depth of the number of input channels; a 1*1 convolution then linearly combines the outputs of the depthwise convolution.
Input feature map: $D_F \times D_F \times M$; convolution kernel: $D_K \times D_K \times M \times N$; output feature map: $D_G \times D_G \times N$.
Standard convolution cost: $D_K \times D_K \times M \times N \times D_F \times D_F$
Depthwise separable convolution cost: $D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$
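The saving is easy to check numerically; the sketch below compares the two multiply-accumulate counts for illustrative layer sizes (the ratio works out to $1/N + 1/D_K^2$):

```python
def conv_costs(DF: int, M: int, N: int, DK: int):
    """Multiply-accumulate counts for a DF x DF x M input, stride 1."""
    standard = DK * DK * M * N * DF * DF
    separable = DK * DK * M * DF * DF + M * N * DF * DF  # depthwise + pointwise
    return standard, separable, separable / standard

# e.g. a 3x3 convolution with M = N = 256 channels on a 38x38 map:
std, sep, ratio = conv_costs(DF=38, M=256, N=256, DK=3)
print(f"standard={std:,} separable={sep:,} ratio={ratio:.3f}")  # ratio ~ 0.115
```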
The face detection algorithm judges whether a face is present. After a person has been found in the picture, the picture is checked for a face: it is fed into the trained model according to the prescribed requirements, which outputs the positions of 5 points, the eyes (left and right), the nose and the mouth (its two corners), and the confidence that they form a face. According to the model training results, a confidence greater than or equal to 0.4 is judged as a face present; anything else is judged as no face.
Face detection model training uses MTCNN (Multi-task Convolutional Neural Network), which places face region detection and face key-point detection together and is based on a cascaded framework. It divides overall into the three-tier network architecture of PNet, RNet and ONet.
The full name of P-Net is Proposal Network, and its basic structure is a fully convolutional network. It performs preliminary feature extraction and box calibration on the previously constructed image pyramid through an FCN, then adjusts the windows with Bounding-Box Regression and filters out most windows with NMS.
In general Pnet performs only two tasks, detection and face box regression. Although the input size at network definition time is 12*12*3, since Pnet contains only convolutional layers we can feed the resized image directly into the network for a forward pass; the result is then not 1*1*2 and 1*1*4 but m*m*2 and m*m*4. There is therefore no need to first crop out all the 12*12*3 patches of the resized image and feed them in one by one: the image is fed in at once, and from each result the position in the input picture of the corresponding 12*12 patch is derived.
For every image in the pyramid, the forward pass of the network yields the face score and the face box regression results. The face classification score is a two-channel three-dimensional matrix m*m*2, which in fact corresponds to m*m 12*12 sliding windows on the network input picture; combining this with the scaling factor of the current image within the pyramid, the exact coordinates of each sliding window in the original image can be derived. Screening is done by score first: sliding windows whose score is below the threshold are excluded. The remaining windows are then merged with non-maximum suppression (NMS). After all images in the pyramid have been processed, NMS is applied once more to merge the aggregated windows, and the Bbox results corresponding to the remaining windows are converted into pixel coordinates in the original image, giving the coordinates of the face boxes.
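A sketch of the greedy NMS step used for this window merging, assuming boxes as (x1, y1, x2, y2) rows of a NumPy array and scores as a NumPy vector:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                  (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```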
The full name of R-Net is Refine Network, and its basic structure is a convolutional neural network. Compared with the first-layer P-Net, it adds a fully connected layer, so its screening of input data is stricter. After the picture has passed through P-Net many prediction windows are left, and we send all of them into R-Net:
the face boxes produced by Pnet are cropped from the original image and resized to 24*24*3 as Rnet's input. The output is again scores and BBox regression results. Candidate boxes whose score is below the threshold are discarded, the remaining candidates are merged by NMS, and the BBox regression results are then mapped back onto the pixel coordinates of the original image. What Rnet finally yields is thus the face boxes selected from the Pnet results.
The full name of O-Net is Output Network, and its basic structure is a more complex convolutional neural network, with one more convolutional layer than R-Net. O-Net differs from R-Net in that this layer's structure identifies facial regions through more supervision and can regress the person's facial feature points, finally outputting five facial feature points.
The face boxes produced by Rnet are cropped from the original image and resized to 48*48*3 as Onet's input. The output is scores, BBox regression results and landmark position data. For candidate boxes whose score exceeds the threshold, the corresponding Bbox regression data and landmark data are kept and mapped into the original image coordinates, and NMS is performed once more to merge the face boxes.
The MTCNN feature descriptor mainly comprises 3 parts: a face/non-face classifier, bounding box regression, and landmark localisation.
The cross-entropy loss function for face classification is

$$L_i^{det} = -\left(y_i^{det}\log(p_i) + (1 - y_i^{det})\log(1 - p_i)\right)$$

where $p_i$ is the probability produced by the network that sample $i$ is a face and $y_i^{det} \in \{0, 1\}$ is the ground-truth label.
The bounding box regression computes the Euclidean regression loss

$$L_i^{box} = \left\lVert \hat{y}_i^{box} - y_i^{box} \right\rVert_2^2$$

where $\hat{y}_i^{box}$ is the box coordinate predicted by the network and $y_i^{box}$ is the actual ground-truth box coordinate.
Landmark localisation is computed as the regression loss

$$L_i^{landmark} = \left\lVert \hat{y}_i^{landmark} - y_i^{landmark} \right\rVert_2^2$$

where $\hat{y}_i^{landmark}$ is the landmark coordinate predicted by the network and $y_i^{landmark}$ is the actual ground-truth landmark coordinate.
The training of the face detection module over multiple input sources minimises

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\,box,\,landmark\}} \alpha_j\,\beta_i^j\,L_i^j$$

with P-Net and R-Net using $(\alpha_{det} = 1, \alpha_{box} = 0.5, \alpha_{landmark} = 0.5)$ and O-Net using $(\alpha_{det} = 1, \alpha_{box} = 0.5, \alpha_{landmark} = 0.5)$, where $N$ is the number of training samples, $\alpha_j$ is the importance of task $j$, $\beta_i^j \in \{0, 1\}$ is the sample-type indicator and $L_i^j$ is the loss function.
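A per-sample NumPy sketch of this weighted multi-task loss; the function name, the epsilon guard and the argument layout are mine, not from the patent:

```python
import numpy as np

def mtcnn_multitask_loss(p, y_det, box_pred, box_gt, lm_pred, lm_gt,
                         beta, alpha=(1.0, 0.5, 0.5), eps=1e-12):
    """Weighted sum of the three per-sample losses above. beta[j] in {0, 1}
    switches each task on or off for this sample type; alpha follows the
    (det, box, landmark) weights quoted in the text."""
    l_det = -(y_det * np.log(p + eps) + (1 - y_det) * np.log(1 - p + eps))
    l_box = np.sum((np.asarray(box_pred) - np.asarray(box_gt)) ** 2)
    l_lm = np.sum((np.asarray(lm_pred) - np.asarray(lm_gt)) ** 2)
    losses = np.array([l_det, l_box, l_lm])
    return float(np.dot(np.asarray(alpha) * np.asarray(beta), losses))
```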
During training, the Intersection-over-Union (IoU) ratio between $\hat{y}$ and $y$ determines the sample type as follows (a code sketch of this labelling follows the list):
0-0.3: non-face
0.65-1.00: face
0.4-0.65: part face
0.3-0.4: landmark
The ratio of training samples is negative : positive : part : landmark = 3 : 1 : 1 : 2.
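A sketch of the IoU computation and the sample labelling above (boxes as (x1, y1, x2, y2) tuples):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def sample_type(v: float) -> str:
    """Map a crop's IoU with the ground truth to its MTCNN training label."""
    if v < 0.3:
        return "negative"    # non-face
    if v < 0.4:
        return "landmark"
    if v < 0.65:
        return "part face"
    return "positive"        # face
```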
The data calibrated by the face detection algorithm judge whether the posture meets the correct reading and writing standard. The camera carried on the desk lamp is 15 cm from the base plane. As long as the desk lamp is placed within the range of 30° to 150° facing the person, and the person's head is within 80 cm of the camera, then with the person sitting upright the camera captures a picture that contains the person's full head and a clear face, and the captured picture satisfies the requirements of the face recognition model's calculations. When reading and writing normally, a person's shoulders are level and the head bows slightly; to avoid the calculation error caused by this slight bowing, the camera carried on this desk lamp has an elevation of about 15°.
This method judges a correct reading and writing posture by the following two conditions, which must be met simultaneously:
A. the person sits upright, without obvious head tilting or bowing;
B. the distance from the eyes to the desktop is greater than 35 centimetres.
Three angles, Yaw, Roll and Pitch, are obtained from the face model. The cases of poor posture are the following (a code sketch of this check follows the list):
1. Yaw in the interval [0, 30)
2. Yaw in the interval [30, 45) and Roll less than -10
3. Yaw greater than or equal to 45 and Roll less than 0
4. Yaw in the interval [-10, 0) and Roll less than -3
5. Yaw in the interval [-40, -30) and Roll less than -5
6. Yaw in the interval [-30, -10) and Roll less than -10
7. Yaw greater than -40 and Roll less than -30
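The sketch below transcribes these seven cases directly; angle units are degrees, as in the table:

```python
def posture_ok(yaw: float, roll: float) -> bool:
    """Return False when (yaw, roll) falls into any of the seven
    poor-posture cases listed above."""
    bad = (
        (0 <= yaw < 30) or
        (30 <= yaw < 45 and roll < -10) or
        (yaw >= 45 and roll < 0) or
        (-10 <= yaw < 0 and roll < -3) or
        (-40 <= yaw < -30 and roll < -5) or
        (-30 <= yaw < -10 and roll < -10) or
        (yaw > -40 and roll < -30)
    )
    return not bad
```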
The distance calculation judges whether the distance between the eyes and the desktop is greater than the correct reading and writing distance. From the four vertex coordinates of the desktop or books returned by the object detection model, the coordinate of the intersection of the quadrilateral's diagonals is computed, together with the obtained eye coordinate (the centre point of the left- and right-eye coordinates). The distance from the intersection point to the eye centre point is computed, and from the proportional relationship between this distance and the camera focal length, the actual distance from the eyes to the desktop or books is calculated. When this distance is less than 35 centimetres, the posture is considered poor.
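The patent does not spell out the exact proportionality, so the sketch below is one plausible reading based on the pinhole relation Z = f·W/w_px; the assumed known edge width book_width_mm and the focal length in pixels are illustrative inputs, not values from the patent.

```python
import numpy as np

def _cross2(u, v):
    """z-component of the 2D cross product."""
    return u[0] * v[1] - u[1] * v[0]

def diag_intersection(v):
    """Crossing point of the diagonals of the quadrilateral v[0..3] (pixels)."""
    a1, a2, b1, b2 = (np.asarray(p, float) for p in (v[0], v[2], v[1], v[3]))
    da, db, dp = a2 - a1, b2 - b1, b1 - a1
    t = _cross2(dp, db) / _cross2(da, db)
    return a1 + t * da

def reading_distance_ok(v, left_eye, right_eye, focal_px, book_width_mm,
                        min_mm=350.0):
    """Pinhole estimate of the eye-to-book distance: depth Z = f * W / w_px
    plus the in-plane offset between the eye midpoint and the book centre."""
    centre = diag_intersection(v)
    eye_mid = (np.asarray(left_eye, float) + np.asarray(right_eye, float)) / 2
    width_px = np.linalg.norm(np.asarray(v[0], float) - np.asarray(v[1], float))
    mm_per_px = book_width_mm / width_px             # scale at the book's depth
    depth_mm = focal_px * mm_per_px                  # Z = f * W / w_px
    offset_mm = np.linalg.norm(eye_mid - centre) * mm_per_px
    return float(np.hypot(depth_mm, offset_mm)) >= min_mm
```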
Whether the poor postures accumulated within the prescribed time and the sum of distance errors exceed the prescribed standard is judged, and a voice reminder is given according to the result.
The basic principles underlying the design of the present invention are as follows:
FCN (fully convolutional network)
A fully convolutional network removes the fully connected layers of a traditional convolutional network and instead deconvolves (upsamples) the feature map of the last convolutional layer (or another suitable convolutional layer) so that it is restored to the size of the original image (or another size). This allows a classification prediction to be made for every pixel of the deconvolved image while preserving the spatial information of the original image. During the deconvolution operation, the deconvolution results of other convolutional layers can also be extracted to refine the final prediction, so that the result is better and finer.
IoU
For a sub-target image within some image and the prediction box calibrated for that sub-target image, the degree of coincidence between the final predicted box and the ground-truth box of the sub-image (which generally must be calibrated manually) is called the IoU (Intersection over Union); the standard commonly used is the intersection area of the two boxes divided by the area of their union.
Bounding-Box regression:
Problem solved: when the IoU is below some value, one approach is simply to discard the corresponding prediction result; the purpose of Bounding-Box regression is instead to fine-tune this prediction window so that it approaches the true value.
Specific logic
Within image detection, a child window is generally represented by a four-dimensional vector (x, y, w, h), giving the parent-image coordinates of the child window's centre together with its own width and height. The goal is that, when the prediction window deviates too much from the real window, a certain transformation brings the prediction window closer to the true value.
In actual use, the input and output of the transformation are the already-transformed results supplied by the specific algorithm and the final best-fitting result; it can be understood as linear regression under a loss function.
NMS (non-maximum suppression)
As the name suggests, non-maximum suppression suppresses elements that are not maxima. In the field of object detection it can quickly remove prediction boxes that overlap heavily but are calibrated relatively inaccurately; however, the method is unfriendly to the detection of genuinely overlapping targets.
Soft-NMS
An improved target detection method for optimising overlap. Its core is that, when performing NMS, the suppressed objects are not deleted directly; instead their confidence is reduced, and a unified deletion by a single confidence threshold is carried out once at the end of processing.
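A sketch of Gaussian Soft-NMS as described, with an inline IoU helper (sigma and the final threshold are conventional defaults, not values from the patent):

```python
import numpy as np

def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the confidence of overlapping boxes instead
    of deleting them outright, then prune once by a single threshold."""
    scores = np.asarray(scores, dtype=float).copy()
    idx = np.arange(len(scores))
    keep = []
    while idx.size > 0:
        top = idx[np.argmax(scores[idx])]
        keep.append(int(top))
        idx = idx[idx != top]
        if idx.size == 0:
            break
        ious = np.array([_iou(boxes[top], boxes[i]) for i in idx])
        scores[idx] *= np.exp(-(ious ** 2) / sigma)   # confidence decay
        idx = idx[scores[idx] > score_thresh]         # unified final deletion
    return keep
```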
PReLU
In MTCNN, the activation function used by the convolutional network is PReLU, a ReLU with parameters. Compared with ReLU's approach of filtering out negative values, PReLU applies a learned parameter to negative values rather than discarding them outright. This brings the algorithm more computation and more possibilities of overfitting, but because more information is retained, the fitting performance of the training result may also be better.
The above are preferred embodiments of the present invention, but the present invention should not be limited to the content disclosed in the embodiments and drawings; all equivalents and modifications completed without departing from the spirit disclosed in this invention fall within the protection scope of the present invention.

Claims (8)

1. An intelligent identification method for read-write elements in a read-write scene, characterized in that the steps include:
S1: obtain the original image of the read-write elements in the read-write scene and the image information of the desktop or books;
S2: run detection algorithms on the original image of the read-write elements to judge whether a human body and a face are present;
S3: obtain the three-dimensional spatial coordinates of the desktop, books and face through coordinate identification;
S4: obtain the distance between the eyes and the books by matching the three-dimensional coordinates of the desktop, books and face;
S5: judge, from the matching of the three-dimensional coordinates of the desktop, books and face, whether the posture meets the correct reading and writing standard.
2. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 1, characterized in that: the concrete step of detecting whether a human body is present with the algorithm is to feed the picture, prepared according to the prescribed requirements, into the trained object detection model, obtaining the frame position of the human body and the confidence that it is a human body.
3. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 2, characterized in that: face detection runs after a person has been found in the picture and checks whether the picture contains a face; the picture is fed, according to fixed requirements, into the trained face detection model, which outputs the positions of five points (the left and right eyes, the nose and the two corners of the mouth) and the confidence that they form a face.
4. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 1, characterized in that: before the three-dimensional spatial coordinates of the desktop, books and face are determined, the intrinsic parameters of the camera must first be obtained; the intrinsic parameters of the camera comprise basic parameters and a distortion coefficient vector, the basic parameters include the principal point of the image optical axis and the focal lengths in the X and Y directions, and the distortion coefficient vector includes the tangential and radial distortion coefficients.
5. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 4, characterized in that: the three-dimensional coordinates of the object are calculated with the pinhole imaging model; a projection centre is set, the principal point is the image principal point through which the optical axis passes, (X, Y, Z) are the object coordinates in the spatial coordinate system, and (x, y, z) are the image pixel coordinates.
6. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 5, characterized in that: using the camera intrinsic parameters, the camera extrinsic parameters and the custom 3D-space coordinate points (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.0, 1.5), i.e. the space origin and points on the space X, Y and Z axes, the one-to-one corresponding 2D image coordinate points are solved; the projectPoints function provided by OpenCV computes the image coordinates corresponding to the known spatial axis points, and the corresponding image coordinate points are finally connected together as the object's spatial coordinate system.
7. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 1, characterized in that: the face detection training places face region detection and face key-point detection together via MTCNN, is based on a cascaded framework, and divides overall into the three-tier network architecture of PNet, RNet and ONet.
8. The intelligent identification method for read-write elements in a read-write scene as claimed in claim 7, characterized in that: the MTCNN feature descriptor mainly comprises three parts, a face/non-face classifier, bounding box regression, and landmark localisation.
CN201910547742.9A 2019-06-24 2019-06-24 Intelligent identification method for read-write elements in a read-write scene Pending CN110287864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910547742.9A CN110287864A (en) 2019-06-24 2019-06-24 Intelligent identification method for read-write elements in a read-write scene


Publications (1)

Publication Number Publication Date
CN110287864A true CN110287864A (en) 2019-09-27

Family

ID=68004843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910547742.9A Pending CN110287864A (en) Intelligent identification method for read-write elements in a read-write scene

Country Status (1)

Country Link
CN (1) CN110287864A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014053772A (en) * 2012-09-07 2014-03-20 Kyocera Document Solutions Inc Image reader and image forming apparatus including the same
CN103488980A (en) * 2013-10-10 2014-01-01 广东小天才科技有限公司 Sitting posture judging method and device based on camera
CN105139447A (en) * 2015-08-07 2015-12-09 天津中科智能技术研究院有限公司 Sitting posture real-time detection method based on double cameras
CN105335699A (en) * 2015-09-30 2016-02-17 李乔亮 Intelligent determination method for reading and writing element three-dimensional coordinates in reading and writing scene and application thereof
US20190102608A1 (en) * 2017-09-29 2019-04-04 Alibaba Group Holding Limited System and method for entity recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAIPENG ZHANG ET AL: "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks", IEEE Signal Processing Letters, 26 August 2016, pages 1-4 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021237913A1 (en) * 2020-05-27 2021-12-02 大连成者云软件有限公司 Sitting posture recognition method based on monocular video image sequence
CN111833360A (en) * 2020-07-14 2020-10-27 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium
CN111833360B (en) * 2020-07-14 2024-03-26 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium
CN111814731A (en) * 2020-07-23 2020-10-23 科大讯飞股份有限公司 Sitting posture detection method, device, equipment and storage medium
CN111814731B (en) * 2020-07-23 2023-12-01 科大讯飞股份有限公司 Sitting posture detection method, device, equipment and storage medium
CN112270815A (en) * 2020-12-23 2021-01-26 四川写正智能科技有限公司 Read-write gesture recognition method and system based on smart watch
CN112270815B (en) * 2020-12-23 2021-03-16 四川写正智能科技有限公司 Read-write gesture recognition method and system based on smart watch
WO2023000838A1 (en) * 2021-07-22 2023-01-26 北京有竹居网络技术有限公司 Information detection method and apparatus, medium, and electronic device
CN113867163A (en) * 2021-10-09 2021-12-31 深圳康佳电子科技有限公司 Intelligent household scene switching method and device, intelligent terminal and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination