CN114882483A - Book checking method based on computer vision - Google Patents


Info

Publication number
CN114882483A
CN114882483A (application CN202210337008.1A)
Authority
CN
China
Prior art keywords
book
picture
spine
text
searching number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210337008.1A
Other languages
Chinese (zh)
Inventor
陈力军
刘佳
顾桥磊
徐毅晖
陈星宇
鄢伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Tuke Robot Co ltd
Nanjing University
Original Assignee
Jiangsu Tuke Robot Co ltd
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Tuke Robot Co ltd, Nanjing University
Priority to CN202210337008.1A
Publication of CN114882483A
Legal status: Pending

Classifications

    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 30/1448 Selective acquisition, locating or processing of specific regions based on markings or identifiers characterising the document or the area
    • G06V 30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G06V 30/19007 Matching; Proximity measures
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06V 30/418 Document matching, e.g. of document images

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a book checking method based on computer vision, which comprises the following steps: step S1, acquiring a picture sequence of a single-layer bookshelf through a camera; step S2, performing optical flow estimation on the picture sequence to obtain the moving direction and distance of each pixel between two adjacent frames; step S3, constructing and training a rotating instance target detection model, performing instance target detection and instance segmentation on the picture sequence to obtain the positions of the book spine and the book searching number in the current picture, and assigning each book searching number to the corresponding book; step S4, training a text recognition model to obtain the book name and book searching number texts contained in each picture and the positions of the texts; step S5, performing target tracking to obtain the specific position and all recognition results of each book; and step S6, correcting and matching all the recognition results to obtain the real book name, and outputting the book instance position and the matched book information as the checking result. Through these steps, high efficiency and high accuracy of book inventory are finally achieved.

Description

Book checking method based on computer vision
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a book checking method based on computer vision.
Background
Books in a library circulate frequently: new books must be shelved and returned books must be sorted and re-shelved, and errors made by readers when returning books or by staff during re-shelving can cause books to end up on the wrong shelf. This in turn reduces the staff's re-shelving efficiency and prevents readers from finding the books they want. At present, a large number of libraries still manage books by scanning book barcodes, a process that is time-consuming and labor-intensive: the scanning is manual and only one book can be scanned at a time; moreover, for the sake of the book's appearance and the security of the barcode, the barcode is usually attached inside the book, so the book must be opened during scanning. To solve this problem, technicians have proposed the idea of the automated library, and a small number of libraries have adopted intelligent bookshelves, but this requires a complete retrofit of the library, which is costly and complicated. Therefore, some libraries have already adopted a more intelligent checking mode, replacing barcodes that require line-of-sight scanning with RFID chips; books are checked automatically by RFID readers and antennas mounted on a movable platform, no manual intervention is needed during scanning, and checking efficiency is greatly improved.
Computer vision is a technology that recognizes and understands targets in place of human eyes: a machine vision product such as a camera converts the captured image into image signals, and a vision system performs various operations on these signals to extract the features of the target and interpret what was captured. Compared with humans, computer vision has the advantages of low cost, high recognition speed, system stability and easy integration. Especially in industrial scenarios that require a great deal of repetitive, delicate labor, a computer vision system can deliver more stable, reliable and continuous results. Precisely because a computer vision system can automatically and quickly acquire large amounts of information, it is widely applied in industrial manufacturing, quality inspection, medical monitoring, traffic monitoring, identity authentication and other fields, and the demand for computer vision systems in China keeps increasing.
Chinese patent CN111814935A discloses a book positioning method based on an inventory robot, which comprises the following steps: S1, data acquisition: the robot acquires the electronic tags of books by using RFID (radio frequency identification) technology and transmits the electronic tags, the robot coordinates and the antenna height to a processor module; S2, data processing and conversion: book codes are extracted from the electronic tags, the electronic tags, book codes, robot coordinates and antenna height are stored in a data storage module, and the processor reads the data, converts them into an array matrix form and averages each item; S3, book positioning: the data processed in S2 are input into a book positioning module, position labels are found through a network model, and the position labels are transmitted together with the book information to a data reading module to realize book positioning. Using the network model to extract features allows book positions to be located more accurately and more rapidly, greatly reduces the workload of library managers and improves working efficiency. However, this method still has obvious problems in data processing and in the visual positioning performed with the camera.
Chinese patent CN112464682A discloses a book positioning device and method for an intelligent bookshelf based on RFID technology. The device comprises: tags; a route selector arranged for each layer of the intelligent bookshelf; a plurality of antennas uniformly distributed on each layer of the intelligent bookshelf; a reader/writer; and a controller for sending control instructions that make the route selector select the corresponding antenna to work, and for acquiring the tag information read by the reader, the antenna code corresponding to the tag information and the RSSI value corresponding to the tag information, wherein if the tag information corresponds to a plurality of antenna codes, the antenna code with the larger RSSI value is kept. By arranging a plurality of antennas on each layer of the bookshelf, the device can ensure that every tag is identified; when a tag is read by several antennas, keeping only the record with the larger RSSI value resolves the contradiction between reading all tags and locating them accurately, thereby achieving the purpose of accurately positioning books.
The rapid development of computer vision technology has promoted the intelligent upgrading and transformation of many industries, and it also brings new methods and opportunities for library inventory. RFID library inventory is already fully automated, but in libraries where it has never been deployed, a chip must be manually inserted into every book, whereas computer vision is an out-of-the-box technology that requires no preparation work in advance. In addition, even high-frequency RFID inventory at a very close inventory distance still has a certain probability of missed reads; adding computer vision on top of high-frequency inventory can largely make up for missed books, because the camera will not miss any book when shooting. However, in practical applications, the above techniques face several challenges: to meet the grasping requirements of a robotic arm, the book positioning precision must be high enough; moreover, the books in a library are densely arranged with highly similar textures, and a single photo contains many books, which increases the difficulty of book recognition and makes precise positioning hard to achieve.
Disclosure of Invention
In order to solve the above problems, the invention uses computer vision technology to give accurate recognition results in a short time without requiring any modification of the environment, achieving high efficiency and high accuracy of library inventory, so that library managers can conveniently locate misplaced books and readers can find the books they need more quickly and conveniently.
In order to achieve the effect, the invention designs a book checking method based on computer vision.
A book checking method based on computer vision comprises the following steps:
s1, recording a video through a camera, acquiring a video of a single-layer bookshelf, and splitting the video frame by frame to obtain a picture sequence;
s2, performing optical flow estimation on the picture sequence to obtain optical flow data, and further obtaining the moving direction and distance of each pixel between two adjacent frames;
s3, constructing and training a rotary example target detection model, carrying out example target detection on the picture sequence, further carrying out example segmentation to obtain the positions of the spine and the book searching number of the current picture, and distributing the book searching number to the corresponding book according to the coordinates;
step S4, training a text recognition model, and performing text recognition on the images of the book spine and book searching number areas recognized in the picture sequence to obtain book names and book searching number texts contained in each picture and the positions of the texts;
step S5, based on the position coordinates and corresponding texts of the books and the book searching numbers identified in the picture sequence, obtaining the corresponding relation between the book instances contained in every two pictures according to the optical flow data, carrying out target tracking, and tracking the appearance of the same book or the book searching number instance in all the pictures, thereby obtaining the specific position of each book in the single-layer bookshelf and all the identification results of the same book in different pictures;
and step S6, correcting and matching all the identification texts of each book or book index number example based on the book database candidate set to obtain a real book name result, and outputting the book example position and the matched book information as a checking result.
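Before each step is refined below, the overall flow of steps S1 to S6 can be summarised as a short orchestration sketch. This is only an illustration of the data flow: every callable passed in is a hypothetical placeholder for the corresponding stage, not an API defined by the method.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InventoryStages:
    split_video: Callable   # S1: video path -> list of undistorted frames
    optical_flow: Callable  # S2: (frame_a, frame_b) -> per-pixel flow
    detect: Callable        # S3: frame -> spine / book searching number instances
    recognize: Callable     # S4: (frame, instances) -> recognized texts per instance
    track: Callable         # S5: (detections, texts, flows) -> one track per book
    match: Callable         # S6: (track, candidate_books) -> inventory record

def inventory_shelf_layer(video_path, candidate_books, stages: InventoryStages):
    frames = stages.split_video(video_path)
    flows = [stages.optical_flow(a, b) for a, b in zip(frames, frames[1:])]
    detections = [stages.detect(f) for f in frames]
    texts = [stages.recognize(f, d) for f, d in zip(frames, detections)]
    tracks = stages.track(detections, texts, flows)
    # each result pairs the global book position with the matched book information
    return [stages.match(t, candidate_books) for t in tracks]
```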
Preferably, in step S2, the method for performing optical flow estimation on a picture sequence includes:
step S21, constructing an optical flow estimation model realized based on PWC-Net, wherein the core is to use a multi-scale network to estimate the optical flow, calculate the optical flow from a low-resolution picture, input low-resolution optical flow data to a network with higher resolution step by step, calculate a new high-resolution optical flow, and finally obtain the optical flow data of the original picture size;
step S22, shooting videos of a plurality of bookshelf real books in advance, training an optical flow estimation model on the shot videos in a self-supervision mode, and enabling the model to learn how pixel points of a previous picture move to pixel points of a next picture;
step S23, starting from the first picture in the picture sequence, sequentially calculating optical flow data between two adjacent pictures; and correspondingly zooming the estimated optical flow data according to the ratio of the original image size to the model input size.
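Step S23 scales the optical flow estimated at the model input resolution back to the original picture size. A minimal sketch of that scaling step is given below, assuming the flow is an (h, w, 2) array of per-pixel (dx, dy) displacements; the PWC-Net model itself is not reproduced here.

```python
import cv2

def rescale_flow(flow, orig_w, orig_h):
    """Resize flow predicted at the model input size to the original picture size.

    flow: (h, w, 2) float array of per-pixel (dx, dy) displacements in pixels.
    """
    h, w = flow.shape[:2]
    resized = cv2.resize(flow, (orig_w, orig_h), interpolation=cv2.INTER_LINEAR)
    # the displacements must also be scaled so they are expressed in original-image pixels
    resized[..., 0] *= orig_w / w
    resized[..., 1] *= orig_h / h
    return resized
```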
Preferably, in step S3, the method for constructing and training the rotating instance target detection model includes:
s301, synthesizing an instance segmentation data set based on a bookshelf picture shot really;
step S302, training a target detection model Mask-RCNN based on a rotation candidate frame by using a real picture and a synthetic data set to obtain a robust and highly generalized model:
step S303, in the first stage, a one-stage model suitable for multi-scene and high-generalization full training is obtained based on synthetic data set training;
and step S304, in the second stage, fine tuning training is carried out based on the real picture, so that the model in the first stage can better adapt to the real scene, the problem of inconsistent distribution of training data caused by noise in the synthetic data set is corrected, and the high-precision target detection model fitting the real prediction scene is obtained.
Preferably, in step S3, the method for instance-dividing the picture sequence includes:
s311, sending each picture in the picture sequence into a target detection model to obtain all the appeared enclosing frames of the spine instance and the book searching number instance; specifically, the output of the model is a rotating bounding box which contains the coordinates of the original rectangle and the rotating angle of the original rectangle;
step S312, calculating to obtain four-corner coordinates of a rotary surrounding frame according to the obtained position coordinates and the inclination angle of the rectangular frame, wherein the surrounded image area is the really recognized example of the spine or the book searching number;
s313, filtering the results of the book spine and the book searching number example bounding box with too small area, judging whether the coordinates of four corners of the bounding box are all in the boundary of the picture, and if angular points beyond the range exist, removing the corresponding example identification results;
step S314, dividing the edges of the book spine instance bounding box into a left-right line pair and an upper-lower line pair, taking only the left and right lines, extending them respectively to the upper and lower edges of the picture, and taking the polygon formed by the four intersection points as the new instance bounding box, which is equivalent to completing the left and right edge segmentation of the book;
step S315, the above operation is not needed for the bounding box of the book searching number example, and the range of the original bounding box is directly taken as a segmentation result;
and step S316, respectively acting non-maximum value inhibition on the spine and the index number example segmentation boxes, and only keeping the example which has high confidence and less overlap with other examples as a final segmentation result.
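Steps S312 to S314 are essentially plane geometry. The sketch below recovers the four corners of a rotated box, checks that they lie inside the picture, and stretches the left and right edges of a spine box to the top and bottom of the picture; it assumes the detector outputs (cx, cy, w, h, angle in degrees), which is one common rotated-box parameterisation rather than a format fixed by the method.

```python
import numpy as np

def rotated_box_corners(cx, cy, w, h, angle_deg):
    """Four corner points of a rotated rectangle (step S312)."""
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    half = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                     [w / 2, h / 2], [-w / 2, h / 2]])
    return half @ R.T + np.array([cx, cy])

def inside_image(corners, img_w, img_h):
    """Step S313: keep only instances whose corners all lie inside the picture."""
    return np.all((corners[:, 0] >= 0) & (corners[:, 0] <= img_w) &
                  (corners[:, 1] >= 0) & (corners[:, 1] <= img_h))

def extend_spine_box(corners, img_h):
    """Step S314: keep the left/right edge lines of the spine box and extend
    them to the top (y = 0) and bottom (y = img_h) of the picture."""
    order = np.argsort(corners[:, 0])          # sort corners left to right
    left, right = corners[order[:2]], corners[order[2:]]

    def line_at(p, q, y):
        # x coordinate of the line through p and q at height y
        if abs(q[1] - p[1]) < 1e-6:
            return p[0]
        t = (y - p[1]) / (q[1] - p[1])
        return p[0] + t * (q[0] - p[0])

    return np.array([[line_at(*left, 0), 0],
                     [line_at(*right, 0), 0],
                     [line_at(*right, img_h), img_h],
                     [line_at(*left, img_h), img_h]])
```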
Preferably, in step S4, the method for training the text recognition model and performing text recognition includes:
step S401, training a text recognition model based on an ICDAR 2019-LSVT Chinese text data set, wherein the text recognition model consists of a detection part model and a recognition part model: constructing a detection model based on DB, and adopting mixed precision training; the recognition model is composed of CRNN;
s402, recognizing position coordinates of a text instance appearing in a picture sequence and a corresponding text by using a trained text recognition model;
step S403, for each picture in the picture sequence, distributing the text corresponding to the text box to each spine and book searching number example according to the segmented spine and book searching number positions and the identified text box position, wherein the text corresponding to each spine and book searching number box is the splicing result of the distributed text; thus obtaining the position frame of each book spine and each book searching number and the corresponding identified text, namely the pair of < picture, position coordinate and identification text >; the book name recognition result comprises information of a book title, an author, a publishing company and the like, the book searching number recognition result comprises information of a book searching number English letter, a library name and the like, and Chinese characters in the book searching number need to be removed;
step S404, calculating the spine instance to which the book searching number belongs according to the spine and book searching number coordinates obtained by instance segmentation, and allocating the book searching number identification text to the spine instance to finally obtain a result pair of < the picture, the book position, the book name and the book searching number identification text >.
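A minimal sketch of the assignment in steps S403 and S404: a recognized text box is given to the spine or book searching number instance whose polygon contains the centre of the text box, and the texts inside one instance are concatenated top-to-bottom. The shapely library is assumed here purely for polygon containment; it is not required by the method.

```python
from shapely.geometry import Point, Polygon

def assign_texts_to_instances(instances, text_boxes):
    """instances: list of dicts {"polygon": [(x, y), ...], "texts": []}
    text_boxes: list of dicts {"box": [(x, y), ...], "text": str}

    Appends each recognized text to the instance whose polygon contains the
    centre of the text box, then concatenates the texts of each instance in
    top-to-bottom order as an approximation of reading order on a spine."""
    polygons = [Polygon(inst["polygon"]) for inst in instances]
    for tb in text_boxes:
        xs = [p[0] for p in tb["box"]]
        ys = [p[1] for p in tb["box"]]
        centre = Point(sum(xs) / len(xs), sum(ys) / len(ys))
        for inst, poly in zip(instances, polygons):
            if poly.contains(centre):
                inst["texts"].append((min(ys), tb["text"]))
                break
    for inst in instances:
        inst["text"] = "".join(t for _, t in sorted(inst["texts"]))
    return instances
```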
Preferably, in step S5, the target tracking process includes: based on the optical flow data, calculating for the centre of every book spine and book searching number instance in the previous picture which book spine or book searching number instance region it moves into in the next picture, thereby matching the repeated book instances appearing in the two pictures; the < picture, book position, book name and book searching number identification text > pairs belonging to the same book are aggregated, yielding all recognition results of the same book in different frames in the form < instance coordinate, book name and book searching number text recognition sequence >; the coordinates of each book instance at this point are global coordinates within the bookshelf layer, with the starting or ending position of the bookshelf as the origin.
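A simplified sketch of this optical-flow-based matching is shown below; for brevity the instances are treated as axis-aligned boxes, whereas the real instances are rotated polygons (testing the predicted centre against the polygon instead of the box is a direct extension).

```python
def match_instances_by_flow(prev_instances, next_instances, flow):
    """prev_instances / next_instances: lists of boxes (x1, y1, x2, y2).
    flow: (h, w, 2) array of per-pixel (dx, dy) from the previous picture
    to the next one. Returns pairs (i, j) of matched instance indices."""
    matches = []
    for i, (x1, y1, x2, y2) in enumerate(prev_instances):
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        dx, dy = flow[cy, cx]          # displacement of the centre pixel
        nx, ny = cx + dx, cy + dy      # predicted centre position in the next picture
        for j, (a1, b1, a2, b2) in enumerate(next_instances):
            if a1 <= nx <= a2 and b1 <= ny <= b2:
                matches.append((i, j))
                break
    return matches
```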
Preferably, in the step S6, the method for correcting and matching based on the book database candidate set includes:
for the < instance coordinate, book name and book searching number text recognition sequence > obtained by target tracking, selecting the recognition result with the highest confidence from the recognition sequences: first, the book searching number is exactly matched against the database of the whole library (the match succeeds only if all characters are identical), and if any item in the book searching number recognition sequence is matched successfully, the matched book information is taken as the inventory result; otherwise, book name matching is performed: each recognition result in the book name recognition sequence is segmented into words, candidates sharing at least one segmented word are screened from the local candidate set of books on the current layer, the similarity between the candidates and the currently recognized book name is then calculated with TF-IDF, and the candidate book name with the highest similarity exceeding a set similarity threshold is taken as the correction result; finally, the correction result that occurs most frequently among the book name correction results is selected as the inventory result, and if the occurrence counts are equal, the correction result with the highest score is selected.
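The book name correction branch can be sketched as follows: segment the recognized title into words, keep only candidates from the current layer that share at least one word, score them with TF-IDF cosine similarity and accept the best candidate above the threshold. The jieba and scikit-learn packages are assumed here as one possible implementation of word segmentation and TF-IDF; the method does not prescribe these libraries.

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def correct_title(recognized, candidates, threshold=0.5):
    """Return the candidate title most similar to the recognized text,
    or None if no candidate exceeds the similarity threshold."""
    words = set(jieba.lcut(recognized))
    # keep only candidate titles that share at least one segmented word
    filtered = [c for c in candidates if words & set(jieba.lcut(c))]
    if not filtered:
        return None
    corpus = [" ".join(jieba.lcut(t)) for t in [recognized] + filtered]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    best = sims.argmax()
    return filtered[best] if sims[best] >= threshold else None
```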
Preferably, in step S301, the synthesized instance division data set includes a real book picture and a synthesized book picture; the synthesis method comprises the following steps:
step S3011, arranging a camera under a scene of a real bookshelf, and moving the camera at a constant speed to shoot book pictures on different layers of the bookshelf, so as to obtain a plurality of pictures of the real bookshelf;
step S3012, synthesizing and enhancing an example segmentation data set based on the shot real bookshelf picture: cutting out the book spine of each book from the real picture, and performing data enhancement such as random rotation, random deformation, random illumination change, random white noise increase, random light spot addition and the like on the book spine to obtain a plurality of new book spine examples;
step S3013, constructing a background picture library based on the real empty bookshelf background picture and the background picture randomly selected from the ImageNet data set, randomly extracting a certain number of backgrounds from the background picture library, randomly pasting the synthesized new spine instance to random positions in the backgrounds, obtaining the coordinates of the four corners of the new spine instance under the new background, and automatically adding the coordinates into the synthesized labeling file.
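A compressed sketch of the synthesis in steps S3012 and S3013: cut-out spine instances are augmented and pasted side by side onto a randomly chosen background while the four-corner coordinates are recorded as labels. Only two of the listed augmentations (illumination change and scaling) are shown; the full augmentation set and all numeric ranges below are illustrative assumptions.

```python
import random
import cv2

def synthesize_shelf_image(backgrounds, spine_crops, max_gap=30):
    """backgrounds: list of background images; spine_crops: list of spine images.
    Returns a synthetic shelf picture and a list of four-corner labels."""
    bg = random.choice(backgrounds).copy()
    h, w = bg.shape[:2]
    labels, x = [], random.randint(0, max_gap)
    while True:
        spine = random.choice(spine_crops)
        if random.random() < 0.5:                               # illumination change
            spine = cv2.convertScaleAbs(spine, alpha=random.uniform(0.8, 1.2),
                                        beta=random.uniform(-20, 20))
        scale = min(h / spine.shape[0], 1.0) * random.uniform(0.8, 1.0)
        spine = cv2.resize(spine, None, fx=scale, fy=scale)     # fit the shelf height
        sh, sw = spine.shape[:2]
        if x + sw > w:                  # stop when the next paste would leave the image
            break
        y = random.randint(0, h - sh)
        bg[y:y + sh, x:x + sw] = spine  # paste the augmented spine
        labels.append([(x, y), (x + sw, y), (x + sw, y + sh), (x, y + sh)])
        x += sw + random.randint(0, max_gap)   # random gap to the next book
    return bg, labels
```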
Preferably, in step S6, before performing the modified matching on the recognition results of the title text and the index text, the method further includes: and inquiring the information of the books on shelves on the corresponding bookshelf layer from the library database according to the number of the bookshelf and the number of the layer scanned by the current camera, and taking the information as a matched candidate book set.
Preferably, in step S1, the method for splitting video frame by frame includes: reading the video frame by frame, and carrying out operations such as distortion correction and image rotation on each frame of picture to obtain a distortion-free bookshelf front view.
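A minimal sketch of the frame-by-frame splitting with distortion correction and rotation described above, using OpenCV. The frame sampling step and the 90-degree rotation (needed because the camera is mounted vertically, see Example 2) are assumptions for illustration; the camera matrix and distortion coefficients come from a prior calibration.

```python
import cv2

def split_video(video_path, camera_matrix, dist_coeffs, frame_step=5):
    """Read the shelf video, keep every `frame_step`-th frame, undistort it
    and rotate it so that the shelf appears as an upright front view."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)
            frames.append(cv2.rotate(undistorted, cv2.ROTATE_90_CLOCKWISE))
        idx += 1
    cap.release()
    return frames
```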
The application has the advantages and effects as follows:
1. By using the rotating instance target detection module, the method and the device solve the problems of numerous missed detections and low precision that traditional instance segmentation suffers in scenes with densely packed and tilted books; meanwhile, the target tracking algorithm ensures that the same book is instance-segmented and text-recognized multiple times, and allows the recognition results in different frames to be considered together so that the result with the highest confidence is taken as the final result, which solves the problems of missed books and inaccurate checking results to a great extent.
2. According to the method and the device, from the perspective of user experience, the example segmentation and target tracking technology is adopted, the global position coordinates of each book on the corresponding bookshelf layer are obtained, a library manager can conveniently perform wrong book positioning, and readers can find needed books more quickly and conveniently.
3. The method and the device for checking the library use the computer vision related technology to check the library, do not need to transform the environment, and can be directly deployed.
4. The computer vision identification module realized by using the deep learning technology has high robustness and reliability, and can give an accurate identification result in a short time.
The foregoing description is only an overview of the technical solutions of the present application, so that the technical means of the present application can be more clearly understood and the present application can be implemented according to the content of the description, and in order to make the above and other objects, features and advantages of the present application more clearly understood, the following detailed description is made with reference to the preferred embodiments of the present application and the accompanying drawings.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart of a book inventory method based on computer vision provided by the present application;
FIG. 2 is a diagram illustrating a disposition position relationship between a camera and a bookshelf provided by the present application;
FIG. 3 is a structural diagram of an optical flow prediction model provided in the present application;
FIG. 4 is a diagram illustrating the optical flow estimation result for a picture sequence according to the present disclosure;
FIG. 5 is a flow chart of the construction of a composite data set provided herein;
FIG. 6 is a block diagram of a rotating example object detection model provided herein;
FIG. 7 is a diagram illustrating an exemplary segmentation of a picture sequence according to the present disclosure;
fig. 8 is an effect diagram of text recognition on a picture sequence provided in the present application;
fig. 9 is an effect diagram of performing target tracking on a picture sequence according to the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. In the following description, specific details such as specific configurations and components are provided only to help the embodiments of the present application be fully understood. Accordingly, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. In addition, descriptions of well-known functions and constructions are omitted in the embodiments for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "the embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrase "one embodiment" or "the present embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Further, the present application may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, B exists alone, and A and B exist at the same time, and the term "/and" is used herein to describe another association object relationship, which means that two relationships may exist, for example, A/and B, may mean: a alone, and both a and B alone, and further, the character "/" in this document generally means that the former and latter associated objects are in an "or" relationship.
The term "at least one" herein is merely an association relationship describing an associated object, and means that there may be three relationships, for example, at least one of a and B, may mean: a exists alone, A and B exist simultaneously, and B exists alone.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion.
Example 1
The embodiment mainly introduces the overall design of a book checking method based on computer vision.
A computer vision based book inventory method, the method comprising:
recording a video through a camera, and acquiring a video of the single-layer bookshelf;
splitting the video frame by frame to obtain a picture sequence, carrying out self-supervision training on an optical flow estimation model, carrying out optical flow estimation on every two pictures, and further obtaining the moving direction and distance of each pixel between two adjacent frames;
the target detection model is realized based on the rotation candidate box, and the example segmentation precision under the book dense arrangement scene is improved: constructing a synthetic data set and training a target detection model, carrying out example target detection on the picture sequence, further carrying out example segmentation to obtain position coordinates of each book and a book searching number example, and distributing the book searching number to the corresponding book according to the coordinates;
training a text recognition model, and performing text recognition on the images of the book spine examples and the book searching number areas recognized in the picture sequence to obtain book name and book searching number texts contained in each picture and the position of the text;
based on the position coordinates and corresponding texts of the books and the book searching numbers identified in the picture sequence, obtaining the corresponding relation between book instances contained in every two pictures according to the optical flow data, tracking the appearance of the same book or book searching number instance in all the pictures, and thus obtaining the specific position of each book in the single-layer bookshelf and all the identification results of the same book in different pictures;
and correcting and matching all the identification texts of each book or book index number example based on the book database candidate set to obtain a real book name result, and outputting the book example position and the matched book information as a checking result.
The camera here means an optical image sensor: any device that uses the photoelectric conversion function of an optoelectronic device to convert the optical image on a photosensitive surface into an electrical signal proportional to that image falls within the meaning of "camera" in this application.
The video of the single-layer bookshelf can be a video of one layer of books on a certain bookshelf, a video of layers at the same height on different bookshelves, or a video of several compartments within one layer of a bookshelf; these videos are shot by a camera with a video recording function. A picture sequence refers to the video frames extracted from the video by a computer program at certain intervals. Since optical flow data between adjacent pictures needs to be calculated, the contents of adjacent pictures are required to overlap to some extent; otherwise the optical flow calculation may fail or produce errors. The optical flow is a parameter describing the moving direction and distance of pixels between adjacent pictures, that is, to which pixel of the next picture a pixel of the previous picture has moved.
The process of constructing and training the optical flow estimation model, the rotary instance target detection model and the text recognition model can be completed in advance before the camera deploys and records videos, namely, the three models can be trained by videos or pictures which are shot in advance, model weights obtained by training are stored on a server, and only corresponding weights need to be loaded directly when a visual inventory program is deployed and operated.
Since bookshelves, books and environmental factors such as illumination differ greatly between libraries, the existing model weights need further fine-tuning before deployment in a new library. The specific steps are: shoot a small number of videos or pictures in the new library, train the old model that needs fine-tuning on the new data with a lower learning rate until the validation performance reaches a high level, then stop fine-tuning, replace the old weight file on the server with the fine-tuned new weight file, and restart the checking program.
Each book instance will appear in multiple frames in the video, and when tracking and locating each uniquely appearing spine instance in the video, it is necessary to know in which frames each instance appears, and the corresponding relationships of the instances in these frames, which can be described by the optical flow data between the frames; constructing and training an optical flow estimation model based on PWC-Net, calculating optical flow parameters of the picture sequence, and obtaining a pixel corresponding relation of the picture sequence; constructing and training a Mask-RCNN model for detecting a rotating target, carrying out spine segmentation to separate a plurality of books which are continuously arranged in a picture, realizing pixel-level positioning of the books, carrying out book searching number segmentation to distinguish regions belonging to book searching numbers on each spine example, and reducing the difficulty of recognizing book searching number characters; performing target tracking on all instances by using an optical flow and a pixel level segmentation result, calculating the area of which spine or book searching number instance the centers of all spine or book searching number instances on the previous picture move to the next picture through the optical flow so as to match repeated book and book searching number instances appearing in the previous picture and the next picture, aggregating < picture, position coordinate > pairs belonging to the same book or the same book searching number, and calculating the global position of the instances on the current layer as a target tracking result; constructing and training a text recognition model based on DB and CRNN, performing once text recognition and aggregation on all appearance positions of each example after target tracking, and obtaining a plurality of text recognition results of the same book name or book searching number; and inquiring a book database of the checking library to obtain the book information of the current bookshelf layer, correcting errors of the book name or book searching number text recognition results, and matching to obtain a final checking result.
Optionally, before the camera is turned on to record the bookshelf video, the method further includes: installing a proper light supplement lamp according to the light intensity of the bookshelf and adjusting the brightness; and selecting a proper working distance, namely the distance between the camera and the bookshelf, according to the arrangement condition of the books and the focal length and the visual angle of the used camera, and adjusting various parameters of the camera.
Preferably, the process of recording the video of the single-layer book on the bookshelf by the camera comprises the following steps: fixing a camera on a movable platform, fixing a working distance and a moving speed for a bookshelf layer to be shot according to ambient light and book arrangement conditions, enabling a camera focus to be located on a shot book, and adjusting camera parameters such as exposure time, gain, white balance and the like to ensure that the camera can shoot a very clear and reliable result; and setting the camera to be in a video mode, and scanning by moving the platform at a constant speed until the tail end of the bookshelf, so as to obtain a video of a layer of books on the bookshelf.
Preferably, the video is split frame by frame into a picture sequence, and the picture sequences mentioned later in this application all refer to the series of pictures obtained by splitting the video. In order to calculate the optical flow between video frames accurately at minimum computational cost, the moving speed needs to be adjusted so that the sampling interval between pictures is not too small, which keeps the computational cost low, while adjacent pictures still overlap to a certain degree, which guarantees the accuracy of the obtained optical flow data.
Optionally, camera parameters are calibrated, and the calibrated parameters are used for correcting the picture sequence to eliminate barrel distortion of the picture.
Optionally, before performing the modified matching on the recognition result of the book name and the book number text, the method further includes: and inquiring the information of the books on shelves on the corresponding bookshelf layer from the library database according to the number of the bookshelf and the number of the layer scanned by the current camera, and taking the information as a matched candidate book set.
Optionally, the process of performing optical flow prediction on the picture sequence includes:
the method comprises the steps of constructing an optical flow estimation model realized based on PWC-Net, wherein the core of the optical flow estimation model is that a multi-scale network is used for estimating an optical flow, the optical flow is calculated from a low-resolution picture, low-resolution optical flow data are input to a network with higher resolution step by step, a new high-resolution optical flow is calculated, and finally optical flow data of the size of an original image are obtained;
shooting videos of real books of a plurality of bookshelves in advance, training an optical flow estimation model on the shot videos in a self-supervision mode, and enabling the model to learn how pixel points of a previous picture move to pixel points of a next picture;
sequentially calculating optical flow data between two adjacent pictures from a first picture in the picture sequence; and correspondingly zooming the estimated optical flow data according to the ratio of the original image size to the model input size.
Optionally, the process of constructing and training the rotating instance target detection model includes:
synthesizing an instance segmentation data set based on a bookshelf picture which is actually shot;
training a target detection model Mask-RCNN based on rotated region proposals (Rotated Region Proposal) by using real pictures and the synthetic data set to obtain a robust and highly generalized model:
the method comprises the following steps that in the first stage, a one-stage model which is suitable for multi-scene and high in generalization performance and is used for full training is obtained based on synthetic data set training;
and in the second stage, fine tuning training is carried out based on a real picture, so that a model in the first stage can better adapt to a real scene, the problem of inconsistent distribution of training data caused by noise in a synthetic data set is corrected, and a high-precision target detection model fitting a real prediction scene is obtained.
Optionally, the process of synthesizing the segmented data set comprises:
based on the actually shot bookshelf pictures, the corresponding book is cut out of the picture by perspective transformation according to the annotated positions of the book spine and the book searching number; since the book searching number lies within the book spine area, after the book instance is cut out, the coordinates of the cut book searching number can be calculated from the relative position of the book searching number with respect to the spine;
shooting a book-free background picture of a real bookshelf, randomly extracting a plurality of natural background pictures from the ImageNet data set, and combining the two background pictures to obtain a candidate synthetic background picture library;
randomly extracting a picture from a background picture library, performing random size expansion on the picture, and randomly increasing or decreasing the contrast to simulate various scenes in a real environment;
and pasting a plurality of cut books in the background picture to synthesize a new picture, randomly generating an interval between the book and the previous book or the image boundary when pasting the books, and performing operations of randomly stretching, rotating, lighting change, adding random light spots and noise, adjusting contrast and definition and the like on the pasted books. Pasting the converted book instance to the calculated position of the background picture, and taking the position coordinate as a new label until the generated pasting position coordinate exceeds the boundary of the background picture. According to the process, a large number of synthetic book pictures and corresponding labels can be generated;
the marked coordinates of the book searching number are correspondingly converted according to the pasting position. In addition, the coordinate transformation of the book index can be separated from the spine transformation, and operations such as translation, coordinate transformation, illumination transformation and the like are carried out independently as long as the transformed coordinates are still in the range enclosed by the spine.
Optionally, the process of example partitioning the picture sequence includes:
and sending each picture in the picture sequence into the target detection model to obtain the bounding boxes of all book spine instances and book searching number instances that appear. Specifically, the output of the model is a rotated bounding box, which contains the coordinates of the original rectangle and its rotation angle;
according to the obtained position coordinates and the inclination angles of the rectangular frame, four-corner coordinates of the rotary surrounding frame are obtained through calculation, and the image area surrounded by the four-corner coordinates is the instance of the actually identified book spine or book searching number;
filtering the results of the book spine and the book searching number example bounding boxes with too small area, judging whether the coordinates of the four corners of the bounding boxes are all in the boundary of the picture, and if angular points which exceed the range exist, removing the corresponding example identification results;
for the book spine example surrounding frame, dividing the book spine example surrounding frame into a left line pair, a right line pair and an upper line pair and a lower line pair, only taking the left line pair and the right line pair, respectively extending the left line and the right line to the upper edge and the lower edge of the picture, and obtaining a polygon formed by four intersection points as a new example surrounding frame, which is equivalent to completing the left and right edge division of the book;
for the bounding box of the book searching number instance, the above operation is not needed, and the range of the original bounding box is directly taken as the segmentation result;
non-maximum suppression is respectively acted on the spine and the book number example segmentation boxes, and only the example with high confidence and less overlap with other examples is reserved as a final segmentation result. That is, there may be some intersection of the bounding boxes of the tilted book and the upright book, but their overlapping areas account for a smaller proportion of the area of the two bounding boxes, so both will be retained, ensuring that the tilted and upright books in the picture can be split simultaneously.
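A sketch of this non-maximum suppression on rotated instance polygons is given below. The shapely library is assumed for polygon intersection, and the overlap is measured against the smaller of the two polygons, so a tilted book whose box only slightly intersects an upright neighbour is kept, as described above.

```python
from shapely.geometry import Polygon

def rotated_nms(instances, overlap_thresh=0.5):
    """instances: list of dicts {"polygon": [(x, y), ...], "score": float}.
    Keep high-confidence instances whose overlap with an already kept
    instance stays below the threshold."""
    kept = []
    for inst in sorted(instances, key=lambda d: d["score"], reverse=True):
        poly = Polygon(inst["polygon"])
        suppressed = False
        for other in kept:
            inter = poly.intersection(other["_poly"]).area
            smaller = min(poly.area, other["_poly"].area)
            if smaller > 0 and inter / smaller > overlap_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append({**inst, "_poly": poly})
    return [{k: v for k, v in d.items() if k != "_poly"} for d in kept]
```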
Optionally, the text recognition process includes:
training a text recognition model based on an ICDAR 2019-LSVT Chinese text data set, wherein the text recognition model consists of a detection part model and a recognition part model: constructing a detection model based on DB, and adopting mixed precision training; the recognition model is composed of CRNN.
Recognizing the position coordinates of the text examples appearing in the picture sequence and the corresponding texts by using the trained text recognition model;
and for each picture in the picture sequence, distributing the text corresponding to the text box to each spine and book searching number example according to the segmented spine and book searching number positions and the identified text box position, wherein the text corresponding to each spine and book searching number box is the splicing result of the distributed text. This results in a position box for each spine, index number and corresponding recognized text, i.e., < picture, position coordinates, recognized text > pair. The book name recognition result comprises information of a book title, an author, a publishing company and the like, the book searching number recognition result comprises information of a book searching number English letter, a library name and the like, and Chinese characters in the book searching number need to be removed;
and calculating the spine instance to which the book index belongs according to the spine and book index coordinates obtained by the instance segmentation, and allocating the book index identification text to the spine instance to finally obtain a result pair of < the picture, the book position, the book name and the book index identification text >.
Optionally, the target tracking process includes:
and calculating the center of all the book spine and book searching number examples on the previous picture to which the book spine or book searching number example is moved to the area of the next picture through optical flow based on the optical flow data, so as to match repeated book examples appearing in the previous picture and the next picture, and aggregating the < picture, book position, book name and book searching number identification text > pairs belonging to the same book to obtain all the identification results of the same book in different frames, wherein the form of the identification results is changed into < example coordinate, book name and book searching number text identification sequence >. The coordinates of each book instance at this time are global coordinates in the bookshelf layer, and the starting or ending position of the bookshelf is the origin.
Optionally, the process of correcting the text recognition result of the matched book name and the book searching number according to the library inventory includes:
for the < example coordinate, book name and index number text recognition sequence > obtained by target tracking, selecting the recognition result with the highest confidence coefficient from the recognition sequences: firstly, carrying out exact matching on the book searching number in a database of the whole library (the matching is successful if all characters are the same), and taking the matched book information as an inventory result if any one of the book searching number identification sequences is successfully matched; otherwise, performing book name matching, performing word segmentation on any recognition result in the book name recognition sequence, screening candidate items with the same word after word segmentation in a local candidate set of the book at the current layer, then calculating the similarity score of the current book name and the candidate items by using TF-IDF, selecting the candidate book name with the highest score as a correction result, finally selecting the correction result with the highest occurrence frequency from the book name correction results in the queue as an inventory result, and if the occurrence frequency is the same, selecting the correction result with the highest score.
The local book candidate set at the current layer only belongs to the currently scanned bookshelf layer and is a subset of the data set of the whole library.
From the perspective of user experience, by adopting the example segmentation and target tracking technology, the global position coordinates of each book on the corresponding bookshelf layer can be obtained, a library manager can conveniently position wrong books, and a reader can find needed books more quickly and conveniently. On one hand, the library is checked by using a computer vision related technology, the environment is not required to be modified, and the library can be directly deployed; on the other hand, the computer vision recognition module realized based on the deep learning technology has high robustness and reliability, and can give an accurate recognition result in a short time. Furthermore, from the technical point of view, the rotating example target detection module can solve the problems of a large number of missed detections and low precision of the traditional example segmentation in the scenes of dense books and inclined books; the target tracking algorithm can ensure that the same book has the chances of being segmented and detected by the instance for many times and recognizing texts for many times, and allows the recognition results in different frames to be comprehensively considered, so that the highest confidence coefficient is found out as a final result, and the problems of book omission and inaccurate checking result are solved to a great extent.
Example 2
Based on the foregoing embodiment 1, this embodiment mainly introduces a specific book checking method based on computer vision.
FIG. 1 is a flow chart of a book inventory method based on computer vision in one embodiment. The method comprises the following steps:
step 1, recording a video through a camera, acquiring a video of a single-layer book of a bookshelf, and splitting the video frame by frame to obtain a whole-layer picture sequence;
step 2, constructing and training an optical flow estimation model, carrying out optical flow estimation on all adjacent picture pairs in the picture sequence to obtain the optical flow data corresponding to the picture sequence, and identifying the moving direction and distance of the picture pixels;
step 3, constructing and training a rotating example target detection model, and carrying out example positioning on each picture in the picture sequence to obtain position coordinates of the book and book searching number examples contained in each picture;
step 4, training a text recognition model, performing text recognition on the images of the books or the book searching number examples appearing in each picture to obtain recognition characters of each example, and distributing the book searching number recognition texts to the corresponding book examples;
step 5, executing target tracking according to the optical flow and the picture coordinate information of the book and book searching number examples, tracking the moving direction and distance of each book between adjacent pictures, finally obtaining the global coordinate of each book example on the current layer of the bookshelf, and merging the book name and book searching number identification results belonging to the same book in different pictures;
step 6, correcting the books and the book searching number recognition texts, and then matching based on the checking book database to obtain a book name and book searching number matching result list of the same book;
Step 7, from the book name and book searching number matching result list of the same book, calculating the most probable matching item according to the exact matching of the book searching number and the book name matching score; finally, the global position of the book and the matched book information are combined as the checking result.
The process of constructing and training the rotated target detection model in step 3 can be completed in advance, before step 1; the order of text recognition in step 4 and target tracking in step 5 can be exchanged, and the order of target tracking in step 5 and correction matching in step 6 can also be exchanged, as long as text recognition and correction matching are carried out on all appearance instances of the same book.
Fig. 2 shows the relative deployment positions of the camera and the bookshelf in one implementation. The specific deployment requirements are as follows:
The shooting position of the camera is fixed: the camera is placed on a mobile platform that moves parallel to, and at the same height as, the target bookshelf layer. Because the horizontal viewing angle of a common industrial camera is larger than its vertical viewing angle, the camera is mounted on the mobile platform in portrait orientation, and its distance from the bookshelf is adjusted according to the designed horizontal viewing angle so that the camera captures all book information on one complete bookshelf layer without capturing books on the layer above or below.
After the camera is fixed, its white balance is adjusted. Because an industrial camera has no autofocus function, the lens must be rotated for manual focusing so that the characters on the photographed books are clear and legible. In the embodiment provided by the application, a moving speed of 0.1 m/s is adopted; to ensure that the characters in the recorded video show no smear, a suitable exposure time must be chosen, since the longer the exposure time, the larger the smear displacement of the characters in the picture. Although a shorter exposure time reduces smear, it also reduces video brightness and makes the footage too dark, so a fill light is mounted at a suitable position on the mobile platform and its brightness adjusted, giving a final exposure time in the range of 2000–4000 μs; if the brightness still does not reach the expected level after the fill light is added, the gain value can be increased appropriately.
Each camera is individually calibrated for distortion correction: a checkerboard picture is printed, a sufficient number of checkerboard photographs are taken, and the intrinsic and extrinsic parameters of the camera are computed from these photographs using Zhang Zhengyou's checkerboard calibration method.
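As an illustration only, the checkerboard calibration described above could be implemented with OpenCV roughly as in the following sketch; the 9×6 inner-corner board size and the image directory are assumptions, not values specified by this application.

```python
import glob
import cv2
import numpy as np

# Calibrate one camera from checkerboard photographs (Zhang's method as implemented in OpenCV).
# The 9x6 inner-corner board size and the "calib/*.jpg" paths are illustrative assumptions.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        obj_points.append(objp)
        img_points.append(corners)

assert img_points, "no checkerboard corners found in the calibration images"
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```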
Taking the moving scan of one bookshelf layer as an example, a video of the books on the current layer is recorded and split frame by frame into a picture sequence in which adjacent pictures partially overlap; each picture in the sequence is distortion-corrected according to the calculated camera parameters and stored as the shooting result of the current layer. The corresponding book information is obtained from the inventory library database according to the shelf number of the bookshelf currently being checked, and forms the acquisition data together with the picture sequence.
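A minimal sketch of the frame-by-frame splitting and distortion correction, assuming the intrinsics K and distortion coefficients dist obtained from the calibration above; sampling every frame (rather than a stride) is an assumption.

```python
import cv2

def split_and_undistort(video_path, K, dist):
    """Read a shelf-layer video frame by frame and apply distortion correction.

    K and dist are the camera matrix and distortion coefficients from calibration.
    """
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.undistort(frame, K, dist))
    cap.release()
    return frames
```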
Fig. 3 shows the network structure of the optical flow estimation model PWC-Net, and fig. 4 shows the result of optical flow estimation on a picture sequence in one implementation. The specific process is as follows:
constructing an optical flow estimation model PWC-Net, shooting book videos of a plurality of bookshelves, and training the optical flow estimation model based on the shot videos in a self-supervision mode;
Optical flow estimation is performed on the picture sequence to obtain optical flow data, which describe, for each pixel in the previous picture, the pixel it corresponds to in the next picture after moving; fig. 4 shows the optical flow estimation result for some points of the original picture.
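The pairwise estimation over the picture sequence, together with rescaling of the flow back to the original resolution, might look roughly like the following sketch; `flow_model` is a placeholder for a trained PWC-Net-style network, and its call signature is an assumption.

```python
import cv2

def pairwise_flow(frames, flow_model, model_size=(512, 384)):
    """Estimate optical flow for each adjacent frame pair.

    `flow_model` stands in for a trained PWC-Net-style network that maps two
    images of size `model_size` to an (H, W, 2) flow field; its interface is
    an assumption for illustration only.
    """
    flows = []
    h, w = frames[0].shape[:2]
    sx, sy = w / model_size[0], h / model_size[1]
    for prev, nxt in zip(frames, frames[1:]):
        a = cv2.resize(prev, model_size)
        b = cv2.resize(nxt, model_size)
        flow = flow_model(a, b)              # (model_h, model_w, 2) displacement field
        flow = cv2.resize(flow, (w, h))      # back to original resolution
        flow[..., 0] *= sx                   # rescale displacements to original pixel units
        flow[..., 1] *= sy
        flows.append(flow)
    return flows
```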
FIG. 5 is a flow diagram of a process for constructing and training a rotating instance object detection model in one implementation, including the following steps.
Step 301, selecting a bookshelf in a certain area of a certain library, deploying a camera, shooting a plurality of groups of real book pictures in different environments, and marking the coordinates of a book spine and a bounding box of a book searching number on each picture.
Step 302, taking pictures of a plurality of empty bookshelves and collecting a plurality of background pictures from the ImageNet data set as a background picture library of the synthetic segmentation data set.
Step 303, cutting book examples from the shot real book pictures, randomly extracting a plurality of examples from the book examples, and performing operations such as random stretching, rotation, illumination change, random light spot and noise addition, contrast adjustment, definition adjustment and the like; a background is randomly extracted from a library of background pictures, subjected to random size expansion, contrast transformation, and then the selected instance is pasted onto the background at random separation distances.
And step 304, generating synthetic labeling data according to the coordinate information of the book spine and the book searching number on the new background, and forming a synthetic data set together with the synthetic pictures.
Step 305, training a Mask-RCNN model for rotated instance detection on the synthetic data set to obtain a one-stage model that is suitable for multiple scenes and has strong generalization.
Step 306, fine-tuning the one-stage model on the real pictures so that it adapts better to the real scene and the inconsistent data distribution caused by noise in the synthetic data set is corrected, finally obtaining a high-precision target detection model that fits the real prediction scene.
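A simplified sketch of the paste-onto-background synthesis in steps 302 to 304 follows; the augmentation ranges, spacing and data layout are illustrative assumptions, and the random rotation described above is omitted for brevity.

```python
import random
import cv2

def synthesize_sample(spine_crops, backgrounds, n_spines=15):
    """Paste randomly augmented spine crops onto a random background and return
    the composite image plus bounding-box annotations.

    Augmentation ranges and spacing are illustrative; rotation is omitted in this
    simplified sketch, so all annotated angles are 0.
    """
    bg = random.choice(backgrounds).copy()
    bg_h, bg_w = bg.shape[:2]
    annotations, x = [], random.randint(0, 30)
    for spine in random.sample(spine_crops, min(n_spines, len(spine_crops))):
        h, w = spine.shape[:2]
        scale = random.uniform(0.8, 1.2)
        spine = cv2.resize(spine, (int(w * scale), int(h * scale)))
        spine = cv2.convertScaleAbs(spine,
                                    alpha=random.uniform(0.7, 1.3),   # contrast jitter
                                    beta=random.uniform(-20, 20))     # brightness jitter
        h, w = spine.shape[:2]
        if h > bg_h:
            continue
        if x + w >= bg_w:
            break
        y = random.randint(0, bg_h - h)
        bg[y:y + h, x:x + w] = spine
        annotations.append({"class": "spine", "box": (x, y, w, h, 0.0)})  # (x, y, w, h, angle)
        x += w + random.randint(2, 20)                                    # random separation
    return bg, annotations
```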
Fig. 6 shows the structure of the Mask-RCNN model used for rotated target detection; its main difference from the original Mask-RCNN is that the RPN network outputs candidate bounding boxes with rotation angles. Fig. 7 shows the effect of instance segmentation on a picture sequence in one implementation: fig. 7(a) is the original picture to be detected, fig. 7(b) shows the rotated bounding boxes obtained by target detection, fig. 7(c) shows the left and right edges of the spine bounding boxes extended, and fig. 7(d) shows the instance segmentation result.
The Mask-RCNN model for rotated instance detection consists of a rotated candidate box generation network and an image classification network. The rotated candidate box generation network processes the input picture and outputs rotated bounding boxes that may contain a book spine or book searching number instance; the images inside the rotated bounding boxes are then fed to the image classification network, which identifies the content as a spine instance, a book searching number instance, or background; finally, the bounding box coordinates and the corresponding instance category are taken as the detection result of the network. The process of spine and book searching number instance segmentation for fig. 7(a) is as follows:
Fig. 7(a) is input into the trained two-stage rotated instance target detection model to obtain the rotated bounding boxes of the preliminarily identified book and book searching number instances; bounding boxes with too small an area are filtered out and bounding boxes exceeding the picture boundary are removed, yielding the rotated bounding boxes shown in fig. 7(b);
For two-side segmentation of the spine, the left and right edges of each identified spine bounding box are extended to the upper and lower edges of the picture to obtain four intersection points, which form a new spine segmentation box. Since the book searching number bounding box only serves to distinguish the book searching number area from the spine instance, no extension is required for it. The effect after extending the left and right edges of the spine bounding boxes is shown in fig. 7(c);
The resulting instance segmentation boxes may overlap, and the overlap is especially severe when the generalization of the target detection model is not ideal (for example, in a new library whose environment differs widely but for which the model has not yet been fine-tuned). Non-maximum suppression is therefore applied separately to the segmentation boxes of the spine and book searching number instances, and segmentation boxes that overlap other results but have low confidence are removed, so that the segmentation result of each instance is unique and stable. Non-maximum suppression slightly reduces the number of segmentation results but raises segmentation precision; applying it to fig. 7(c) yields fig. 7(d). Note that because the overlap between the segmentation boxes of a tilted book and an upright book is minimal, the segmentation boxes of tilted books are retained and the recall rate of instance segmentation is unchanged.
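The extension of the left and right edges of a rotated spine box to the picture borders (fig. 7(c)) can be expressed with elementary line–border intersections, for example as in the following sketch; the corner layout and the assumption that spines are closer to vertical than horizontal are illustrative.

```python
import numpy as np

def extend_spine_box(corners, img_h):
    """Extend the left and right edges of a rotated spine box to the top and
    bottom borders of the picture and return the four intersection points.

    `corners` is a 4x2 array of box corners; edges are paired by x position,
    which assumes the spine is closer to vertical than horizontal.
    """
    pts = np.asarray(corners, dtype=float)
    order = np.argsort(pts[:, 0])
    left, right = pts[order[:2]], pts[order[2:]]

    def hit_borders(p, q):
        # Intersect the (possibly tilted) line through p and q with y = 0 and y = img_h.
        if abs(q[1] - p[1]) < 1e-6:            # degenerate: nearly horizontal edge
            return [(p[0], 0.0), (p[0], float(img_h))]
        t0 = (0.0 - p[1]) / (q[1] - p[1])
        t1 = (img_h - p[1]) / (q[1] - p[1])
        return [(p[0] + t0 * (q[0] - p[0]), 0.0),
                (p[0] + t1 * (q[0] - p[0]), float(img_h))]

    return hit_borders(*left) + hit_borders(*right)
```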
Fig. 8 shows the effect of text recognition on a picture sequence in one implementation. Fig. 8(a) shows the text recognition result on the original picture; figs. 8(b), 8(c) and 8(d) show three of the book instances segmented from the original picture, each corresponding to one book together with the recognized book name and book searching number text (the recognized book searching number is shown in parentheses).
A text recognition model is constructed from DB and CRNN and trained on the ICDAR 2019-LSVT Chinese text data set. Running the trained model on the original picture gives a result such as fig. 8(a), but this result only contains the recognized texts on the picture and the position of each text, without indicating to which book or book searching number instance each text belongs. Therefore, according to the instance segmentation box coordinates of the current picture, it is determined to which book or book searching number instance the center of each recognized text box belongs, and the text is assigned to that instance. Finally, for the text boxes assigned to the same instance, the characters inside them are concatenated to form the recognized book name or book searching number text of that book.
The spine instance to which each book searching number belongs is then determined from the spine and book searching number coordinates obtained by instance segmentation, and the book searching number recognition text is assigned to that spine instance, finally yielding <picture, book position, book name and book searching number recognition text> result pairs.
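The assignment of recognized text boxes to spine and book searching number instances by the text-box center, as described above, might be sketched as follows; the data layout of instances and text boxes is an assumption for illustration.

```python
import cv2
import numpy as np

def assign_texts_to_instances(instances, text_boxes):
    """Assign each recognized text box to the spine / book searching number instance
    whose segmentation polygon contains the text-box center, then concatenate the
    assigned pieces into one string per instance.

    Assumed layout: `instances` is a list of dicts with a 4x2 'polygon';
    `text_boxes` is a list of (4x2 polygon, text) pairs already in reading order.
    """
    texts = ["" for _ in instances]
    for poly, text in text_boxes:
        cx, cy = np.asarray(poly, dtype=float).mean(axis=0)
        for i, inst in enumerate(instances):
            contour = np.asarray(inst["polygon"], dtype=np.float32)
            if cv2.pointPolygonTest(contour, (float(cx), float(cy)), False) >= 0:
                texts[i] += text
                break
    return texts
```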
Fig. 9 shows the effect of target tracking on a picture sequence in one implementation. Figs. 9(a) and 9(b) are two adjacent pictures randomly extracted from the picture sequence, fig. 9(a) being the left picture and fig. 9(b) the right picture. The process of book target tracking for figs. 9(a) and 9(b) is as follows:
step 501, extracting the optical flow between fig. 9(a) and fig. 9(b) from the optical flow data, obtaining the position coordinate information of each spine instance from the instance segmentation result of the two pictures, and obtaining the recognition text and the book searching number corresponding to each spine instance from the text recognition result.
Step 502, calculating coordinates of a center point according to coordinates of a bounding box of the spine instance in fig. 9(a), as a position representation of each spine instance in fig. 9 (a).
Step 503, for the center point of each spine instance in fig. 9(a), the point in fig. 9(b) to which it has moved is computed from the optical flow; the moved point is called the new center point. The spine instance bounding box in fig. 9(b) to which each new center point belongs is then determined, giving the correspondence between the spine instances in fig. 9(a) and those in fig. 9(b).
Step 504, the segmentation results and text recognition results of the matched spines in fig. 9(a) and fig. 9(b) are merged; that is, the <picture, book position, book name and book searching number recognition text> pairs belonging to the same book are aggregated to obtain several coordinates and several text recognition results for the same book instance, and the absolute position of the book in the global coordinate system formed by the two pictures is calculated, giving a result of the form <instance coordinates, book name and book searching number text recognition sequence>.
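Steps 501 to 503, that is, moving each spine center along the optical flow and finding the containing box in the next picture, might be sketched as follows; the instance layout mirrors the earlier sketches and is an assumption.

```python
import cv2
import numpy as np

def match_spines_by_flow(prev_instances, next_instances, flow):
    """Match spine instances between two adjacent frames.

    Each previous-frame spine center is displaced by the optical flow at that
    point, and the next-frame polygon containing the moved center is searched.
    Returns a list of (prev_index, next_index) pairs; the dict layout with a
    4x2 'polygon' per instance is an illustrative assumption.
    """
    pairs = []
    for i, inst in enumerate(prev_instances):
        cx, cy = np.asarray(inst["polygon"], dtype=float).mean(axis=0)
        dx, dy = flow[int(round(cy)), int(round(cx))]     # displacement at the center
        moved = (float(cx + dx), float(cy + dy))          # the "new center point"
        for j, nxt in enumerate(next_instances):
            contour = np.asarray(nxt["polygon"], dtype=np.float32)
            if cv2.pointPolygonTest(contour, moved, False) >= 0:
                pairs.append((i, j))
                break
    return pairs
```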
It can be seen that the text recognition result in fig. 8 contains some errors, and additional information such as the publisher and the author may be mixed into the recognized book name; the recognition errors are most severe when the book name contains complicated letters, Chinese characters or special fonts. Target tracking only merges these fuzzily recognized texts, so they cannot be used directly as the final inventory result, and the recognition results merged by target tracking still require correction and matching of the book name and book searching number.
For the book name and book searching number text recognition sequences of the same book, exact matching of the book searching number sequence is first performed directly in the whole-library data set; if any recognized book searching number is matched successfully, the information of the corresponding book is taken as the recognition result of the current instance. Otherwise, when the book searching number matching fails, word segmentation is applied to the book name recognition sequence and to the book names in the current-layer candidate set; for each recognized book name in the recognition sequence, the candidate items sharing a word with it after segmentation are found, the similarity score between the book name to be matched and each candidate book name is calculated with TF-IDF, and the item with the highest score is selected as the matching result of the current book name; finally, the result matched the most times in the corrected book name matching sequence is output as the inventory result. At last, the global coordinate information of the book instance and the matched book information are aggregated, and the checking result is presented to the user as a graphical interface or a report.
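A hedged sketch of this correction and matching logic is given below; the word-overlap pre-screening step is omitted, and character-level TF-IDF via scikit-learn stands in for the word-segmentation based scoring, so this is a simplification under stated assumptions rather than the exact method of the application.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_book(call_number_seq, title_seq, library_db, layer_candidates):
    """Pick the inventory result for one tracked book.

    `library_db` maps book searching numbers to book records; `layer_candidates`
    is the list of candidate titles on the current shelf layer. Character-level
    TF-IDF is a simplifying assumption standing in for word-segmented matching.
    """
    # 1. Exact book searching number match over the whole-library database.
    for cn in call_number_seq:
        if cn in library_db:
            return library_db[cn]

    # 2. Fall back to TF-IDF title matching against the current layer's candidates.
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    cand_matrix = vec.fit_transform(layer_candidates)
    votes, best_score = Counter(), {}
    for title in title_seq:
        sims = cosine_similarity(vec.transform([title]), cand_matrix)[0]
        j = int(sims.argmax())
        cand = layer_candidates[j]
        votes[cand] += 1
        best_score[cand] = max(best_score.get(cand, 0.0), float(sims[j]))

    # Most frequent corrected title wins; ties broken by the higher similarity score.
    return max(votes, key=lambda t: (votes[t], best_score[t]))
```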
Therefore, in a specific embodiment of the application, on the one hand, deep learning techniques from computer vision are introduced into the library inventory field, where data set acquisition is difficult: the rotated instance target detection model and the text recognition model are built and trained with synthesized data sets, and the optical flow estimation model can be trained in a self-supervised manner, so that good results are obtained without large-scale source data acquisition and manual labeling, giving the method good feasibility and high precision. On the other hand, for missed or erroneous recognitions caused by environmental changes or insufficient model generalization, the scheme uses a target tracking algorithm together with recognition on every frame of the video, which increases the number of chances for each book to be recognized, alleviates to a certain extent the recognition errors caused by viewing angle, and improves recognition accuracy. In addition, the scheme can accurately locate the global coordinates of the books based on the optical flow and the instance segmentation results, so that a librarian can quickly tidy misplaced books and a reader can quickly find the book to be borrowed.
The above description is only a preferred embodiment of the present invention and is not intended to limit its scope. Those skilled in the art may make various variations, modifications, substitutions, integrations and parameter changes to the embodiments without departing from the principle and spirit of the invention, and such changes, including conventional substitutions that realize the same function, fall within the scope of the invention.

Claims (10)

1. A book checking method based on computer vision is characterized by comprising the following steps:
s1, recording a video through a camera, acquiring a video of a single-layer bookshelf, and splitting the video frame by frame to obtain a picture sequence;
s2, performing optical flow estimation on the picture sequence to obtain optical flow data, and further obtaining the moving direction and distance of each pixel between two adjacent frames;
s3, constructing and training a rotary example target detection model, carrying out example target detection on the picture sequence, further carrying out example segmentation to obtain the positions of the spine and the book searching number of the current picture, and distributing the book searching number to the corresponding book according to the coordinates;
step S4, training a text recognition model, and performing text recognition on the images of the book spine and book searching number areas recognized in the picture sequence to obtain book names and book searching number texts contained in each picture and the positions of the texts;
step S5, based on the position coordinates and corresponding texts of the books and the book searching numbers identified in the picture sequence, obtaining the corresponding relation between the book instances contained in every two pictures according to the optical flow data, carrying out target tracking, and tracking the appearance of the same book or the book searching number instance in all the pictures, thereby obtaining the specific position of each book in the single-layer bookshelf and all the identification results of the same book in different pictures;
and step S6, correcting and matching all the identification texts of each book or book index number example based on the book database candidate set to obtain a real book name result, and outputting the book example position and the matched book information as a checking result.
2. The computer vision-based book inventory method of claim 1, wherein in the step S2, the method for performing optical flow estimation on the picture sequence comprises:
step S21, constructing an optical flow estimation model realized based on PWC-Net, wherein the core is to use a multi-scale network to estimate the optical flow, calculate the optical flow from a low-resolution picture, input low-resolution optical flow data to a network with higher resolution step by step, calculate a new high-resolution optical flow, and finally obtain the optical flow data of the original picture size;
step S22, shooting videos of a plurality of bookshelf real books in advance, training an optical flow estimation model on the shot videos in a self-supervision mode, and enabling the model to learn how pixel points of a previous picture move to pixel points of a next picture;
step S23, starting from the first picture in the picture sequence, sequentially calculating optical flow data between two adjacent pictures; and correspondingly zooming the estimated optical flow data according to the ratio of the original image size to the model input size.
3. The computer vision-based book inventory method of claim 1, wherein in the step S3, the method for constructing and training the rotation instance target detection model comprises:
s301, synthesizing an instance segmentation data set based on a bookshelf picture shot really;
step S302, training a target detection model Mask-RCNN based on a rotation candidate frame by using a real picture and a synthetic data set to obtain a robust and highly generalized model:
step S303, in the first stage, a one-stage model suitable for multi-scene and high-generalization full training is obtained based on synthetic data set training;
and step S304, in the second stage, fine tuning training is carried out based on the real picture, so that the model in the first stage can better adapt to the real scene, the problem of inconsistent distribution of training data caused by noise in the synthetic data set is corrected, and the high-precision target detection model fitting the real prediction scene is obtained.
4. The computer vision-based book inventory method of claim 1, wherein in the step S3, the method for instance dividing the picture sequence comprises:
s311, sending each picture in the picture sequence into a target detection model to obtain all the appeared enclosing frames of the spine instance and the book searching number instance; specifically, the output of the model is a rotating bounding box which contains the coordinates of the original rectangle and the rotating angle of the original rectangle;
step S312, calculating to obtain four-corner coordinates of a rotary surrounding frame according to the obtained position coordinates and the inclination angle of the rectangular frame, wherein the surrounded image area is the really recognized example of the spine or the book searching number;
s313, filtering the results of the book spine and the book searching number example bounding box with too small area, judging whether the coordinates of four corners of the bounding box are all in the boundary of the picture, and if angular points beyond the range exist, removing the corresponding example identification results;
step S314, grouping the four edges of the spine instance bounding box into a left-right pair and an upper-lower pair, keeping only the left and right edges and extending each of them to the upper and lower edges of the picture, the polygon formed by the four intersection points being taken as the new instance bounding box, which is equivalent to completing the segmentation of the left and right sides of the book;
step S315, the above operation is not needed for the bounding box of the book searching number example, and the range of the original bounding box is directly taken as a segmentation result;
and step S316, respectively acting non-maximum value inhibition on the spine and the index number example segmentation boxes, and only keeping the example which has high confidence and less overlap with other examples as a final segmentation result.
5. The computer vision-based book inventory method of claim 1, wherein in the step S4, the method for training the text recognition model and performing text recognition comprises:
step S401, training a text recognition model based on an ICDAR 2019-LSVT Chinese text data set, wherein the text recognition model consists of a detection part model and a recognition part model: constructing a detection model based on DB, and adopting mixed precision training; the recognition model is composed of CRNN;
s402, recognizing position coordinates of a text instance appearing in a picture sequence and a corresponding text by using a trained text recognition model;
step S403, for each picture in the picture sequence, distributing the text corresponding to the text box to each spine and book searching number example according to the segmented spine and book searching number positions and the identified text box position, wherein the text corresponding to each spine and book searching number box is the splicing result of the distributed text; thus obtaining the position frame of each book spine and each book searching number and the corresponding identified text, namely the pair of < picture, position coordinate and identification text >; the book name recognition result comprises information of a book title, an author and a publishing company, the book searching number recognition result comprises information of a book searching number English letter and a library name, and Chinese characters in the book searching number need to be removed;
step S404, calculating the spine instance to which the book searching number belongs according to the spine and book searching number coordinates obtained by instance segmentation, and allocating the book searching number identification text to the spine instance to finally obtain a result pair of < the picture, the book position, the book name and the book searching number identification text >.
6. The computer vision-based book inventory method of claim 1, wherein in the step S5, the target tracking process comprises: based on the optical flow data, calculating, for the center of every spine and book searching number instance in the previous picture, the spine or book searching number instance area in the next picture to which it moves according to the optical flow, so as to match the book instances appearing repeatedly in the previous and next pictures, and aggregating the <picture, book position, book name and book searching number recognition text> pairs belonging to the same book to obtain all the recognition results of the same book in different frames, in the form of <instance coordinates, book name and book searching number text recognition sequence>; the coordinates of each book instance at this time are global coordinates within the bookshelf layer, with the starting or ending position of the bookshelf as the origin.
7. The computer vision based book inventory method of claim 1, wherein the step S6, the method for correcting and matching based on the book database candidate set includes:
for the <instance coordinates, book name and book searching number text recognition sequence> obtained by target tracking, selecting the recognition result with the highest confidence from the recognition sequences: firstly, carrying out exact matching of the book searching number in the database of the whole library (a match succeeds only when all characters are identical), and taking the matched book information as the inventory result if any element of the book searching number recognition sequence is matched successfully; otherwise, performing book name matching: performing word segmentation on each recognition result in the book name recognition sequence, screening candidate items sharing a word after segmentation from the local candidate set of books on the current layer, calculating the similarity between the candidate items and the currently recognized book name with TF-IDF, and selecting the book name with the highest similarity that exceeds a set similarity threshold as the matching result of the current book name; finally, selecting the result matched the most times as the inventory result, and if the occurrence times are the same, selecting the correction result with the highest score.
8. The computer vision-based book inventory method of claim 3, wherein in the step S301, the synthesized instance division data set comprises a real book picture and a synthesized book picture; the synthesis method comprises the following steps:
step S3011, arranging a camera under a scene of a real bookshelf, and moving the camera at a constant speed to shoot book pictures on different layers of the bookshelf, so as to obtain a plurality of pictures of the real bookshelf;
step S3012, synthesizing and enhancing an example segmentation data set based on the shot real bookshelf picture: cutting out the book spine of each book from the real picture, and performing data enhancement on the book spine by random rotation, random deformation, random illumination change, random white noise increase and random light spot addition to obtain a plurality of new book spine examples;
step S3013, constructing a background picture library based on the real empty bookshelf background picture and the background picture randomly selected from the ImageNet data set, randomly extracting a certain number of backgrounds from the background picture library, randomly pasting the synthesized new spine instance to random positions in the backgrounds, obtaining the coordinates of the four corners of the new spine instance under the new background, and automatically adding the coordinates into the synthesized labeling file.
9. The computer vision-based book checking method of claim 3, wherein in step S6, before the recognition results of the title text and the index text are matched, the method further comprises: and inquiring the information of the books on shelves on the corresponding bookshelf layer from the library database according to the number of the bookshelf and the number of the layer scanned by the current camera, and taking the information as a matched candidate book set.
10. The computer vision-based book inventory method of claim 1, wherein in the step S1, the video frame-by-frame splitting method comprises: reading the video frame by frame, carrying out distortion correction and image rotation operation on each frame of picture, and obtaining the distortion-free bookshelf front view.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination