CN114638870A - Indoor scene monocular image depth estimation method based on deep learning - Google Patents


Info

Publication number
CN114638870A
CN114638870A
Authority
CN
China
Prior art keywords
depth
prediction
interval
pixel
tensor
Prior art date
Legal status
Pending
Application number
CN202210251724.8A
Other languages
Chinese (zh)
Inventor
刘佳涛
张亚萍
Current Assignee
Yunnan Normal University
Original Assignee
Yunnan Normal University
Priority date
Filing date
Publication date
Application filed by Yunnan Normal University filed Critical Yunnan Normal University
Priority to CN202210251724.8A priority Critical patent/CN114638870A/en
Publication of CN114638870A publication Critical patent/CN114638870A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/251 - Fusion techniques of input or preprocessed data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks


Abstract

The invention relates to an indoor scene monocular image depth estimation method based on deep learning, and belongs to the technical field of three-dimensional scene perception. The method first introduces the neural network EfficientNet-b7, pre-trained for image classification on ImageNet, to construct an encoder, and adds SENet-based residual connections together with convolution and resampling operations at the different stages of the encoder. On the basis of the idea of depth interval division, a loss function that attends to the image from the global level to the local level is then constructed and applied to the predictions of the different stages. Finally, a Transformer structure based on the self-attention mechanism fuses the depth information predicted at the different stages and outputs the scene depth prediction result. By designing a novel, efficient and lightweight decoder, the invention replaces the traditional serial fusion of the features from the different encoder stages with parallel fusion, thereby improving the model's ability to jointly exploit the global and local information of the image during depth estimation.

Description

Indoor scene monocular image depth estimation method based on deep learning
Technical Field
The invention relates to an indoor scene monocular image depth estimation method based on deep learning, and belongs to the technical field of three-dimensional scene perception.
Background
Depth estimation from two-dimensional RGB images has a wide range of applications, for example three-dimensional reconstruction, scene understanding, autonomous driving and robotics. With the advent of large-scale data sets and the improvement of hardware computing power, recent research on image depth estimation has focused mainly on two-dimensional to three-dimensional reconstruction using deep learning and convolutional neural networks. Depth estimation from a single RGB image is an ill-posed problem, because one picture can correspond to an unlimited number of three-dimensional scenes. Furthermore, insufficient scene coverage, translucent or reflective materials and similar factors can lead to ambiguity, where the geometry cannot be deduced from the appearance alone.
Deep-learning-based monocular depth estimation began with the two-scale network proposed by Eigen et al., after which many effective methods based on convolutional neural networks were proposed. The document "Laina et al., Deeper Depth Prediction with Fully Convolutional Residual Networks" uses a fully convolutional residual network based on ResNet-50 and replaces the fully connected layer with a series of upsampling blocks. The document "Alhashim et al., High Quality Monocular Depth Estimation via Transfer Learning" introduces skip connections into a simple encoder-decoder architecture and trains the model with transfer learning. The document "Lee et al., From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation" proposes replacing the standard upsampling layers with local planar guidance layers that guide the features to full resolution in the decoder. The document "Fu et al., Deep Ordinal Regression Network for Monocular Depth Estimation" found that converting the depth regression task into a classification task can improve performance. The document "Bhat et al., AdaBins: Depth Estimation Using Adaptive Bins" designs an AdaBins module that divides the depth range into 256 intervals, takes the center value of each interval as the depth value of the pixels falling in that interval, and computes the final depth of a pixel as a linear combination of the interval centers. The document "Ranftl et al., Vision Transformers for Dense Prediction" applies the Vision Transformer to monocular depth estimation and obtains a highly accurate depth estimation model by training on a large data set.
Although great progress has been made in deep-learning-based depth estimation of indoor monocular images, some problems remain: 1) in most encoder-decoder structures used by deep neural networks, the encoder suffers from insufficient feature extraction and loss of spatial information in the feature extraction stage due to operations such as explicit downsampling, so the network easily loses the fine-grained information of the image; 2) the actual scene structure faced by indoor monocular depth estimation is usually complex, and if the global and local relations in the scene are not considered effectively, the accuracy of depth estimation is low; 3) although the Vision Transformer can greatly alleviate the loss of image granularity, its models have a large number of parameters and require a large amount of labeled data to drive training.
Disclosure of Invention
The invention aims to solve the technical problem of providing an indoor scene monocular image depth estimation method based on deep learning. To address the tendency of convolutional monocular depth estimation encoders to lose fine-grained image information in their deep layers, the method makes comprehensive use of the features of multiple encoding stages. In the decoding network, to address the difficulty traditional networks have in effectively considering the global and local relations of complex scenes, a decoder is designed that predicts global-to-local depth information in parallel branches and then adjusts and fuses them, and a loss function is designed accordingly, thereby solving the above problems.
The technical scheme of the invention is as follows: a depth estimation method for monocular images of indoor scenes based on deep learning specifically comprises the following steps:
Step 1: Construct an encoder by introducing the neural network EfficientNet-b7 pre-trained for image classification on ImageNet.
Step 2: Introduce SENet-based residual connections and convolution and resampling operations at different stages of the encoder to obtain predictions at different stages.
Step 3: Based on a depth interval division method, construct a loss function that attends to the image from the global level to the local level and apply it to the predictions of the different stages.
Step 4: Fuse the depth information predicted at the different stages with a Transformer structure based on the self-attention mechanism and output the scene depth prediction result.
The Step1 is specifically as follows: downloading from the Internet an EfficientNet-b7 network pre-trained on ImageNet and obtaining the feature vectors encoded at its 3rd, 5th, 6th, 8th and 12th blocks, the resolutions of these feature vectors being, respectively, fixed fractions of the input image resolution.
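As a non-limiting illustration of Step1, the multi-stage feature extraction can be sketched in PyTorch as follows. The sketch assumes the timm library, whose features_only interface exposes one feature map per downsampling stage; these stage outputs stand in for the block-3/5/6/8/12 features described above, and the 480 × 640 input size is only an example.

```python
# Minimal sketch of the multi-stage feature extraction in Step 1, assuming the
# timm library; timm's features_only API returns one feature map per
# down-sampling stage, standing in for the patent's blocks 3/5/6/8/12.
import timm
import torch

# EfficientNet-b7 pre-trained on ImageNet, exposing intermediate stage outputs.
encoder = timm.create_model(
    "tf_efficientnet_b7", pretrained=True, features_only=True
)

x = torch.randn(1, 3, 480, 640)   # an example RGB input (N, C, H, W)
features = encoder(x)             # list of 5 feature maps, from shallow to deep
for f in features:
    print(f.shape)                # resolutions are fixed fractions of 480 x 640
```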
The Step2 is specifically as follows:
Step2.1: The feature vector encoded by the 3rd block is input into 4 SENet-based residual blocks, the feature vector encoded by the 5th block into 3 SENet-based residual blocks, the feature vector encoded by the 6th block into 2 SENet-based residual blocks, and the feature vector encoded by the 8th block into 1 SENet-based residual block.
Step2.2: A channel attention layer is added after the last residual block of each stage, together with a residual connection from the encoder to this layer.
Step2.3: The features of each stage are passed step by step through 2× upsampling and convolution layers to obtain features for five stages, all with 30 channels and a resolution of half the input resolution.
Step2.4: The features of the 1st, 2nd and 5th stages are added and fused pixel by pixel, the features of the 2nd, 3rd and 5th stages are added and fused pixel by pixel, the features of the 1st, 3rd and 4th stages are added and fused pixel by pixel, and the features of the 1st, 4th and 5th stages are added and fused pixel by pixel; each fused result then passes through a convolution layer to obtain four predictions, which are labeled prediction 1 to prediction 4 from the shallow layers of the neural network to the deep ones.
With this choice of fusion groups, the predictions go from local to global as the network goes from shallow to deep: the first-stage features serve as a reference for the latter two predictions and the fifth-stage features serve as a reference for the first two predictions, so that the global and the local predictions can each be carried out more effectively; during fusion, the rich spatial information contained in the shallow features improves the accuracy of the fused result. Compared with the traditional approach of serially fusing the inputs in a single step and then producing one output, this improves both efficiency and accuracy.
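As a non-limiting illustration of Step2, the following PyTorch sketch shows an SENet-style residual block and the parallel pixel-wise fusion of five 30-channel, half-resolution stage features into predictions 1 to 4. The 3 × 3 kernel sizes, the squeeze-and-excitation reduction ratio of 8 and the 240 × 320 example resolution are assumptions rather than values fixed by the invention.

```python
# Sketch of the SENet-based residual block and the parallel pixel-wise fusion
# of Step 2 (assumed: reduction ratio 8 in the SE block, 3x3 convolutions).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class SEResidualBlock(nn.Module):
    """Residual block with an SE attention layer on its output."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.body(x))

def fuse(stages, head):
    """Pixel-wise addition of same-shape stage features followed by a conv head."""
    return head(torch.stack(stages, dim=0).sum(dim=0))

# Five decoder stages, each already reduced to 30 channels at half input resolution.
f1, f2, f3, f4, f5 = [torch.randn(1, 30, 240, 320) for _ in range(5)]
heads = [nn.Conv2d(30, 1, 3, padding=1) for _ in range(4)]

pred1 = fuse([f1, f2, f5], heads[0])   # shallow: more local
pred2 = fuse([f2, f3, f5], heads[1])
pred3 = fuse([f1, f3, f4], heads[2])
pred4 = fuse([f1, f4, f5], heads[3])   # deep: more global
```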
The Step3 specifically comprises the following steps:
Step3.1: The maximum depth d_max and the minimum depth d_min are obtained from the ground-truth depth map.
Step3.2: The depth interval [d_min, d_max] is divided equally into 10 intervals; the length of one interval is calculated as:
len = (d_max - d_min) / 10
Among these 10 intervals, the depth value range of the i-th interval is:
[d_min + (i - 1) × len, d_min + i × len]
Step3.3: A histogram of the ground-truth depth map is computed to find, among the 10 intervals, the interval that accounts for the largest share of the scene depth; this interval contains the most global information, and correspondingly the intervals with smaller shares contain more local information.
Step3.4: The 10 depth intervals are sorted in descending order of the share they occupy; the mean squared error of prediction 1 from Step2.4 is computed over the 5th to 10th intervals, that of prediction 2 over the 4th to 8th intervals, that of prediction 3 over the 2nd to 4th intervals, and that of prediction 4 over the 1st and 2nd intervals.
Step3.5: The four error terms are combined into one loss term, which constrains predictions 1 to 4 to attend to the local through the global during model training. The calculation formula is:
L = Σ_{i=1..4} λ_i · (1/n_i) · Σ_{p_i} (d_{p_i} - d̂_{p_i})²
where λ1 = 0.5, λ2 = λ3 = 0.6, λ4 = 1, n_i is the total number of pixels of the ground-truth depth map after the interval mask for prediction i, and d_{p_i} and d̂_{p_i} are the depth values of pixel p_i in the ground-truth depth map and in prediction i, respectively.
The depth-interval-division method used in Step3 yields a dedicated loss function that serves as a supplementary constraint for the staged predictions. Applying it to the predictions of the different stages gradually separates global and local prediction into different stages during training, so that each stage of the model attends to the depth intervals that it can predict most accurately.
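A non-limiting PyTorch sketch of this interval loss is given below. The λ weights and the per-prediction interval ranges follow Step3.4 and Step3.5; the per-image computation of d_min and d_max and the histogram-based ranking follow the description, while the tensor shapes and the guard against empty masks are implementation assumptions (torch.isin requires PyTorch 1.10 or later).

```python
# Minimal sketch of the interval-based loss of Step 3 (lambda1=0.5,
# lambda2=lambda3=0.6, lambda4=1; interval ranges per Step 3.4).
# Prediction and ground-truth tensors are assumed to have shape (N, 1, H, W).
import torch

LAMBDAS = [0.5, 0.6, 0.6, 1.0]
# interval positions (after sorting by share, descending) each prediction is scored on
INTERVAL_RANGES = [(5, 10), (4, 8), (2, 4), (1, 2)]

def interval_loss(preds, gt, num_bins: int = 10, eps: float = 1e-6):
    d_min, d_max = gt.min(), gt.max()
    length = (d_max - d_min) / num_bins

    # Assign each ground-truth pixel to one of the 10 depth intervals.
    bin_idx = torch.clamp(((gt - d_min) / (length + eps)).long(), 0, num_bins - 1)

    # Rank intervals by the share of the scene depth they occupy (descending).
    counts = torch.bincount(bin_idx.flatten(), minlength=num_bins)
    order = torch.argsort(counts, descending=True)   # order[0] = most "global" interval

    loss = gt.new_zeros(())
    for pred, lam, (lo, hi) in zip(preds, LAMBDAS, INTERVAL_RANGES):
        wanted = order[lo - 1:hi]                    # intervals this prediction focuses on
        mask = torch.isin(bin_idx, wanted)
        n_i = mask.sum().clamp(min=1)                # guard against empty masks
        loss = loss + lam * ((pred - gt)[mask] ** 2).sum() / n_i
    return loss

# usage: preds = [pred1, pred2, pred3, pred4]; loss = interval_loss(preds, gt_depth)
```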
The Step4 is specifically as follows:
Step4.1: The prediction results of the 4 stages are concatenated into a four-channel tensor F.
Step4.2: A convolution with a 16 × 16 kernel, a stride of 16 and 4 output channels is applied to the four-channel tensor F.
Step4.3: The two-dimensional tensor obtained after the convolution is flattened into one dimension.
Step4.4: The one-dimensional tensor is input into the Transformer Encoder, and the output one-dimensional tensor is restored to a two-dimensional tensor that serves as the weight matrix W.
Step4.5: A convolution with a 3 × 3 kernel, a stride of 1 and 128 output channels is applied to the four-channel tensor F to obtain a 128-channel tensor G.
Step4.6: After a pixel-by-pixel dot product between the weight matrix W and the tensor G, the final prediction result is output through a series of convolution layers.
Compared with the traditional approach of convolution after channel concatenation, the fusion method adopted in Step4 greatly improves the accuracy of the model's predictions while introducing almost no additional parameters.
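As a non-limiting illustration of Step4, the fusion can be sketched in PyTorch as follows. The 16 × 16 patch convolution with stride 16 and 4 output channels, the Transformer Encoder and the 128-channel feature path follow the text; the number of encoder layers and attention heads, and the spatial alignment of the weight matrix W with the tensor G (bilinear upsampling followed by channel-summed gating), are assumptions.

```python
# Sketch of the Transformer-based fusion of Step 4. The patch embedding
# (16x16 kernel, stride 16, 4 channels) and the 128-channel path follow the
# description; layer/head counts and the W-to-G alignment are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerFusion(nn.Module):
    def __init__(self, embed_dim: int = 4, num_layers: int = 2, num_heads: int = 2):
        super().__init__()
        self.patch_embed = nn.Conv2d(4, embed_dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.feat_conv = nn.Conv2d(4, 128, kernel_size=3, stride=1, padding=1)
        self.head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, preds):                        # preds: four (N, 1, H, W) tensors
        f = torch.cat(preds, dim=1)                  # four-channel tensor F
        tokens = self.patch_embed(f)                 # (N, 4, H/16, W/16)
        n, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)   # flatten into a token sequence
        tokens = self.encoder(tokens)                # Transformer Encoder
        weight = tokens.transpose(1, 2).reshape(n, c, h, w)   # restore 2-D weight matrix W
        weight = F.interpolate(weight, size=f.shape[-2:], mode="bilinear",
                               align_corners=False)  # align W with F (assumption)
        g = self.feat_conv(f)                        # 128-channel tensor G
        fused = g * weight.sum(dim=1, keepdim=True)  # pixel-wise gating (assumption)
        return self.head(fused)                      # final prediction via conv layers

# usage: out = TransformerFusion()([torch.randn(1, 1, 240, 320) for _ in range(4)])
```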
The invention has the beneficial effects that:
(1) To make full use of the fine-grained information of the input image, feature vectors are extracted from multiple stages of the encoder, which overcomes the tendency of traditional methods to lose fine-grained information in the deep layers of the encoding network;
(2) A decoder is designed that predicts global-to-local depth information in parallel branches and then adjusts and fuses it, together with a correspondingly designed loss function, which alleviates the difficulty traditional networks have in effectively considering the global and local relations of complex scenes;
(3) The invention is based on convolutional neural networks and achieves accurate results without requiring a large data set to drive training;
(4) Training does not require a very large data set, and the accuracy of monocular depth estimation is effectively improved.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a schematic comparison, in several scenes, of the depth maps predicted by the monocular depth estimation network adopted by the present invention and by the current state-of-the-art networks AdaBins and DPT-Hybrid, where:
(a) is an input RGB image;
(b) is a true depth map;
(c) is a depth map for AdaBins prediction;
(d) is a depth map of DPT-Hybrid prediction;
(e) is the depth map predicted by the present invention;
FIG. 3 is an exemplary diagram of the present invention for generating a three-dimensional point cloud from a single RGB image by predicting depth.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and not for limiting it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in FIG. 1, an indoor scene monocular image depth estimation method based on deep learning specifically includes the following steps:
In the feature extraction stage of the encoder, the feature vectors encoded by the 3rd, 5th, 6th, 8th and 12th blocks of the EfficientNet-b7 encoder are extracted; their spatial sizes are fixed fractions of H × W, where H and W are the height and width of the input image, respectively.
The feature vector encoded by the 3rd block is then input into 4 SENet-based residual blocks, the feature vector encoded by the 5th block into 3 SENet-based residual blocks, the feature vector encoded by the 6th block into 2 SENet-based residual blocks, and the feature vector encoded by the 8th block into 1 SENet-based residual block.
A channel attention layer is added after the last residual block of each stage, and a residual connection from the encoder to this layer is added to build a large residual block; the features of each stage are then passed step by step through 2× upsampling and convolution layers to obtain features for five stages, all with 30 channels and a resolution of half the input resolution.
Next, the features of the 1st, 2nd and 5th stages are added and fused pixel by pixel, the features of the 2nd, 3rd and 5th stages are added and fused pixel by pixel, the features of the 1st, 3rd and 4th stages are added and fused pixel by pixel, and the features of the 1st, 4th and 5th stages are added and fused pixel by pixel; each fused result then passes through a convolution layer to obtain four predictions, which are labeled prediction 1 to prediction 4 from the shallow layers of the neural network to the deep ones.
A loss function focusing on local-to-global depth is then designed for predictions 1 to 4.
First, the maximum depth d_max and the minimum depth d_min are obtained from the ground-truth depth map, and the depth interval [d_min, d_max] is divided equally into 10 intervals; the length of one interval is calculated as:
len = (d_max - d_min) / 10
Among these 10 intervals, the depth value range of the i-th interval is:
[d_min + (i - 1) × len, d_min + i × len]
then, a histogram is made for the real depth map to find an interval occupying the largest scene depth proportion in 10 intervals, wherein the interval contains most global information, and correspondingly, the interval occupying the smaller proportion contains more local information.
Next, the 10 depth intervals are sorted in descending order of the share they occupy; the mean squared error of prediction 1 is computed over the 5th to 10th intervals, that of prediction 2 over the 4th to 8th intervals, that of prediction 3 over the 2nd to 4th intervals, and that of prediction 4 over the 1st and 2nd intervals.
The four error terms are combined into one loss term, which constrains predictions 1 to 4 to attend to the local through the global during model training. The calculation formula is:
L = Σ_{i=1..4} λ_i · (1/n_i) · Σ_{p_i} (d_{p_i} - d̂_{p_i})²
where λ1 = 0.5, λ2 = λ3 = 0.6, λ4 = 1, n_i is the total number of pixels of the ground-truth depth map after the interval mask for prediction i, and d_{p_i} and d̂_{p_i} are the depth values of pixel p_i in the ground-truth depth map and in prediction i, respectively.
After predictions 1 to 4 are obtained, the four predictions need to be fused.
First, the prediction results of the 4 stages are concatenated into a four-channel tensor F. A convolution with a 16 × 16 kernel, a stride of 16 and 4 output channels is then applied to the four-channel tensor F, and the two-dimensional tensor obtained after the convolution is flattened into one dimension. Next, the one-dimensional tensor is input into the Transformer Encoder, and the one-dimensional tensor it outputs is restored to a two-dimensional tensor that serves as the weight matrix W. The four-channel tensor F then undergoes a convolution with a 3 × 3 kernel, a stride of 1 and 128 output channels, yielding a 128-channel tensor G. Finally, after a pixel-by-pixel dot product between the weight matrix W and the tensor G, the final prediction result is output through a series of convolution layers.
The proposed deep-learning-based indoor scene monocular image depth estimation method is evaluated on the NYU Depth v2 and SUN RGB-D data sets. NYU Depth v2 was captured in indoor scenes with a Microsoft Kinect RGB-D camera, while SUN RGB-D was captured with Intel RealSense, Asus Xtion, Kinect v1 and Kinect v2 devices. Both are indoor scene data sets, and SUN RGB-D contains more complex scenes.
Table 1 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training, together with the number of model parameters, for the parallel decoder used in the present invention and for a traditional simple serial decoder. As the data in Table 1 show, the method of the present invention obtains better results than the traditional method, improves the accuracy of depth map estimation to a certain extent, and reduces the model parameters by 29.4%.
TABLE 1
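The error and accuracy measures reported in Tables 1 to 7 are the metrics customarily used for monocular depth estimation. A minimal sketch of how they are typically computed is given below, assuming the standard definitions, with the accuracy under threshold taken as the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25^k.

```python
# Minimal sketch of the evaluation metrics reported in the tables, assuming the
# standard monocular-depth definitions (average relative error, root mean square
# error, log10 mean error, accuracy under thresholds 1.25^k).
import torch

def depth_metrics(pred, gt):
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()                 # average relative error
    rmse = ((pred - gt) ** 2).mean().sqrt()                   # root mean square error
    log10 = (pred.log10() - gt.log10()).abs().mean()          # logarithmic mean error
    ratio = torch.maximum(pred / gt, gt / pred)
    deltas = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]  # threshold accuracy
    return abs_rel, rmse, log10, deltas
```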
Table 2 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training, with and without the loss term for prediction i designed by the present invention. As the data in Table 2 show, the designed loss term effectively reduces the model's prediction error.
TABLE 2
Table 3 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training, together with the number of model parameters, for the Transformer fusion method used in the present invention and for direct training with the softmax-style computational fusion method, which is calculated as follows:
output = Σ_{i=1..4} σ(Conv(block_i)) ⊙ block_i
where block_i is prediction i in FIG. 1, σ denotes the sigmoid function and ⊙ denotes pixel-wise multiplication. Specifically, prediction i is convolved, mapped to the range 0 to 1 by the sigmoid function, multiplied by prediction i, and the results are finally added pixel by pixel to obtain the output. The encoder in this set of experiments uses EfficientNet-B3 pre-trained on ImageNet, which has fewer parameters and trains faster. As the data in Table 3 show, compared with this simple computational fusion, using the Transformer is advantageous in terms of both accuracy and error while adding few parameters.
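For illustration only, the computational fusion baseline just described can be sketched as follows; the 3 × 3 kernel size of the gating convolution is an assumption.

```python
# Sketch of the computational fusion baseline compared against in Table 3:
# each prediction block_i is convolved, passed through a sigmoid to obtain a
# 0-1 gate, multiplied back onto block_i, and the gated maps are summed
# pixel by pixel (the 3x3 kernel size is an assumption).
import torch
import torch.nn as nn

class GatedSumFusion(nn.Module):
    def __init__(self, num_preds: int = 4):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Conv2d(1, 1, kernel_size=3, padding=1) for _ in range(num_preds)]
        )

    def forward(self, preds):                        # preds: list of (N, 1, H, W) tensors
        gated = [torch.sigmoid(g(p)) * p for g, p in zip(self.gates, preds)]
        return torch.stack(gated, dim=0).sum(dim=0)  # pixel-wise addition
```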
TABLE 3
Table 4 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold on the NYU Depth v2 data set after training when the number of intervals in the loss term for prediction i is set to 1, 4 and 10, respectively. The encoder in this set of experiments also uses EfficientNet-B3 pre-trained on ImageNet. As the data in Table 4 show, dividing the depth range into 10 intervals achieves the best effect.
TABLE 4
FIG. 2 compares the depth maps predicted in several scenes by the monocular depth estimation network adopted by the present invention and by the current state-of-the-art networks AdaBins and DPT-Hybrid, wherein: (a) the input RGB image; (b) the ground-truth depth map; (c) the depth map predicted by AdaBins; (d) the depth map predicted by DPT-Hybrid; (e) the depth map predicted by the present invention. As can be seen from the figure, the invention accurately predicts the depth information of indoor monocular RGB images, and compared with AdaBins the edge contours of objects are clearer.
FIG. 3 shows a three-dimensional point cloud generated from a depth map predicted by the present invention. As can be seen from the figure, the invention effectively recovers three-dimensional depth information from a two-dimensional image and provides useful guidance for tasks such as three-dimensional reconstruction and scene understanding.
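For reference, a point cloud such as the one in FIG. 3 is typically obtained from a predicted depth map by pinhole back-projection. A minimal sketch follows; the intrinsic parameters fx, fy, cx and cy are placeholder values, not the camera calibration actually used for FIG. 3.

```python
# Minimal sketch of back-projecting a predicted depth map to a 3-D point cloud
# with a pinhole camera model; the intrinsics below are placeholder values.
import torch

def depth_to_point_cloud(depth, fx=518.8, fy=519.5, cx=325.6, cy=253.7):
    """depth: (H, W) tensor in metres -> (H*W, 3) tensor of XYZ points."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return torch.stack((x, y, z), dim=-1).reshape(-1, 3)
```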
Table 5 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold of the present invention and of the current state-of-the-art methods AdaBins and DPT-Hybrid on the NYU Depth v2 data set. As the data in Table 5 show, the method of the present invention achieves better results on several metrics and improves the accuracy of depth map estimation to a certain extent. Although the encoders of these state-of-the-art methods and of the present method are all pre-trained on ImageNet, DPT-Hybrid requires a large amount of additional training data and fine-tuning on NYU Depth v2 to achieve good results. Specifically, DPT-Hybrid must first be trained for 60 epochs on a data set containing 1.4 million images and then fine-tuned on NYU Depth v2, whereas AdaBins and the model of the invention only require training for 25 epochs and 20 epochs, respectively, on a 50,000-image subset of NYU Depth v2.
TABLE 5
Table 6 reports the average relative error, root mean square error, logarithmic mean error and accuracy under threshold obtained when the method of the present invention and the current state-of-the-art methods AdaBins and DPT-Hybrid, all trained on the NYU Depth v2 data set, are tested on the SUN RGB-D data set. Since no effective way of handling the inverse depth in the ground-truth depth maps was found for this test, only missing values are masked in the experiment; and since DPT-Hybrid only accepts input at a resolution of 480 × 640, the image resolution is uniformly adjusted to 480 × 640. As the data in Table 6 show, the generalization ability of the model of the invention ranks roughly second among the three.
TABLE 6
Table 7 reports the number of model parameters of the method of the present invention and of the current state-of-the-art methods AdaBins and DPT-Hybrid, together with the time required for a single prediction. As the data in Table 7 show, the model of the invention has fewer parameters than the other two models. In the time required for one prediction, the model of the invention is slightly slower than AdaBins and ranks second. The inference-speed experiment was run on a machine equipped with an NVIDIA GeForce GTX 1660 Ti GPU, with input images at a resolution of 480 × 640. Because the output resolution of DPT-Hybrid is twice that of AdaBins and of the present invention, the timings of AdaBins and of the present invention also include the time for 2× resampling. The times in the table are averages over 5 thousand inferences.
TABLE 7
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (5)

1. A deep-learning-based indoor scene monocular image depth estimation method, which is characterized by comprising the following steps:
Step 1: introducing a neural network EfficientNet-b7 pre-trained by image classification on ImageNet, and constructing an encoder;
Step 2: introducing SENet-based residual connections and convolution and resampling operations at different stages of the encoder to obtain predictions at different stages;
Step 3: constructing, based on a depth interval division method, a loss function that attends to the image from the global level to the local level, and applying the loss function to the predictions at different stages;
Step 4: fusing the depth information predicted at different stages by using a Transformer structure based on an attention mechanism, and outputting a scene depth prediction result.
2. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 1, wherein Step 1 is specifically: downloading from the Internet an EfficientNet-b7 network pre-trained on ImageNet and obtaining the feature vectors encoded at its 3rd, 5th, 6th, 8th and 12th blocks, the resolutions of which are, respectively, fixed fractions of the input image resolution.
3. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 2, wherein Step 2 is specifically:
Step 2.1: inputting the feature vector encoded by the 3rd block into 4 SENet-based residual blocks, the feature vector encoded by the 5th block into 3 SENet-based residual blocks, the feature vector encoded by the 6th block into 2 SENet-based residual blocks, and the feature vector encoded by the 8th block into 1 SENet-based residual block;
Step 2.2: adding a channel attention layer after the last residual block of each stage and adding a residual connection from the encoder to this layer;
Step 2.3: passing the features of each stage step by step through 2× upsampling and convolution layers to obtain features for five stages, all with 30 channels and a resolution of half the input resolution;
Step 2.4: adding and fusing the features of the 1st, 2nd and 5th stages pixel by pixel, adding and fusing the features of the 2nd, 3rd and 5th stages pixel by pixel, adding and fusing the features of the 1st, 3rd and 4th stages pixel by pixel, adding and fusing the features of the 1st, 4th and 5th stages pixel by pixel, and then passing each fused result through a convolution layer to obtain four predictions, which are labeled prediction 1 to prediction 4 from the shallow layers of the neural network to the deep ones.
4. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 3, wherein Step 3 is specifically:
Step 3.1: acquiring a maximum depth d_max and a minimum depth d_min from the ground-truth depth map;
Step 3.2: dividing the depth interval [d_min, d_max] equally into 10 intervals, the length of one interval being calculated as:
len = (d_max - d_min) / 10
among these 10 intervals, the depth value range of the i-th interval being:
[d_min + (i - 1) × len, d_min + i × len]
Step 3.3: computing a histogram of the ground-truth depth map to find, among the 10 intervals, the interval that accounts for the largest share of the scene depth;
Step 3.4: sorting the 10 depth intervals in descending order of the share they occupy, and computing the mean squared error of prediction 1 from Step 2.4 over the 5th to 10th intervals, that of prediction 2 over the 4th to 8th intervals, that of prediction 3 over the 2nd to 4th intervals, and that of prediction 4 over the 1st and 2nd intervals;
Step 3.5: combining the four error terms into one loss term, which constrains predictions 1 to 4 to attend to the local through the global during model training, the calculation formula being:
L = Σ_{i=1..4} λ_i · (1/n_i) · Σ_{p_i} (d_{p_i} - d̂_{p_i})²
where λ1 = 0.5, λ2 = λ3 = 0.6, λ4 = 1, n_i is the total number of pixels of the ground-truth depth map after the interval mask for prediction i, and d_{p_i} and d̂_{p_i} are the depth values of pixel p_i in the ground-truth depth map and in prediction i, respectively.
5. The depth estimation method for monocular images of indoor scenes based on deep learning of claim 1, wherein Step 4 is specifically:
Step 4.1: concatenating the prediction results of the 4 stages into a four-channel tensor F;
Step 4.2: applying to the four-channel tensor F a convolution with a 16 × 16 kernel, a stride of 16 and 4 output channels;
Step 4.3: flattening the two-dimensional tensor obtained after the convolution into one dimension;
Step 4.4: inputting the one-dimensional tensor into the Transformer Encoder, and restoring the output one-dimensional tensor to a two-dimensional tensor that serves as the weight matrix W;
Step 4.5: applying to the four-channel tensor F a convolution with a 3 × 3 kernel, a stride of 1 and 128 output channels to obtain a 128-channel tensor G;
Step 4.6: after performing a pixel-by-pixel dot product between the weight matrix W and the tensor G, outputting a final prediction result through a series of convolution layers.
CN202210251724.8A 2022-03-15 2022-03-15 Indoor scene monocular image depth estimation method based on deep learning Pending CN114638870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210251724.8A CN114638870A (en) 2022-03-15 2022-03-15 Indoor scene monocular image depth estimation method based on deep learning


Publications (1)

Publication Number Publication Date
CN114638870A true CN114638870A (en) 2022-06-17

Family

ID=81947769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210251724.8A Pending CN114638870A (en) 2022-03-15 2022-03-15 Indoor scene monocular image depth estimation method based on deep learning

Country Status (1)

Country Link
CN (1) CN114638870A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883479A (en) * 2023-05-29 2023-10-13 杭州飞步科技有限公司 Monocular image depth map generation method, monocular image depth map generation device, monocular image depth map generation equipment and monocular image depth map generation medium
CN116883479B (en) * 2023-05-29 2023-11-28 杭州飞步科技有限公司 Monocular image depth map generation method, monocular image depth map generation device, monocular image depth map generation equipment and monocular image depth map generation medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination