All about Super-resolution and Object Detection methods (Part 1)
Updated: Feb 14
The emergence of deep convolutional neural networks has led to significant advances in object detection over the past few years. Small Object Detection (SOD), however, remains one of the notoriously difficult challenges in computer vision: the inherent structure of small targets yields weak visual signals and noisy representations that this progress has not resolved. In this post, Pixta AI will first give a thorough analysis of SOD and Super-resolution, and then list the ways in which they might be combined to address the issue of poor object data quality.
What is Object Detection? How important is it?
Simply put, Object Detection is a fundamental problem that plays a major role in classifying and locating objects in an image or video. Thanks to great progress in both data and research, Object Detection has achieved many impressive results and is widely applied in practice. Object Detection for Super-resolution images and Small Object Detection (SOD) form a small sub-field that identifies objects which are relatively small compared to the image, and several of its applications matter in practice: detecting objects in satellite and drone imagery, identifying people and vehicles in surveillance systems, recognizing traffic signs for self-driving cars, and so on.
Object prediction for Super-resolution and SOD problems
How about Super-resolution?
Unlike the vigorous development of conventional Object Detection, Super-resolution and SOD attract less research and tend to progress more slowly. A common belief is that increasing the image size will make the features of objects more recognizable and therefore yield better predictions. In fact, for conventional Object Detection models there is a huge gap between the results on small objects and those on medium-sized objects (Table 1).
Table 1: Comparison of common Object Detection algorithms on the MS COCO dataset.
What makes Super-resolution and Object Detection methods difficult to implement?
Small objects carry weak features, and the spatial reduction used during convolution to build representative feature maps will inadvertently remove those objects along with their surroundings, so the model fails: representative features of the object cannot be extracted.
Representative features extracted from an unstable model
Representative features are important for classifying and identifying objects. Small objects are often of poor quality and blend with the surrounding environment and other objects, so the extracted features contain noise from the environment, degrading the recognition results.
For the Super-resolution problem in particular, high-resolution images mean objects appear at many different sizes, making the search space more complex. The model therefore needs more data to learn from, including these special cases. Given the scale of such data, not to mention the shortage of high-quality data, the results are often poorer than on common datasets (COCO, ImageNet, ...).
Object size changes how much a bounding-box deviation costs
Determining the coordinates of an object is one of the important parts of Object Detection, and this result is usually evaluated with the IoU (Intersection over Union) metric. However, coordinate accuracy is much harder to achieve on small objects than on larger ones. As shown in the figure, with a deviation of 6 pixels horizontally and vertically from the Ground Truth, the IoU of a small object drops sharply (from 100% to 32.5%) compared to medium-sized and large objects (56.6% and 71.8% respectively).
Object size affects the impact of a bounding-box deviation. The top left, bottom left, and right panels show small (20×20), medium (40×40) and large (70×70) objects respectively. A is the Ground Truth (GT); B and C are predictions shifted 3 and 6 pixels from the GT respectively.
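The IoU drop described above is easy to reproduce. The sketch below computes IoU for boxes in (x1, y1, x2, y2) format, with the prediction shifted 6 pixels in both axes (box C in the figure); the box sizes are the ones from the figure.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# ground truth vs. a prediction shifted 6 px in x and y, for each object size
for size in (20, 40, 70):
    gt = (0, 0, size, size)
    pred = (6, 6, size + 6, size + 6)
    print(size, round(iou(gt, pred), 3))
# 20 -> 0.325, 40 -> 0.566, 70 -> 0.718
```

The same 6-pixel error costs the 20×20 object two thirds of its IoU but the 70×70 object less than a third, which is exactly why localization is so unforgiving for small objects.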
Size of the object compared to the whole image (or video)
Speaking of small objects, there is no rule that explicitly defines whether an object in a particular sample is small. In SOD, small objects are usually about 30×30 pixels, occupying less than 1% of the image area (for images of 256×256 to 512×512); the object is often quite faint but still recognizable to the naked eye or from the surrounding context. In Super-resolution imagery, the image is typically around 5000×5000 pixels (sometimes up to 10000×10000 or more), so an object called "small" may actually be a reasonable size for that data. For conventional images, the difficulty lies in the model's ability to separate out features while the amount of computation for this type of problem is large. But in Super-resolution data (5000×5000), an object as small as in normal data (30×30) accounts for only about 0.004% of the entire image. Combining the two problems therefore makes the task much harder.
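The area fractions quoted above follow directly from the stated sizes; a quick sketch to check the arithmetic:

```python
def relative_area(obj_w, obj_h, img_w, img_h):
    """Fraction of the image area covered by the object, as a percentage."""
    return 100.0 * (obj_w * obj_h) / (img_w * img_h)

# a 30x30 object in a 512x512 image vs. in a 5000x5000 super-resolution image
print(round(relative_area(30, 30, 512, 512), 3))    # 0.343 (well under 1%)
print(round(relative_area(30, 30, 5000, 5000), 4))  # 0.0036 (~0.004%)
```

The same absolute object size shrinks by roughly two orders of magnitude in relative terms when moving to Super-resolution imagery, which is the combined difficulty the text describes.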
In recent years, we have seen the contribution of many large and diverse datasets, enabling rigorous evaluation and rapid progress of models in the field of Object Detection; familiar examples include COCO, Pascal VOC, ImageNet, etc. Unfortunately, the data available for SOD and Super-resolution problems is neither plentiful nor diverse in purpose: the objects in these datasets are mainly pedestrians, vehicles, houses, etc., and the images are usually satellite data.
In this article, we will give an overview of the relevant studies and evaluate each method for the problems mentioned: SOD, Super-resolution, and their combination.
The main methods of Super-resolution Object Detection
Current Object Detection models are usually classified into two categories: two-stage and one-stage detection. Two-stage detectors generate region proposals through an RPN (Region Proposal Network), after which a detection head takes those proposals to localize and classify objects; popular models include the R-CNN family, FPN, etc. In contrast, one-stage detectors extract features over a grid with predefined anchor boxes and localize and classify objects directly from those features; popular models include the YOLO family, SSD, etc. Choosing between the two involves a trade-off: one-stage detection prioritizes processing speed, while two-stage detection gives more accurate results. Beyond these two common types, there are other Object Detection models such as anchor-free designs (YOLOX, CenterNet, ...) and query-based designs (DETR), ...
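To make the "grid with predefined anchor boxes" concrete, here is a minimal, framework-free sketch of anchor generation as used by one-stage detectors; the sizes, ratios, and centre-format output are illustrative choices, not any specific model's configuration.

```python
def anchor_grid(feat_w, feat_h, stride, sizes, ratios):
    """Generate centre-format anchor boxes (cx, cy, w, h) for every cell of a
    feature map; each cell proposes len(sizes) * len(ratios) candidate boxes."""
    anchors = []
    for j in range(feat_h):
        for i in range(feat_w):
            cx = (i + 0.5) * stride  # cell centre in input-image pixels
            cy = (j + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w = s * r ** 0.5   # scale the base size by the aspect ratio
                    h = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

# a 4x4 feature map with stride 8, two base sizes, three aspect ratios
boxes = anchor_grid(4, 4, 8, sizes=(16, 32), ratios=(0.5, 1.0, 2.0))
print(len(boxes))  # 4 * 4 * 2 * 3 = 96
```

The detector's classification and regression heads then predict one score and one offset per anchor, which is why anchor coverage at small scales matters so much for SOD.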
Data manipulation methods
In fact, it is extremely difficult for a dataset to be balanced across all classes as well as object sizes. Small objects are usually far outnumbered by common ones, so the model tends to ignore them while the overall metrics (precision and recall) remain stable, precisely because the number of small objects is small.
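One common remedy for this imbalance is repeat-factor sampling (used, for example, with the LVIS dataset): images containing rare categories are repeated during training. The sketch below uses the standard formula r(c) = max(1, sqrt(t / f(c))); the frequencies and threshold are made-up illustrative values.

```python
import math

def repeat_factors(class_freqs, threshold):
    """Per-class repeat factor r(c) = max(1, sqrt(t / f(c))): classes rarer
    than the threshold frequency t are sampled more often during training."""
    return {c: max(1.0, math.sqrt(threshold / f)) for c, f in class_freqs.items()}

# small objects appear in only 1% of images, large ones in 40%
freqs = {"small": 0.01, "large": 0.40}
print(repeat_factors(freqs, threshold=0.04))
# small is repeated ~2x per epoch, large is left unchanged
```

Because common classes keep a factor of 1, overall metrics are barely disturbed while the rare (here, small-object) samples get seen more often.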
Oversampling-based augmentation strategy
Several methods of oversampling augmentation. Top: copy-paste augmentation; bottom: mosaic augmentation.
Copy-paste augmentation copies objects, rescales them, and pastes them at suitable positions in different samples, making the training data more balanced. Mosaic augmentation stitches a number of images together and shrinks the result back to the original size, creating a new sample; by shrinking the original objects (through compositing and resizing) it increases the number of small objects in the data.
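A naive version of copy-paste can be sketched in a few lines; this toy implementation works on a grayscale image stored as a 2D list and pastes copies at random positions without checking for overlaps, which a real implementation would handle.

```python
import random

def copy_paste(image, box, n_copies, seed=0):
    """Naive copy-paste augmentation: crop the object given by
    box = (x, y, w, h) and paste it at random positions, returning the
    augmented image and the enlarged list of annotation boxes."""
    rng = random.Random(seed)
    x, y, w, h = box
    patch = [row[x:x + w] for row in image[y:y + h]]
    H, W = len(image), len(image[0])
    new_boxes = [box]
    for _ in range(n_copies):
        nx = rng.randrange(0, W - w)
        ny = rng.randrange(0, H - h)
        for dy in range(h):
            image[ny + dy][nx:nx + w] = patch[dy]
        new_boxes.append((nx, ny, w, h))
    return image, new_boxes

img = [[0] * 64 for _ in range(64)]
for dy in range(4):                 # draw a 4x4 "object" at (10, 10)
    img[10 + dy][10:14] = [255] * 4
img, boxes = copy_paste(img, (10, 10, 4, 4), n_copies=3)
print(len(boxes))  # 4 annotated instances instead of 1
```

One original instance becomes four, which is exactly the rebalancing effect the augmentation is after; mosaic achieves a similar increase by composing and shrinking whole images instead.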
Automatic augmentation scheme
Besides hand-crafting the conditions that make an augmentation method suitable, there are algorithms that automatically search for those conditions so that the result of the augmentation is optimal. Most current automatic augmentation methods are therefore parameter-optimization problems; Reinforcement Learning is a fairly effective automatic search method (AutoAugment), but it requires a huge amount of computation because the search space is usually very large (growing with image size, number of samples, and number of classes).
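To illustrate the "parameter optimization" framing, here is a random-search sketch: a cheap stand-in for the RL controller of AutoAugment. The search space, parameter names, and the toy scoring function are all invented for illustration; in practice the score would be validation accuracy after a short training run.

```python
import random

def random_search(score_fn, space, n_trials, seed=0):
    """Random search over augmentation parameters: sample policies from
    `space` (name -> candidate values) and keep the best-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        policy = {name: rng.choice(vals) for name, vals in space.items()}
        s = score_fn(policy)  # e.g. validation mAP of a model trained with it
        if s > best_score:
            best, best_score = policy, s
    return best, best_score

# toy stand-in objective: prefers moderate rotation and more pasted copies
space = {"rotate_deg": [0, 5, 10, 15], "paste_copies": [0, 1, 2, 4]}
toy_score = lambda p: -abs(p["rotate_deg"] - 10) + p["paste_copies"]
policy, score = random_search(toy_score, space, n_trials=50)
print(policy, score)
```

The cost AutoAugment pays is that each "trial" is an entire training run rather than a cheap function call, which is why the search is so expensive.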
Objects in an image often vary in size and aspect ratio, and for Super-resolution images the variation is even larger, which makes prediction with a single detector quite difficult. Empirical studies show that it is much easier for a model to predict objects within a fixed (or low-variability) size range than across widely varying sizes. Because of this high variability, current methods process the image (or the extracted features) so that, for a given input, the model (or part of it) only needs to predict objects within a fixed size range; objects that are too large or too small can be ignored, since those cases are handled with a different input built from the same sample.
Two kinds of multi-scale detection methods: (a) image-pyramid methods; (b) feature-pyramid methods
One natural solution is to resize the image to several different scales (image pyramid) and predict on each scale in turn (TridentNet, PANet, QueryDet, ...), or to predict on the output feature map of each layer (feature pyramid) (HyperNet, SSD, YOLO, ...).
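The image-pyramid idea with per-scale size filtering can be sketched as follows. The detector itself is a stand-in callable (here a toy lambda returning fixed boxes); the valid-range filtering mirrors SNIP-style scale normalization, and a real pipeline would also deduplicate across scales with NMS.

```python
def pyramid_detect(detect_fn, image, scales, valid_range):
    """Image-pyramid inference sketch: run the same detector at several
    scales and keep, for each scale, only boxes whose apparent size in the
    rescaled image falls in the range the detector handles well.
    detect_fn(image, scale) must return (x, y, w, h, score) boxes expressed
    at the original resolution."""
    lo, hi = valid_range
    kept = []
    for s in scales:
        for (x, y, w, h, score) in detect_fn(image, s):
            apparent = max(w, h) * s  # size the object appears at under scale s
            if lo <= apparent <= hi:
                kept.append((x, y, w, h, score))
    return kept

# toy detector: returns the same two boxes regardless of scale
raw = [(5, 5, 16, 16, 0.9), (40, 40, 120, 120, 0.8)]
dets = pyramid_detect(lambda img, s: raw, image=None,
                      scales=(0.5, 1.0, 2.0), valid_range=(24, 96))
print(len(dets))  # 2: the small box passes only at 2x, the large only at 0.5x
```

Upscaling brings the small object into the detector's comfortable size range while downscaling does the same for the large one, which is exactly the division of labour the pyramid provides.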
Tailored training schemes
Prediction process of Tailored training schemes
In most images the background occupies the vast majority of pixels, so extracting features from regions containing no objects is wasteful. Based on this idea, the related methods usually work as follows: sample an image and scale it to a fixed size -> identify objects and regions that may contain objects -> crop out the regions that may contain objects, creating new samples.
Methods in the SNIP family (Scale Normalization for Image Pyramids) apply this process to simultaneously predict objects and predict chips (smaller image crops taken from the original sample) that are fed into the next cycle.
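The chip-cropping step can be sketched as below. This is a simplified illustration (fixed-size chips centred on candidate regions and clamped to the image border), not the exact SNIP algorithm, which also selects chips to maximize object coverage.

```python
def crop_chips(image, regions, chip_size):
    """Cut fixed-size 'chips' around candidate regions (x, y, w, h), clamped
    to the image border. `image` is a 2D list of pixels; returns each chip
    together with its (x0, y0) offset in the original image."""
    H, W = len(image), len(image[0])
    chips = []
    for (x, y, w, h) in regions:
        cx, cy = x + w // 2, y + h // 2          # centre the chip on the region
        x0 = min(max(cx - chip_size // 2, 0), W - chip_size)
        y0 = min(max(cy - chip_size // 2, 0), H - chip_size)
        chip = [row[x0:x0 + chip_size] for row in image[y0:y0 + chip_size]]
        chips.append(((x0, y0), chip))
    return chips

img = [[0] * 100 for _ in range(100)]
chips = crop_chips(img, regions=[(2, 2, 10, 10), (80, 90, 15, 8)], chip_size=32)
print([off for off, _ in chips])   # [(0, 0), (68, 68)]
```

Each chip is then treated as a new training or inference sample, so the detector spends its computation only where objects are likely to be, and small objects occupy a much larger fraction of each chip than of the full image.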
<To be continued>
This article has presented an overview of SOD and Super-resolution, as well as studies of the related data, problems, and models. The algorithm families are diverse, but none of them is perfect; each direction has its own strengths and weaknesses, and the right choice depends heavily on the problem we want to solve.
We will continue to publish more on SOD and Super-resolution as you expected. Please stay tuned and keep up to date with the latest knowledge on our Blog.