04.Image Segmentation - polimi-notes

# Image Segmentation Image segmentation involves partitioning an image into multiple segments or groups of pixels, aiming to identify pixel groups that belong together: - Group similar-looking pixels for efficiency. - Divide images into coherent objects. Actually image segmentation can be divided into: - **Semantic Segmentation**: Assigns a label to each pixel in an image, without differentiating between different instances of the same object within a category. - **Instance Segmentation**: Similar to semantic segmentation but also differentiates between different instances of the same object/category. There are two types of segmentation: - **Unsupervised Segmentation**: Group similar-looking pixels based on characteristics like color, texture, etc., without using any training data. - **Supervised Segmentation**: Involves using labelled training data where each pixel is associated with a class label. ## Simple semantic segmentation approaches Simple solutions are: - **Direct heatmap predictions**: Assigning predicted labels from a heatmap to corresponding receptive fields. Provides a coarse estimate. - **Only convolutions**: Avoids pooling; uses only convolutional and activation layers. Results in a small receptive field and inefficiency. - **Shift and Stich**: Assume there is a down sampling ratio $f$ between the size of input and of the output heatmap. - Uses a down sampling ratio $f$ between the input and output heatmap. - Computes heatmaps for all $f^2$ shifts of the input. - Interleaves the heatmaps to form a full-sized image. ### Heatmaps **Heatmaps** are visual data displays. They use colour gradients to represent different data values. It's possible to use heatmaps to show how convolutional filters are activated in an image. The heatmap pixels, corresponds to **receptive fields**, mark where these activations occur. These fields essentially indicate the precise spots in the input image where the associated features are detected or activated most frequently. For image segmentation, the last layer of a CNN generates a heatmap with a unique channel per class. Each heatmap slice reflects the probability (between $0$ and $1$) of a pixel belonging to a specific class. These values are generally lower resolution than the original image. ### Up-sampling Up-sampling is a process that is used in convolutional neural networks (CNNs) to increase the size of the feature maps produced by the network, particularly useful in heatmap prediction. Up-sampling is essentially the opposite of pooling or down-sampling, increasing output resolution. **Image segmentation**'s primary goal necessitates the classification of each pixel in an image, often dealing with the tension between **location** (local information 'where') and **semantics** (global information 'what'). The key to balancing the location/semantics tension is the combination of coarse and fine layers. The first part of the architecture uses deep features to encode semantic information. The second half is designed to **up sample** the predictions for each image pixel. As the network gets deeper, the extracted features become more abstract and semantically rich, helping the network to identify complex patterns. Once the semantics are encoded in the deep layers, the network includes layers for feature map **up-sampling**. Multiple techniques can be applied at this stage. ![](images/f445cf23614f05113626a2e81c5cce5a.png) Different way to do up-sampling: 1. **Nearest Neighbour Up-sampling**: This is a simpler method of increasing the size of an image or feature map by replicating the values of the original pixels. It involves resizing the original image or feature map by inserting zero-valued pixels between the original pixels and then setting the value of each new pixel to the value of the closest original pixel. It is a simple and fast method of upsampling that can be used in a convolutional neural network (CNN). 2. **Max Unpooling**: This involves saving the positions of the maxima during max pooling and then placing the values back into their original positions during max unpooling, thereby recovering spatial position information. 3. **Transpose Convolution**: Increases the surface area of the input volume (by adding zeros) followed by convolution. This method increases the size of the output through **learnable filters**, which are learned during training, thereby increasing the surface area while reducing depth. 4. **Prediction Upsampling**: Employs convolutions with fractional stride filters to enlarge the image. However, the results from this method are often not optimal. 5. **Skip Connections**: In architectures like U-Net, skip connections directly connect the feature maps from downsampling (encoder) layers to the upsampling (decoder) layers. Skip connections indirectly improve the quality of the upsampling process. They enable the network to use fine-grained details along with high-level semantic information, which is especially important in tasks like super-resolution or any task that requires detailed reconstruction of an image. ### U-Net U-Net is a robust, efficient, and highly adaptable architecture that has set a standard in medical image segmentation and has applications in various image segmentation tasks. U-Net has a distinctive "U-shaped" architecture. It consists of two paths: a contracting (downsampling) path and an expansive (upsampling) path, which gives it the U-shaped appearance. Its design effectively combines these two paths: - the context captured in the downsampling path: - The contracting path is a part of the convolutional network that includes repeated application of convolution and pooling operations to reduce the size of the feature map. - The number of feature channels is doubled at each downsampling step. - the localization ability in the upsampling path: - The upsampled feature map is then concatenated with the corresponding cropped feature map from the contracting path. A key component of the U-Net is the **skip connections** that connect the contracting path to the expansive path. These connections provide the expansive path with context information to better localize and learn representations. Image segmentation models like as U-Net, typically have an output shape of (`Height, Width, C`), where `C` is the number of classes. Each pixel in the output has `C` values, indicating the probability distribution across the classes. A softmax activation function is often used across the `C` channels.