YOLOv8 Architecture Explained!

Abin Timilsina
10 min read · Mar 17, 2024



What is YOLOv8?

YOLOv8, developed by Ultralytics, is a state-of-the-art object detection model that pushes the boundaries of speed, accuracy, and ease of use. The acronym YOLO stands for "You Only Look Once": the model predicts all bounding boxes in a single network pass, which is what gives the algorithm its efficiency and real-time processing capability. By comparison, many other object detection techniques run the detection process through several phases or stages.

YOLOv8 expands on the popular YOLOv5 architecture and improves on it in these areas. The primary distinction between YOLOv8 and its predecessor is the use of anchor-free detection, which speeds up the Non-Maximum Suppression (NMS) post-processing. YOLOv8 can identify and locate objects in images and videos with impressive speed and precision, and it also tackles tasks like image classification and instance segmentation.

YOLOv8 Variants

All of these models belong to the YOLOv8 family, and each variant offers a different trade-off between accuracy, speed, and model size. The variants differ in the values of three scaling parameters: depth_multiple (d), width_multiple (w), and max_channels (mc).

depth_multiple (d): This parameter determines how many Bottleneck blocks are used in each C2f block; it scales the number of layers in the network. A value less than 1 reduces the depth (fewer layers), making the model smaller and faster but potentially less accurate. Conversely, a value greater than 1 increases the depth (more layers), yielding a larger and potentially more accurate model that is slower to run.

width_multiple (w): This scales the number of channels in the convolutional layers. A value less than 1 thins the network (fewer channels), resulting in a smaller and faster model but potentially sacrificing some accuracy. On the other hand, a value greater than 1 widens the network (more channels), creating a larger and potentially more accurate model but requiring more processing power.

max_channels (mc): This parameter sets an upper limit on the number of channels allowed in the network. It is a safety measure that prevents the model from becoming too wide (too many channels), especially when width_multiple is set high. This helps control the model size and prevent overfitting.

Types of YOLOv8:

  • n: smallest model, fastest inference but lowest accuracy
  • s: small model, good balance of speed and accuracy
  • m: medium model, higher accuracy than the small models with moderate inference speed
  • l: large model, high accuracy at the cost of slower inference
  • x: extra-large model, highest accuracy but slowest inference, suited to resource-intensive applications
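For reference, the scaling values published in the Ultralytics yolov8.yaml configuration (exact values may change between releases) are:

  • n: d = 0.33, w = 0.25, mc = 1024
  • s: d = 0.33, w = 0.50, mc = 1024
  • m: d = 0.67, w = 0.75, mc = 768
  • l: d = 1.00, w = 1.00, mc = 512
  • x: d = 1.00, w = 1.25, mc = 512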

Blocks used in YOLOv8 Architecture

Before taking a deep dive into the architecture of YOLOv8, we have to learn about the basic blocks used in the architecture.

Convolutional Block (Conv Block)

It is the most basic block in the architecture and consists of a Conv2d layer, a BatchNorm2d layer, and the SiLU activation function.

Conv2d Layer: Convolution is a mathematical operation that involves sliding a small matrix (called a kernel or filter) over the input data, performing element-wise multiplication, and summing the results to produce a feature map. The “2D” in Conv2D refers to the fact that the convolution is applied in two spatial dimensions, typically height and width.

  • k: Kernel size. The height and width of the filter that slides over the input; for example, k = 3 denotes a 3 × 3 kernel. Larger kernels capture a wider spatial context at a higher computational cost.
  • s: Stride. It is the step size at which the filter/kernel slides over the input. A larger stride reduces the spatial dimensions of the output volume.
  • p: Padding. Padding is the additional border of zeros added to the input on each side. It helps preserve spatial information and can be used to control the spatial dimensions of the output volume.
  • c: Number of channels in the input. For example, in an RGB image, c would be 3 (one channel for each color: red, green, and blue).

BatchNorm2d Layer: Batch Normalization (BatchNorm2d) is a technique used in deep neural networks to improve training stability and convergence speed. In convolutional neural networks (CNNs), the BatchNorm2d layer applies batch normalization to 2D inputs, typically the outputs of convolutional layers. It keeps the values flowing through the network from becoming too large or too small, which helps prevent problems during training.

SiLU Activation Function: SiLU, which stands for Sigmoid Linear Unit, is an activation function used in neural networks. It is also known as the Swish activation function.

The SiLU activation function is defined as follows:

SiLU(x) = x · σ(x)

where σ(x) is the sigmoid function, given by:

σ(x) = 1 / (1 + e^(−x))

The key characteristic of SiLU is that it allows for smooth gradients, which can be beneficial during the training of neural networks. Smooth gradients can help avoid issues like vanishing gradients, which can impede the learning process in deep neural networks.
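Putting the three pieces together, here is a minimal PyTorch sketch of the Conv block (the class name ConvBlock and its defaults are illustrative; the Ultralytics codebase calls this module Conv):

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # The basic Conv block: Conv2d -> BatchNorm2d -> SiLU.
    def __init__(self, c_in, c_out, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 640, 640)           # one RGB image, as in Block 0 below
y = ConvBlock(3, 16, k=3, s=2, p=1)(x)    # stride 2 halves the resolution
print(y.shape)                            # torch.Size([1, 16, 320, 320])

The bias term of Conv2d is disabled because the BatchNorm2d layer that follows has its own learnable shift, which makes a convolution bias redundant.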

Bottleneck Block

The Bottleneck block passes the input through two Conv blocks in series. If shortcut=True, the input is added to the output of the second Conv block through a shortcut connection; otherwise, the output of the two Conv blocks is used as-is.

Shortcut Connection: The shortcut connection, also known as a skip connection or residual connection, is a direct connection that bypasses one or more layers in the network. It allows the gradient to flow more easily through the network during training, addressing the vanishing gradient problem and making it easier for the model to learn.

In the specific context of a bottleneck block, the shortcut connection allows the model to bypass the convolutional blocks if necessary. This way, the model can choose to use the identity mapping provided by the shortcut, making it easier to learn the identity function when needed. The inclusion of a shortcut connection enhances the model's ability to learn complex representations and improves the training of deep CNNs by mitigating the vanishing gradient problem.

What is the vanishing gradient problem?
The vanishing gradient problem is a challenge that arises during the training of deep neural networks, particularly in architectures with many layers. It occurs when the gradients of the loss function with respect to the parameters (weights) of the network become extremely small as they are backpropagated from the output layer to the input layer during training.
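A minimal sketch of the Bottleneck block, reusing the ConvBlock class from above (the sketch assumes the input and output channel counts match, which is what makes the element-wise addition of the shortcut possible):

class Bottleneck(nn.Module):
    # Two Conv blocks in series with an optional residual (shortcut) connection.
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBlock(c, c)    # 3x3 Conv block (defaults k=3, s=1, p=1)
        self.cv2 = ConvBlock(c, c)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y    # shortcut: add the input back in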

C2f Block

The C2f block starts with a convolutional block whose resulting feature map is split into two halves. One half goes through a chain of Bottleneck blocks, while the other goes directly to the Concat block. The number of Bottleneck blocks used in the C2f block is defined by the model's depth_multiple parameter. At the end, the feature maps from the Bottleneck chain and the split feature map are concatenated and passed through a final convolutional block, as sketched below.
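A sketch of C2f in the same style; note that it follows the Ultralytics implementation, where the output of every Bottleneck in the chain is kept for the final concatenation, not just the last one:

class C2f(nn.Module):
    # Conv -> channel split -> n Bottlenecks -> concat -> Conv.
    def __init__(self, c_in, c_out, n=1, shortcut=False):
        super().__init__()
        self.c = c_out // 2    # width of each split half
        self.cv1 = ConvBlock(c_in, 2 * self.c, k=1, s=1, p=0)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))
        self.cv2 = ConvBlock((2 + n) * self.c, c_out, k=1, s=1, p=0)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))    # split the feature map in two
        y.extend(m(y[-1]) for m in self.m)       # each Bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))     # concatenate every branch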

Spatial Pyramid Pooling Fast (SPPF) Block:

The SPPF block consists of a convolutional block followed by three MaxPool2d layers applied one after another. The output of the convolutional block and the feature maps from each of the three MaxPool2d layers are then concatenated and fed to a final convolutional block.

The basic idea behind Spatial Pyramid Pooling is to divide the input image into a grid and pool features from each grid cell independently, allowing the network to handle images of different sizes effectively.

In essence, Spatial Pyramid Pooling enables neural networks to work with images of different resolutions by capturing multi-scale information through pooling operations at different levels of granularity. This can be particularly useful in tasks such as object recognition, where objects may appear at different scales within an image.

While SPP offers advantages, it can be computationally expensive. SPPF (SPP-Fast) addresses this with a simpler pooling scheme: instead of several parallel pooling operations with different kernel sizes, it applies a single fixed-size kernel (5 × 5 in YOLOv8) sequentially, so that the stacked pools reproduce the larger receptive fields of SPP at a fraction of the computation. SPPF thus offers a trade-off between accuracy and speed.

MaxPool2d Layer: Pooling layers are used to downsample the spatial dimensions of the input volume, reducing the computational complexity of the network and extracting dominant features. Max pooling is a specific type of pooling operation where, for each region in the input tensor, only the maximum value is retained, and the other values are discarded.

In the case of MaxPool2d, the pooling is applied in both the height and width dimensions of the input tensor. The layer is defined by specifying parameters such as the size of the pooling kernel and the stride. The kernel size determines the spatial extent of each pooling region, and the stride determines the step size between successive pooling regions.
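A matching sketch of SPPF, again reusing ConvBlock (the 5 × 5 pooling kernel and the halved hidden width follow the Ultralytics implementation; stride 1 with padding k // 2 keeps the resolution unchanged):

class SPPF(nn.Module):
    # Conv -> three sequential MaxPool2d layers -> concat -> Conv.
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBlock(c_in, c_hidden, k=1, s=1, p=0)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = ConvBlock(4 * c_hidden, c_out, k=1, s=1, p=0)

    def forward(self, x):
        y = [self.cv1(x)]
        for _ in range(3):
            y.append(self.pool(y[-1]))    # stacked pools widen the receptive field
        return self.cv2(torch.cat(y, dim=1))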

Detect Block

The Detect block is responsible for detecting objects. Unlike previous versions of YOLO, YOLOv8 is an anchor-free model: it directly predicts an object's center instead of the offset from a known anchor box. Anchor-free detection reduces the number of box predictions, which speeds up the complicated post-processing steps that sift through candidate detections after inference.

The Detect block contains two tracks. The first track is for bounding box predictions and the second track is for class predictions. Each track contains two convolutional blocks followed by a single Conv2d layer, which produce the bounding box predictions and the class predictions, respectively; during training, these outputs are supervised by the bounding box loss and the class loss.
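A sketch of a single Detect block with its two tracks (reg_max = 16 and the 4 × reg_max box channels mirror the distribution-focal-loss head used by Ultralytics, but treat the exact numbers here as assumptions):

class DetectHead(nn.Module):
    # One Detect block: a bounding-box track and a class track.
    def __init__(self, c_in, num_classes=80, reg_max=16):
        super().__init__()
        self.box = nn.Sequential(
            ConvBlock(c_in, c_in), ConvBlock(c_in, c_in),
            nn.Conv2d(c_in, 4 * reg_max, kernel_size=1))    # box regression bins
        self.cls = nn.Sequential(
            ConvBlock(c_in, c_in), ConvBlock(c_in, c_in),
            nn.Conv2d(c_in, num_classes, kernel_size=1))    # per-class logits

    def forward(self, x):
        return self.box(x), self.cls(x)    # box and class predictions per cell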

YOLOv8 Architecture Explained

YOLOv8 Architecture consists of three main sections: Backbone, Neck, and Head.

The Backbone is the deep learning architecture that acts as the feature extractor for the input image.

The Neck combines the features acquired from various layers of the Backbone.

The Head predicts the classes and bounding boxes of objects, which is the final output produced by the object detection model.

Backbone Section:

In Block 0, processing starts with an input image of 640 x 640 x 3, which is fed to a convolutional block with kernel size 3, stride 2, and padding 1. Using stride = 2 reduces the spatial resolution: because the kernel moves in 2-pixel increments, the convolutional block produces a 320 x 320 feature map.
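This follows from the standard convolution output-size formula:

output = ⌊(input + 2p − k) / s⌋ + 1 = ⌊(640 + 2·1 − 3) / 2⌋ + 1 = 319 + 1 = 320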

To obtain the output channel of the convolutional block, the following formula is used:

min(64, mc) * w

Here,

64 is the base output channel

mc is the max_channels limit

w is the width_multiple

For example, if we are using the “n” variant YOLOv8 model then our final output channel becomes = min(64,1024)*0.25 = 64*0.25 = 16

Likewise, this operation is calculated in every convolutional block present in the architecture.
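As a quick check, a few lines of Python reproduce this calculation for every variant (the (w, mc) pairs are taken from the Ultralytics yolov8.yaml scaling table and should be treated as assumptions here):

# Output channels of the first Conv block: min(64, mc) * w
variants = {"n": (0.25, 1024), "s": (0.50, 1024), "m": (0.75, 768),
            "l": (1.00, 512), "x": (1.25, 512)}

for name, (w, mc) in variants.items():
    print(name, int(min(64, mc) * w))
# prints: n 16, s 32, m 48, l 64, x 80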

Block 2 is a C2f block that takes two parameters, shortcut and n. Here, shortcut is a boolean parameter that denotes whether the Bottleneck blocks inside the C2f block use the shortcut connection. If shortcut = True, the Bottleneck blocks inside the C2f block use the shortcut; otherwise, they do not.

Here, n determines how many bottleneck blocks are used inside the C2f block. In the case of Block 2, n is given by:

n = 3 * d

where d= depth_multiple

For example, the depth_multiple of the "n" variant of YOLOv8 is 0.33, so the number of Bottleneck blocks used inside the C2f block becomes n = 3 * 0.33 = 0.99, which rounds to 1 Bottleneck block.

In the C2f block, both the resolution of the feature map and the number of output channels are unchanged.
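The rounding step is worth making explicit. A small sketch of the depth-scaling rule (Ultralytics rounds to the nearest integer with a floor of 1, which is assumed here):

def scaled_n(base_n, d):
    # Number of Bottleneck blocks after depth scaling, never below 1.
    return max(round(base_n * d), 1)

print(scaled_n(3, 0.33))    # 1 for YOLOv8n and YOLOv8s
print(scaled_n(3, 0.67))    # 2 for YOLOv8m
print(scaled_n(3, 1.00))    # 3 for YOLOv8l and YOLOv8x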

In Block 9, the SPPF block is used after the last C2f block of the Backbone.

The main function of the SPPF block is to generate a fixed-size feature representation of objects at various scales in an image without resizing the image or introducing spatial information loss.

Neck and Head Section

The neck section is responsible for upsampling the feature map and combining the features acquired from the various layers of the Backbone section.

The upsample layer present in the Neck section simply doubles the spatial resolution of the feature map (for example, 20 x 20 becomes 40 x 40) without making any changes to the output channels.

The Concat block sums the channel counts of the feature maps being concatenated, without any change in resolution, as in the sketch below.
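A small sketch of this upsample-and-concat step, continuing with the imports from the earlier blocks (the tensor sizes are illustrative, not taken from any specific block):

up = nn.Upsample(scale_factor=2, mode="nearest")    # doubles height and width
deep = torch.randn(1, 256, 20, 20)    # deep, low-resolution feature map
skip = torch.randn(1, 128, 40, 40)    # skip feature from the Backbone

merged = torch.cat([up(deep), skip], dim=1)
print(merged.shape)    # torch.Size([1, 384, 40, 40]) -> channels sum: 256 + 128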

The Head section predicts the classes and bounding boxes of objects, producing the final output of the object detection model.

The first Detect block in the Head section specializes in detecting small objects and takes its input from the C2f block in Block 15.

The second Detect block specializes in detecting medium-sized objects and takes its input from the C2f block in Block 18.

The third Detect block specializes in detecting large objects and takes its input from the C2f block in Block 21.

Conclusion

In conclusion, YOLOv8, an evolution of the YOLO family, redefines object detection with its anchor-free architecture, balancing speed and accuracy across various model variants. Utilizing convolutional and bottleneck blocks, alongside innovative features like Spatial Pyramid Pooling Fast, YOLOv8 efficiently processes images for real-time detection. Its backbone, neck, and head sections synergize to extract features, upsample, and predict classes and bounding boxes. With its versatility, YOLOv8 offers a range of models catering to diverse needs, from rapid inference to high accuracy. Overall, YOLOv8 represents a pinnacle in object detection, empowering applications with unparalleled performance and user-friendliness.
