Note: This note is from the Deeplizard Intro to Deep Learning course
An artificial neural network is a computing system made up of connected neurons organized into layers: an input layer, hidden layers, and an output layer. Neurons are also called nodes. The number of nodes in each hidden layer is a design choice, whereas the input and output layers must match the input data and the desired outputs.
![[Pasted image 20241125122618.png]]
from keras.models import Sequential
from keras.layers import Dense, Activation

# Define the layers first, then pass them to the model.
layers = [
    Dense(units=3, input_shape=(2,), activation='relu'),
    Dense(units=2, activation='softmax')
]
model = Sequential(layers)

There are, however, different types of layers. Some examples include:
- Dense (or fully connected) layers
- Convolutional layers -> suitable for dealing with images
- Pooling layers
- Recurrent layers -> suitable for time series data
- Normalization layers
![[Pasted image 20241125124144.png]]
Given this image the Keras code implementation will be:
from keras.models import Sequential
from keras.layers import Dense, Activation
layers = [
Dense(units=6, input_shape=(8,), activation='relu'),
Dense(units=6, activation='relu'),
Dense(units=4, activation='softmax')
]
model = Sequential(layers)

Activation Functions
Activation functions transform the weighted sum of a neuron's inputs into its output. This allows neurons to model non-linear relationships.
The sigmoid activation function outputs values between 0 and 1, while the ReLU (rectified linear unit) activation function outputs 0 for negative inputs and the input value for positive inputs.
Using non-linear activation functions, like sigmoid and ReLU, enables neural networks to learn arbitrarily complex functions, unlike purely linear models.
Why use an activation function? https://deeplizard.com/learn/video/m0pIlLfpXWE
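As a rough illustration of the two activation functions described above, here is a minimal pure-Python sketch. The function names `sigmoid` and `relu` and the sample inputs are chosen here for illustration and are not taken from the course code.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, z)

# Compare both activations on a negative, a zero, and a positive input.
for z in (-2.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 4), relu(z))
```

Sigmoid always stays strictly between 0 and 1, while ReLU passes positive inputs through unchanged and clips negative ones to 0.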
# Another way to code the layers in keras
from keras.models import Sequential
from keras.layers import Dense, Activation
# Approach 1
model = Sequential([
Dense(units=5, input_shape=(3,), activation='relu')
])
# Approach 2
model = Sequential()
model.add(Dense(units=5, input_shape=(3,)))
model.add(Activation('relu'))

Training a Neural Network
Training a neural network means solving an optimization problem: finding the weights within the model that most accurately map the input data to the correct output class.
Optimization algorithm
Weights are optimized using an optimization algorithm, also called an optimizer. The objective of an optimizer such as SGD is to minimize a given function, called the loss function, by updating the model's weights so that the loss gets as close to its minimum value as possible.
The most widely known optimizer is called stochastic gradient descent (SGD).
Loss function
The loss function measures how far off a model's predictions are from the true target values. The lower the loss, the better the model's performance.
In a regression task (predicting a number), a common loss function is Mean Squared Error (MSE), which calculates the average squared difference between predicted and actual values over the $n$ samples.
$$ MSE = \frac{1}{n}\sum_{i=1}^{n} (output_i - label_i)^2 $$
In a classification task (predicting a category), Cross-Entropy Loss measures how well the predicted probabilities align with the true class labels.
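To make the two loss functions concrete, the following is a small sketch in plain Python. The helper names and the sample numbers are assumptions made for illustration, and the cross-entropy shown is the single-sample form for an integer class label.

```python
import math

def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the true class (one sample)."""
    return -math.log(predicted_probs[true_class])

# Regression: three predictions compared against three target values.
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

# Classification: a confident correct prediction versus a less confident one.
confident = cross_entropy([0.9, 0.1], true_class=0)
unsure = cross_entropy([0.6, 0.4], true_class=0)
print(mse, confident, unsure)
```

The more probability the model puts on the true class, the lower the cross-entropy, which is why `confident` comes out smaller than `unsure`.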
[!note] Repeatedly passing the same data through the model is what we call training. It is during the training phase that the model actually learns.
Gradient of the loss function
After the loss is calculated, the gradient of this loss function is computed with respect to each of the weights within the network.
[!Note] Gradient is just a word for the derivative of a function of several variables. A technique called backpropagation is used to calculate the gradient of the loss with respect to the weights.
The gradient points in the direction in which the loss increases fastest, so stepping in the opposite direction lowers the loss and moves the weights closer to the minimum.
Learning rate
We then multiply the gradient value by something called a learning rate. A learning rate is a small number usually ranging between 0.01 and 0.0001, but the actual value can vary. The learning rate tells us how large of a step we should take in the direction of the minimum.
Updating the weights
Conceptually, we can think of the learning rate as the step size. Each weight is updated by subtracting the product of the learning rate and the gradient: $$ \text{new weight} = \text{old weight} - (\text{learning rate} \times \text{gradient}) $$
[!note] This explanation focused on one single weight, but the same process happens with each of the weights in the model each time data passes through it.
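The update rule can be tried out on a toy problem. The sketch below assumes a made-up loss $(w - 3)^2$ with a single weight, so the minimum is known to be at $w = 3$. The function names, the learning rate of 0.1, and the 50 steps are illustrative choices only.

```python
def update_weight(weight, gradient, learning_rate):
    """new weight = old weight - (learning rate * gradient)"""
    return weight - learning_rate * gradient

def loss_gradient(weight):
    """Derivative of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (weight - 3.0)

weight = 0.0
for step in range(50):
    weight = update_weight(weight, loss_gradient(weight), learning_rate=0.1)

# After repeated small steps the weight sits very close to the minimum at 3.
print(weight)
```

Each step moves the weight a fraction of the way towards the minimum. A larger learning rate takes bigger steps, and a smaller one takes more steps to get there.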
The model is learning
This updating of the weights is essentially what we mean when we say that the model is learning. It's learning what values to assign to each weight based on how those incremental changes are affecting the loss function. As the weights change, the network is getting smarter in terms of accurately mapping inputs to the correct output.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy

model = Sequential([
Dense(units=16, input_shape=(1,), activation='relu'),
Dense(units=32, activation='relu'),
# softmax so that the two outputs form a probability distribution over the classes
Dense(units=2, activation='softmax')
])
# Before training the model, it has to be compiled.
model.compile(
optimizer=Adam(learning_rate=0.0001),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# Finally fit the model to the data a.k.a train the model on the data
model.fit(
x=scaled_train_samples,
y=train_labels,
batch_size=10,
epochs=20,
shuffle=True,
verbose=2
)
- `scaled_train_samples` is a NumPy array consisting of the training samples.
- `train_labels` is a NumPy array consisting of the corresponding labels for the training samples.
- `batch_size=10` specifies how many training samples should be sent to the model at once.
- `epochs=20` means that the complete training set (all of the samples) will be passed to the model a total of 20 times.
- `shuffle=True` indicates that the data should first be shuffled before being passed to the model.
- `verbose=2` indicates how much logging we will see as the model trains.
Datasets for Deep Learning
For training and testing purposes we break the datasets into three:
- Training -> set of data used to train the model
- Validation -> set of data, separate from the training set, that is used to validate our model during training. It gives information that helps in choosing good hyperparameters and in detecting overfitting and underfitting.
- Testing -> set of data used to test the model after it has already been trained.
[!note] Only the training set updates the weights of the network. The test set is passed to the model without the corresponding labels, unlike the training and validation sets.
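One possible way to produce the three sets is sketched below in plain Python. The 70/15/15 proportions, the fixed seed, and the helper name `split_dataset` are assumptions for illustration and do not come from the course.

```python
import random

def split_dataset(samples, train_frac=0.7, valid_frac=0.15, seed=0):
    """Shuffle a copy of the samples and cut it into train/validation/test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = split_dataset(list(range(100)))
print(len(train), len(valid), len(test))
```

Shuffling before cutting keeps the three sets representative of the same data, and every sample ends up in exactly one of them.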
Predicting with a Neural Network
Predictions are based on what the model learned during training. They are performed on the testing dataset, i.e. on samples passed to the model without labels.
predictions = model.predict(
x=scaled_test_samples,
batch_size=10,
verbose=0
)
# This returns a NumPy array with the predicted class probabilities for each sample, e.g.
# [ 0.7410683 0.2589317]
# [ 0.14958295 0.85041702]
for p in predictions:
    print(p)

Overfitting
Overfitting happens when the model becomes very good at classifying data included in the training set but not so good at classifying data it hasn't seen.
- This can be detected if the validation metrics are considerably worse than the training metrics.
Common ways to deal with overfitting include:
- adding more training data, as it adds more diversity to the training set
- data augmentation - modifying the data in the training set
- dropout - randomly ignoring some neurons in the ANN during training
Underfitting
Underfitting occurs when a model is not able to classify the data it was trained on well, let alone data it hasn't seen before. This can be detected if the metrics on the training data are poor or the training loss is very high.
To deal with underfitting:
- increase the complexity of the model
- add more features to the input samples
- reduce the dropout rate
Supervised learning
Supervised learning occurs when the data in the training set is labelled. With supervised learning, each piece of data passed to the model during training is a pair that consists of the input object, or sample, along with the corresponding label or output value.
The labels can be encoded numerically, for example as 0 and 1 for two classes.
All that is needed to train a neural network with supervised learning is to provide the training samples together with their labels.
import numpy as np

train_samples = np.array([
[150, 67],
[130, 60],
[200, 56],
[125, 52],
[230, 72],
[181, 70]
])
train_labels = np.array([1, 1, 0, 1, 0, 0])
model.fit(
x=train_samples,
y=train_labels,
batch_size=3,
epochs=10,
shuffle=True,
verbose=2)

Unsupervised Learning
Unsupervised learning occurs when the data in the training set is not labelled. Because the data has no labels, there is no way to measure accuracy when training with unsupervised learning.
The model is given an unlabeled dataset and tries to learn some useful features or structure from it. This idea is used in clustering algorithms and in autoencoders.
Autoencoders are artificial neural networks that take in an input and output a reconstruction of this input. The goal is to make the reconstructed output as close to the original input as possible.
Semi-supervised learning
Semi-supervised learning takes a middle ground between supervised and unsupervised learning, using a combination of labeled and unlabeled data.
Pseudo-Labeling
Some portion of the dataset is labeled and used to train the model. The trained model is then used to predict labels for the remaining unlabeled portion of the data, and these predicted (pseudo) labels are used to train on the full dataset.
Pseudo-labeling allows us to train on a vastly larger dataset than the labeled portion alone.
Data Augmentation
Data augmentation creates new data by modifying existing data, for example by flipping or rotating images.
It can be useful for:
- creating new data
- reducing overfitting
[!warning] Not all data augmentation techniques are appropriate for a given dataset.
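As a tiny sketch of the idea, the snippet below treats an image as a nested list of pixel values and creates two modified copies. The helper names and the 2 x 2 example image are hypothetical. Real pipelines usually rely on a library for this.

```python
def flip_horizontal(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

image = [
    [1, 2],
    [3, 4],
]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
print(flipped, rotated)
```

Both results contain the same pixels as the original but in a different arrangement, so they can be added to the training set as extra samples when such changes do not alter the label.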
One-hot Encoding
One-hot encoding is a type of encoding that is widely used for representing categorical data with numerical values.
One-hot encodings transform our categorical labels into vectors of 0s and 1s. The length of these vectors is the number of classes or categories that our model is expected to classify.
In each one-hot vector, exactly one index is hot (set to 1), and all the others are 0.
| Label | Index-0 | Index-1 | Index-2 | Vector |
|---|---|---|---|---|
| Cat | 1 | 0 | 0 | [1,0,0] |
| Dog | 0 | 1 | 0 | [0,1,0] |
| Lizard | 0 | 0 | 1 | [0,0,1] |
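The table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `one_hot` is chosen here, and the class order simply follows the table.

```python
def one_hot(label, classes):
    """Return a vector of 0s with a single 1 at the index of the given label."""
    vector = [0] * len(classes)
    vector[classes.index(label)] = 1
    return vector

classes = ["Cat", "Dog", "Lizard"]
for label in classes:
    print(label, one_hot(label, classes))
```

The length of each vector equals the number of classes, and the position of the 1 identifies the class.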
Convolutional Neural Networks (CNNs)
CNNs are artificial neural networks that have so far been most popularly used for analyzing images in computer vision tasks, thanks to their ability to detect patterns.
What makes a CNN different from a multilayer [[Neural Networks#Perceptron|perceptron]] (MLP) is that it has hidden layers called convolutional layers.
Convolutional layers take an input (input channels) and transform it to produce an output (output channels). The convolution operations performed are technically cross-correlations.
[!note] With each convolutional layer, we need to specify the number of filters the layer should have. These filters are actually what detect the patterns.
[!note] The number of filters determines the number of output channels.
A filter (pattern detector) can be thought of as a relatively small matrix (tensor) for which we decide the number of rows and columns. The values within this matrix are initialized with random numbers, and pattern detectors emerge as the network learns.
A feature map is the output channel that results from the filter convolving the entire input. The feature map is then used as an input to other CNN layers in the network.
The dot product
Suppose we have two 3 x 3 matrices A and B as follows.
$$ A= \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} $$
$$ B= \begin{bmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{bmatrix} $$
Then we sum the pairwise products like this:
$$ a_{1,1}b_{1,1}+a_{1,2}b_{1,2}+\cdots +a_{3,3}b_{3,3} $$
Technically this operation is the summation of the element-wise products. Even so, you may still encounter the term "dot product" used loosely to refer to this operation.
[!note] You may also see this operation referred to as the Frobenius inner product, or as the sum of the entries of the Hadamard product.
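A minimal sketch of this operation in plain Python follows. The two example matrices and the function name are made up for illustration. `B` happens to look like a simple vertical-edge filter, but any 3 x 3 values would do.

```python
def sum_of_elementwise_products(a, b):
    """Multiply matching entries of two equally sized matrices and add them up."""
    return sum(
        a[i][j] * b[i][j]
        for i in range(len(a))
        for j in range(len(a[0]))
    )

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
B = [[1, 0, -1],
     [1, 0, -1],
     [1, 0, -1]]

result = sum_of_elementwise_products(A, B)
print(result)
```

In a convolutional layer, `A` would be one 3 x 3 patch of the input and `B` the filter, and `result` would become one entry of the feature map.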
Zero Padding
Zero padding is a technique that preserves the original input size as the input passes through the convolutional layers.
The problem is that, without padding, the output becomes smaller and smaller as it is convolved through the network, and some information at the edges can be lost. Zero padding solves this by adding a border of zeros around the input.
In general, if our image is of size $n \times n$ and we convolve it with an $f \times f$ filter, then the size of the resulting output is $(n-f+1) \times (n-f+1)$.
There are two types of padding: valid, which means no padding (the input size is not maintained), and same, which pads so that the output size is the same as the input size.
[!note] Sometimes you need more than a single border of padding around the original input. Libraries such as TensorFlow determine the required padding automatically when you ask for same padding.
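The size formula can be checked with a short sketch. It assumes a stride of 1 and an odd filter size, with `p` zeros added on each side, so the output size becomes $n + 2p - f + 1$. The function names are chosen here for illustration.

```python
def conv_output_size(n, f, p=0):
    """Output width/height for an n x n input, f x f filter, stride 1, padding p per side."""
    return n + 2 * p - f + 1

def same_padding(f):
    """Zeros needed on each side so the output size equals the input size (odd f, stride 1)."""
    return (f - 1) // 2

# Valid padding: a 20 x 20 input shrinks after 3 x 3, 5 x 5 and 7 x 7 filters.
size = 20
valid_sizes = []
for f in (3, 5, 7):
    size = conv_output_size(size, f)
    valid_sizes.append(size)

# Same padding: the size is preserved for every filter.
same_sizes = [conv_output_size(20, f, p=same_padding(f)) for f in (3, 5, 7)]
print(valid_sizes, same_sizes)
```

Larger filters need more than one border of zeros to keep the size unchanged. For example, a 5 x 5 filter needs 2 zeros on each side.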
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten

model_valid = Sequential([
Dense(16, input_shape=(20,20,3), activation='relu'),
Conv2D(32, kernel_size=(3,3), activation='relu', padding='valid'),
Conv2D(64, kernel_size=(5,5), activation='relu', padding='valid'),
Conv2D(128, kernel_size=(7,7), activation='relu', padding='valid'),
Flatten(),
Dense(2, activation='softmax')
])

Summary Results
> model_valid.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 20, 20, 16) 64
_________________________________________________________________
conv2d_1 (Conv2D) (None, 18, 18, 32) 4640
_________________________________________________________________
conv2d_2 (Conv2D) (None, 14, 14, 64) 51264
_________________________________________________________________
conv2d_3 (Conv2D) (None, 8, 8, 128) 401536
_________________________________________________________________
flatten_1 (Flatten) (None, 8192) 0
_________________________________________________________________
dense_3 (Dense) (None, 2) 16386
=================================================================
Total params: 473,890
Trainable params: 473,890
Non-trainable params: 0
_________________________________________________________________
Here, the output shape decreases as it convolves through the layers
model_same = Sequential([
Dense(16, input_shape=(20,20,3), activation='relu'),
Conv2D(32, kernel_size=(3,3), activation='relu', padding='same'),
Conv2D(64, kernel_size=(5,5), activation='relu', padding='same'),
Conv2D(128, kernel_size=(7,7), activation='relu', padding='same'),
Flatten(),
Dense(2, activation='softmax')
])

Summary Results
> model_same.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_6 (Dense) (None, 20, 20, 16) 64
_________________________________________________________________
conv2d_7 (Conv2D) (None, 20, 20, 32) 4640
_________________________________________________________________
conv2d_8 (Conv2D) (None, 20, 20, 64) 51264
_________________________________________________________________
conv2d_9 (Conv2D) (None, 20, 20, 128) 401536
_________________________________________________________________
flatten_3 (Flatten) (None, 51200) 0
_________________________________________________________________
dense_7 (Dense) (None, 2) 102402
=================================================================
Total params: 559,906
Trainable params: 559,906
Non-trainable params: 0
_________________________________________________________________
Here, the output shape remains the same as it convolves through the CNN layers
Max Pooling
[[Neural Networks#Max-pooling|Max pooling]] is an operation that is typically added to CNNs following individual convolutional layers. It reduces the dimensionality of the output channel after convolution and can also help reduce overfitting.
Max pooling works like this. We define some n x n region as the filter for the max pooling operation. We're going to use 2 x 2 in this example.
We define a stride, which determines how many pixels we want our filter to move as it slides across the image.
[!note] Stride determines how many units the filter slides.
On the convolutional output, we take the first 2 x 2 region and calculate the max of the values in that block. This value is stored in the output channel, which makes up the full output from this max pooling operation.
We move over by the number of pixels that we defined our stride size to be. We're using 2 here, so we just slide over by 2, then do the same thing. We calculate the max value in the next 2 x 2 block, store it in the output, and then, go on our way sliding over by 2 again.
Once we reach the edge over on the far right, we then move down by 2 (because that's our stride size), and then we do the same exact thing of calculating the max value for the 2 x 2 blocks in this row.
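The sliding procedure described above can be sketched in plain Python as follows. The 4 x 4 input values and the function name `max_pool_2d` are made up for illustration, and no padding is applied.

```python
def max_pool_2d(image, pool=2, stride=2):
    """Slide a pool x pool window over the image and keep the max of each window."""
    rows = (len(image) - pool) // stride + 1
    cols = (len(image[0]) - pool) // stride + 1
    output = []
    for r in range(rows):
        output_row = []
        for c in range(cols):
            window = [
                image[r * stride + i][c * stride + j]
                for i in range(pool)
                for j in range(pool)
            ]
            output_row.append(max(window))
        output.append(output_row)
    return output

image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool_2d(image)
print(pooled)
```

With a 2 x 2 window and a stride of 2, the windows do not overlap, so the 4 x 4 input is reduced to a 2 x 2 output.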
model_valid = Sequential([
Dense(16, input_shape=(20,20,3), activation='relu'),
Conv2D(32, kernel_size=(3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'),
Conv2D(64, kernel_size=(5,5), activation='relu', padding='same'),
Flatten(),
Dense(2, activation='softmax')
])
Here the padding for the max pooling layer is set to valid, as usually no padding is required for this operation.
Summary Results
> model_valid.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 20, 20, 16) 64
_________________________________________________________________
conv2d_1 (Conv2D) (None, 20, 20, 32) 4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 10, 10, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 10, 10, 64) 51264
_________________________________________________________________
flatten_1 (Flatten) (None, 6400) 0
_________________________________________________________________
dense_2 (Dense) (None, 2) 12802
=================================================================
Total params: 68,770
Trainable params: 68,770
Non-trainable params: 0
_________________________________________________________________
Backpropagation
Backpropagation (backward propagation of errors) is the core algorithm for training neural networks. It calculates the gradient of the loss function with respect to each weight in the network, enabling the model to learn by adjusting weights to minimize the loss.
Why We Need It
- Neural networks learn by optimizing weights to reduce prediction errors (loss). Without knowing how much each weight contributes to the error, we can't update them effectively.
- Backpropagation efficiently computes these contributions (gradients) for all weights, layer by layer, using the chain rule of calculus. This is crucial because manually tweaking weights or computing gradients for complex networks would be impractical.
How It Works
- Forward Pass: Input data passes through the network, producing an output and a loss (e.g., MSE or cross-entropy) by comparing the output to the target.
- Backward Pass: The loss is propagated backward through the network. For each weight, we compute the partial derivative of the loss with respect to that weight, $\frac{\partial L}{\partial w}$.
- Weight Update: These gradients are used to adjust weights via an optimization algorithm like SGD.
Stochastic Gradient Descent (SGD) and Weight Updates
SGD is an optimization method that updates weights iteratively to minimize the loss. Unlike standard gradient descent (which uses the entire dataset), SGD uses a single sample or small batch, making it faster and suitable for large datasets.
How SGD Works Behind the Scenes
- Gradient Computation: For a given weight $w$, compute the gradient of the loss, $\frac{\partial L}{\partial w}$, using backpropagation.
- Update Rule: Adjust the weight using the gradient and a learning rate $\eta$:
$$
w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}
$$
- $\eta$ controls the step size. Too large, and it overshoots; too small, and it's slow.
- Stochasticity: Repeat this for each sample (or batch), introducing randomness that helps escape local minima but may make convergence noisier.
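The loop described above might look like the following sketch for a model with a single weight. It assumes toy data generated from $y = 2x$, a per-sample loss of $\frac{1}{2}(wx - y)^2$, and a batch size of one. The learning rate, epoch count, and seed are illustrative choices.

```python
import random

def sgd_fit(samples, learning_rate=0.05, epochs=50, seed=0):
    """Fit y = w * x by updating w after every single sample."""
    rng = random.Random(seed)
    data = samples[:]
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for x, y in data:
            gradient = (w * x - y) * x  # d/dw of 0.5 * (w * x - y)^2
            w = w - learning_rate * gradient
    return w

samples = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
learned_w = sgd_fit(samples)
print(learned_w)
```

Because the data here is noise-free, the weight settles very close to 2. With real data, the per-sample updates would make the path towards the minimum noisier.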
Calculus of Gradient Computation
Let's illustrate with a simple network: one input $x$, one weight $w$, one neuron with weighted input $z = w \cdot x$, an activation function (e.g., sigmoid, $\sigma(z) = \hat{y}$), and a loss $L$ (e.g., MSE: $L = \frac{1}{2} (y - \hat{y})^2$).
- Chain Rule: To find $\frac{\partial L}{\partial w}$, apply the chain rule backward: $$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} $$
- Step-by-Step:
    - Loss derivative: For MSE, $L = \frac{1}{2} (y - \hat{y})^2$, $$ \frac{\partial L}{\partial \hat{y}} = \hat{y} - y $$
    - Activation derivative: For sigmoid $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$, $$ \frac{\partial \hat{y}}{\partial z} = \sigma(z) \cdot (1 - \sigma(z)) = \hat{y} (1 - \hat{y}) $$
    - Neuron derivative: Since $z = w \cdot x$, $$ \frac{\partial z}{\partial w} = x $$
- Combine: $$ \frac{\partial L}{\partial w} = (\hat{y} - y) \cdot \hat{y} (1 - \hat{y}) \cdot x $$ This gradient tells us how much $w$ affects the loss.
- Multi-Layer Case: For deeper networks, the process repeats layer by layer, propagating the error backward using the chain rule across all weights, summing contributions from downstream neurons.
This process repeats until the loss converges, tuning the network to fit the data.
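One way to sanity-check the combined gradient formula is to compare it with a numerical estimate. The sketch below does this for the single-weight network above. The values chosen for `w`, `x`, and `y` are arbitrary, and the finite-difference step `eps` is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """L = 0.5 * (y - y_hat)^2 with y_hat = sigmoid(w * x)."""
    y_hat = sigmoid(w * x)
    return 0.5 * (y - y_hat) ** 2

def analytic_gradient(w, x, y):
    """Chain rule result: (y_hat - y) * y_hat * (1 - y_hat) * x."""
    y_hat = sigmoid(w * x)
    return (y_hat - y) * y_hat * (1.0 - y_hat) * x

def numeric_gradient(w, x, y, eps=1e-6):
    """Central finite difference approximation of dL/dw."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)

w, x, y = 0.5, 2.0, 1.0
analytic = analytic_gradient(w, x, y)
numeric = numeric_gradient(w, x, y)
print(analytic, numeric)
```

The two values should agree to many decimal places. The gradient is negative here because the prediction is below the target, so increasing `w` would reduce the loss.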
Vanishing & Exploding Gradient
This is a problem that arises during backpropagation. The vanishing gradient problem mostly affects weights in the earlier layers of the network: the gradient of the loss with respect to these weights becomes really small, vanishingly small.
If the gradient is vanishingly small, then this update is, in turn, going to be vanishingly small as well.
Stuck Weights
Now, we can think about if the gradient that we obtain with respect to this weight is already really small, i.e., vanishing, then by the time we multiply it by the learning rate, the product is going to be even smaller, and so when we subtract this teeny tiny number from the weight, it's just barely going to move the weight at all.
Essentially, the weight gets into this kind of stuck state. Not moving, not learning, and therefore not really helping to meet the overall objective of minimizing the loss of the network.
The more terms we're multiplying together that are less than one, the quicker the gradient is going to vanish.
Exploding Gradient
Now think about calculating the gradient with respect to the same weight, but instead of really small terms, what if they were large? And by large, we mean greater than one.
Meaning if we multiply a bunch of terms together that are all greater than one, we're going to get something greater than one, and perhaps even a lot greater than one.
Hence, instead of barely moving our weight with this update, we're going to move it so much that the optimal value for this weight is overshot. The amount by which the weight is updated at each step is just too large, so the weight keeps moving further and further away from its optimal value.
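The effect of multiplying many terms together is easy to see numerically. In the sketch below, the term values 0.25 and 1.5 and the count of 10 are illustrative only (0.25 happens to be the largest value the sigmoid derivative can take).

```python
def product_of_terms(term, count):
    """Multiply the same term by itself `count` times."""
    result = 1.0
    for _ in range(count):
        result *= term
    return result

vanishing = product_of_terms(0.25, 10)  # every term is less than one
exploding = product_of_terms(1.5, 10)   # every term is greater than one
print(vanishing, exploding)
```

Ten factors are already enough to push the first product below one millionth and the second above fifty, which is why gradients for early layers in deep networks are so sensitive to the size of these terms.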
Weight Initialization
Weights are what connect the nodes between layers. Suppose the weights are randomly initialized from a distribution with a mean of 0 and a standard deviation of 1.
The input to a node in the next layer is a weighted sum over many nodes in the previous layer. Because many such terms are added together, the variance of this sum can be much larger than one, which can lead to issues such as [[#Vanishing & Exploding Gradient]].
To deal with this problem, we shrink the variance of the weights feeding into each node so that the variance of the weighted sum stays small.
This can be achieved with Xavier initialization. It sets the variance of the weights connected to a given node to $\frac{1}{n}$, where $n$ is the number of weights connected to this node from the previous layer.
So, rather than the distribution of these weights be centered around $0$ with a variance of $1$, which is what we had earlier, they are now still centered around 0, but with a significantly smaller variance, $\frac{1}{n}$.
It turns out that, to get these weights to have this variance of $\frac{1}{n}$, what we do is, after randomly generating the weights centered around $0$ with variance $1$, we multiply each of them by $\sqrt{1/n}$. Doing this causes the variance of these weights to shift from $1$ to $\frac{1}{n}$. This type of initialization is referred to as Xavier initialization and also Glorot initialization.
[!note] For relu activation, the ideal value for the variance is $\frac{2}{n}$ rather than $\frac{1}{n}$.
When Xavier initialization was originally proposed, the suggested variance was $\frac{2}{n_{in} + n_{out}}$, where $n_{in}$ is the number of weights coming into this neuron and $n_{out}$ is the number of weights coming out of it.
By default, Keras uses `glorot_uniform` as the `kernel_initializer`. `glorot_normal` can also be provided explicitly.
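The scaling step behind the $\frac{1}{n}$ variance can be sketched without any deep learning library. Here `n_in = 250`, the sample count, and the seed are hypothetical values picked for illustration, and the code follows the simple $\frac{1}{n}$ variant rather than what Keras does internally.

```python
import math
import random

def xavier_weights(n_in, count, seed=0):
    """Draw weights with mean 0 and variance 1, then scale them by sqrt(1 / n_in)."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 / n_in)
    return [rng.gauss(0.0, 1.0) * scale for _ in range(count)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

n_in = 250
weights = xavier_weights(n_in, count=20000)
print(variance(weights), 1.0 / n_in)
```

The measured variance should land close to $\frac{1}{n}$, since multiplying a random variable by a constant scales its variance by the square of that constant.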
How bias impacts training
Biases are learnable parameters in neural networks, assigned to each neuron, that enhance a model's flexibility by adjusting the activation threshold. Alongside weights, biases are updated during training via backpropagation and Stochastic Gradient Descent (SGD), allowing the network to better fit the data.
Role and Implementation
- Definition: A bias acts like a threshold, determining whether a neuron activates (fires) by shifting the input to the activation function.
- How It Fits: For a neuron, the bias is added to the weighted sum of inputs before passing it to the activation function (e.g., ReLU, sigmoid). Mathematically, if $z = \sum (w_i \cdot x_i)$ is the weighted sum, the activation input becomes $z + b$, where $b$ is the bias.
- Purpose: Without bias, the activation threshold is fixed (e.g., 0 for ReLU), limiting the model. Bias shifts this threshold, controlling when a neuron activates.
Example
Consider a neuron with inputs $x_1 = 1$, $x_2 = 2$, weights $w_1 = -2$, $w_2 = 1$, and ReLU activation $(\text{ReLU}(x) = \max(0, x))$:
- No Bias: Weighted sum = $(1 \cdot -2) + (2 \cdot 1) = -2 + 2 = 0$. $\text{ReLU}(0) = 0$, so the neuron doesn't fire.
- With Bias $b = 3$: New input = $0 + 3 = 3$. $\text{ReLU}(3) = 3$, so the neuron fires. The bias shifts the threshold from 0 to -3, increasing flexibility.
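The example can be reproduced with a few lines of Python. This is a minimal sketch in which the helper names `relu` and `neuron_output` are chosen for illustration.

```python
def relu(z):
    return max(0.0, z)

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus the bias, passed through ReLU."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return relu(weighted_sum + bias)

inputs = [1.0, 2.0]
weights = [-2.0, 1.0]

without_bias = neuron_output(inputs, weights)
with_bias = neuron_output(inputs, weights, bias=3.0)
print(without_bias, with_bias)
```

With the same inputs and weights, only the bias decides whether this neuron produces a non-zero output.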
Why We Need Biases
- Biases allow the network to adjust what constitutes a "meaningful" activation, preventing neurons from being stuck at zero output (e.g., in ReLU) when the weighted sum alone isn't sufficient. This makes the model more expressive and capable of capturing complex patterns.
Learnable Parameters
A learnable parameter is a parameter that is learned by the network during training. Learnable parameters are also referred to as trainable parameters.
To calculate the number of learnable parameters, we need the following within an individual layer:
- The number of inputs to the layer.
- The number of outputs from the layer.
- Whether or not the layer contains biases.
Example: assume the following network architecture, with bias terms.
| Layer | Number of Nodes | Calculation |
|---|---|---|
| Input | 2 | $0\;(\text{since this is the original input})$ |
| Hidden | 3 | $2 * 3 + 3\;(\text{bias terms from the hidden layer}) = 9$ |
| Output | 2 | $3 * 2 + 2\;(\text{bias terms from the output layer}) = 8$ |

The total number of learnable parameters is $0 + 9 + 8 = 17$.
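The same count can be written as a small helper. The function name `dense_params` is made up for this sketch. It multiplies inputs by outputs for the weights and adds one bias per output node.

```python
def dense_params(n_inputs, n_outputs, use_bias=True):
    """Weights (inputs * outputs) plus one bias per output node, if biases are used."""
    return n_inputs * n_outputs + (n_outputs if use_bias else 0)

hidden = dense_params(2, 3)   # input layer (2 nodes) -> hidden layer (3 nodes)
output = dense_params(3, 2)   # hidden layer (3 nodes) -> output layer (2 nodes)
total = hidden + output       # the input layer itself has no parameters
print(hidden, output, total)
```

Setting `use_bias=False` drops the bias terms, which would leave only the 6 + 6 weights in this network.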
Learnable Parameters in CNNs
In CNNs, the learnable parameters are the same as in fully connected layers (weights and biases), but the kernel size and the number of filters also need to be considered.
Suppose we have a CNN made up of an input layer, two hidden convolutional layers, and a dense output layer.
![[Pasted image 20250302204429.png]]
The total learnable parameters are $0+56+57+2402 = 2515$.
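The figure is not reproduced in text, so the sketch below relies on assumptions chosen to match the totals quoted above: a 20 x 20 x 3 input, two convolutional layers with 2 and 3 filters of size 3 x 3, same padding (so the spatial size stays 20 x 20), and a dense output layer with 2 nodes. The helper names are hypothetical.

```python
def conv_params(in_channels, n_filters, kernel_h, kernel_w, use_bias=True):
    """Each filter has in_channels * kernel_h * kernel_w weights, plus one bias."""
    weights = in_channels * kernel_h * kernel_w * n_filters
    return weights + (n_filters if use_bias else 0)

def dense_params(n_inputs, n_outputs, use_bias=True):
    return n_inputs * n_outputs + (n_outputs if use_bias else 0)

conv1 = conv_params(in_channels=3, n_filters=2, kernel_h=3, kernel_w=3)
conv2 = conv_params(in_channels=2, n_filters=3, kernel_h=3, kernel_w=3)
# Assumed: the 20 x 20 x 3 output of the second conv layer is flattened before the dense layer.
dense = dense_params(20 * 20 * 3, 2)
total = 0 + conv1 + conv2 + dense
print(conv1, conv2, dense, total)
```

Note that the number of input channels of a convolutional layer equals the number of filters in the layer before it.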
Regularization
Regularization is a technique that helps reduce overfitting or reduce variance in our network by penalizing for complexity.
We add a term $x$ to the loss, giving $loss + x$, that penalizes large weights.
L2 regularization
The most common regularization technique is called L2 regularization. With L2 regularization, the term we're adding to the loss is the sum of the squared norms of the weight matrices
$\sum_{j=1}^{n}\left\Vert w^{[j]}\right\Vert ^{2},$
multiplied by a small constant
$\frac{\lambda }{2m}$.
Adding the term to the loss
Let's look at what L2 regularization looks like. We have
$$ loss + \left( \sum_{j=1}^{n}\left\Vert w^{[j]}\right\Vert ^{2}\right)\frac{\lambda }{2m}. $$
The table below gives the definition for each variable in the expression above.
| Variable | Definition |
|---|---|
| $n$ | Number of layers |
| $w^{[j]}$ | Weight matrix for the $j^{th}$ layer |
| $m$ | Number of inputs |
| $\lambda$ | Regularization parameter |
The term $\lambda$ is called the regularization parameter. It is another hyperparameter that we have to choose, test, and tune for our specific model.
Impact of regularization
Using L2 regularization as an example, if we set λ to be large, the model is incentivized to keep the weights close to zero, because the objective of SGD is to minimize the loss function, which now includes the penalty term.
If λ is large, then the factor $\frac{\lambda }{2m}$ is relatively large, and multiplying it by the sum of the squared norms gives a product that may be relatively large, depending on how large our weights are. The model is therefore incentivized to make the weights small so that the value of the entire function stays small.
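To see how λ scales the penalty, here is a small sketch that computes the added term for some made-up weight matrices. The weights, the value of `m`, and the function name `l2_penalty` are assumptions for illustration, and the squared norm is taken as the sum of the squared entries of each matrix.

```python
def l2_penalty(weight_matrices, lam, m):
    """(sum of squared entries over all weight matrices) * lambda / (2 * m)"""
    squared_norms = sum(
        w ** 2
        for matrix in weight_matrices
        for row in matrix
        for w in row
    )
    return squared_norms * lam / (2 * m)

weight_matrices = [
    [[1.0, -2.0], [0.5, 0.0]],  # weights of a first layer
    [[3.0], [-1.0]],            # weights of a second layer
]

small_lambda = l2_penalty(weight_matrices, lam=0.1, m=10)
large_lambda = l2_penalty(weight_matrices, lam=1.0, m=10)
print(small_lambda, large_lambda)
```

With the same weights, a ten times larger λ gives a ten times larger penalty, so the optimizer has more reason to shrink the weights.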
Batch Size
Batch size refers to the number of samples that are passed through the network at one time. Larger batches generally let the network complete each epoch faster, as long as the hardware can process them in parallel. If the batch size is too large, however, the machine may not have the resources to process all of the samples at once.
[!note] Batch Size != Epoch
batches in epoch = training set size / batch_size
Given `1000` images of dogs and a batch size of `10`, `10` images of dogs will be passed as a group, or batch, to the network at one time. We have `1000` images divided by a batch size of `10`, which equals `100` total batches, so it takes `100` batches to make up a full epoch.
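The arithmetic can be wrapped in a small helper. Rounding up when the training set size is not an exact multiple of the batch size is an assumption made here (the last batch would simply be smaller), and the function name is hypothetical.

```python
import math

def batches_per_epoch(training_set_size, batch_size):
    """Number of batches needed to pass the whole training set through once."""
    return math.ceil(training_set_size / batch_size)

print(batches_per_epoch(1000, 10))
print(batches_per_epoch(1005, 10))
```

For the dog example this gives 100 batches. With 1005 images, one extra, smaller batch would be needed.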
from tensorflow.keras import regularizers

model = Sequential([
Dense(units=16, input_shape=(1,), activation='relu'),
Dense(units=32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
Dense(units=2, activation='sigmoid')
])
model.fit(
x=scaled_train_samples,
y=train_labels,
validation_data=valid_set,
batch_size=10,
epochs=20,
shuffle=True,
verbose=2
)

Fine-tuning and transfer learning
Transfer learning occurs when we use knowledge that was grained from solving one problem and apply it to a new but related problem.
Fine-tuning is a process that takes a model that has already been trained for one given task and then tunes or tweaks the model to make it perform a second similar task.
With fine-tuning, the model does not need to re-learn all of its parameters (e.g., filters and weights) from scratch. Instead, we:
- load the original, already trained model
- replace the last layer, or add additional layers
- freeze the original layers so that only the layers we added are updated during training
- train the model so that it learns to predict on the new images.
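The steps above can be sketched in Keras roughly as follows. This is only an illustration under stated assumptions: `original_model` is a small stand-in for a model that has already been trained (in practice it would be loaded from disk or from a library of pre-trained models), and the layer sizes and the two-class output are made up for the example.

```python
from keras import Input
from keras.models import Sequential
from keras.layers import Dense

# Stand-in for a model already trained on the original task (hypothetical).
original_model = Sequential([
    Input(shape=(8,)),
    Dense(units=6, activation='relu'),
    Dense(units=6, activation='relu'),
    Dense(units=4, activation='softmax'),
])

# Copy over every layer except the last one, freezing each so that its
# weights are not updated when the new model is trained.
fine_tuned = Sequential([Input(shape=(8,))])
for layer in original_model.layers[:-1]:
    layer.trainable = False
    fine_tuned.add(layer)

# Add a new output layer for the new task; only this layer is trainable.
fine_tuned.add(Dense(units=2, activation='softmax'))

print([layer.trainable for layer in fine_tuned.layers])
```

After this, `fine_tuned` would be compiled and fit on the new data in the usual way, and only the weights of the newly added layer change.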
Batch Normalization
Normalization in general is performed before training our network on the data.
$$z=\frac{x-mean}{std}$$
Normalization and standardization have the same objective of transforming the data to put all the data points on the same scale.
If, during training, one of the weights becomes drastically larger than the other weights, the output from its corresponding neuron can become extremely large, and this imbalance will continue to cascade through the network, causing instability.
Batch normalization is applied to layers that we choose within our network. When batch norm is applied to a layer, the following steps are performed on that layer's output:
| Step | Expression | Description |
|---|---|---|
| 1 | $z=\frac{x-mean}{std}$ | Normalize output $x$ from activation function. |
| 2 | $z*g$ | Multiply normalized output $z$ by arbitrary parameter $g$. |
| 3 | $(z*g) + b$ | Add arbitrary parameter $b$ to resulting product $(z*g)$. |
$g$ and $b$ are trainable parameters: they are learned and optimized during the training process.
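The three steps in the table can be sketched in plain Python for a single batch of activation outputs. This is a simplified illustration, not how Keras implements the layer: the function name `batch_norm`, the sample values, and the small constant `eps` (added to the variance to avoid dividing by zero) are assumptions, and $g$ and $b$ are passed in as fixed numbers here rather than being learned.

```python
import math


def batch_norm(values, g=1.0, b=0.0, eps=1e-3):
    """Apply the three batch norm steps to one batch of outputs."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    std = math.sqrt(variance + eps)

    z = [(x - mean) / std for x in values]  # step 1: normalize
    scaled = [zi * g for zi in z]           # step 2: multiply by g
    return [s + b for s in scaled]          # step 3: add b


# One output is far larger than the others (hypothetical values).
activations = [1.0, 2.0, 3.0, 100.0]
normalized = batch_norm(activations, g=2.0, b=0.5)
print(normalized)
```

After step 1 the batch has a mean of zero, so the final outputs are centered on $b$ and their spread is controlled by $g$, regardless of how large the original outlier was.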
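from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
model = Sequential([
    Dense(units=16, input_shape=(1,5), activation='relu'),
    Dense(units=32, activation='relu'),
    # Normalize the output of the previous layer.
    BatchNormalization(axis=1),
    Dense(units=2, activation='softmax')
])
The axis argument (set to 1 here) selects the axis of the data to normalize, typically the features axis. The beta_initializer and gamma_initializer arguments set the initial values of the trainable parameters $b$ and $g$ respectively, and can be specified as needed.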
Glossary
| Term | Meaning |
|---|---|
| Forward pass | The pass through the network from input to output, computing predictions. |
| Epoch | A single pass of the entire dataset through the network during training. |
| Trainable parameters | The weights and biases in the network that are adjusted during training. |
| Backward pass | The process of propagating the loss backward through the network to compute gradients (via backpropagation). |
| Batch normalization | A technique to normalize layer inputs within a batch, improving training speed and stability. |
| Fine tuning | Adjusting a pre-trained model on a new task or dataset, typically with smaller updates. |
| Bias | A learnable parameter per neuron that shifts the activation threshold, increasing model flexibility. |
| Overfitting | When a model learns the training data too well, including noise, and fails to generalize to new data. |
| Underfitting | When a model fails to capture the underlying patterns in the training data, performing poorly overall. |
| Batch size | The number of samples processed in one forward and backward pass during training. |
| Convolution | A sliding filter operation in CNNs to extract local features (e.g., edges) from input data. |
| Pooling | A downsampling operation (e.g., max pooling) in CNNs to reduce spatial dimensions and retain key features. |
| Feature map | The output of a convolution or pooling layer, representing detected features in the input. |
| Flattening | Converting a multi-dimensional feature map into a 1D vector for fully connected layers in CNNs. |
| Activation function | A function (e.g., ReLU, sigmoid) that introduces non-linearity, determining neuron output. |
| Loss function | A measure of error between predicted and true values, guiding model optimization (e.g., MSE, cross-entropy). |
| Gradient descent | An optimization algorithm to minimize the loss by adjusting weights using gradients. |
| Stochastic Gradient Descent (SGD) | A variant of gradient descent using random batches or single samples for faster updates. |
| Learning rate | The step size in gradient descent, controlling how much weights are adjusted per update. |
| Dropout | A regularization technique that randomly deactivates neurons during training to prevent overfitting. |
| Kernel/Filter | A small matrix in CNNs used in convolution to detect specific patterns (e.g., edges, textures). |
| Stride | The step size a filter moves during convolution in a CNN. |
| Padding | Adding borders (e.g., zeros) to input data in CNNs to control output size after convolution. |
| Regularization | Techniques (e.g., L2, dropout) to prevent overfitting by penalizing complex models. |
| Hyperparameter | A user-defined setting (e.g., learning rate, batch size) that controls the training process. |
| Transfer learning | Using a pre-trained model (e.g., on ImageNet) as a starting point for a new task. |
| Data augmentation | Artificially expanding a dataset (e.g., rotating images) to improve model robustness. |