Implement Logistic Regression with Deep Learning Mindset
In this exercise, we will implement a simple Logistic Regression model with a neural network mindset (forward and backward propagation).
Logistic Regression is a supervised learning classification algorithm used to predict the probability of a target variable. In binary classification, there are only two possible classes.
For this implementation, we use Logistic Regression to classify cat and dog pictures (1 - cat vs. 0 - dog).
Data Processing
We use a subset of the “Dogs vs. Cats” dataset available on Kaggle, which contains 25,000 images.
- For the purpose of this exercise, we only use 500 images (250 cats and 250 dogs) as training data to reduce the training time.
- The validation set has 200 images (100 cats and 100 dogs).
# Download the dataset
%%capture
!wget --no-check-certificate \
https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
-O /tmp/cats_and_dogs_filtered.zip
# Import all the packages
import os
import zipfile
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import copy
# Extract the dataset
local_zip = '/tmp/cats_and_dogs_filtered.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp')
zip_ref.close()
Get all the dog and cat directories.
base_dir = '/tmp/cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
# Directory with our training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')
# Directory with our training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')
# Directory with our validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')
# Directory with our validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')
The images will be converted to numpy arrays:
- Each array has dimensions (number of images, width in pixels, height in pixels, number of channels).
- All images will be resized to 128x128 pixels. This is important: keeping the dimensions consistent eliminates shape bugs when feeding the images into the model.
def img_to_array(folder_path, resize_shape = None, n_images = 1000):
    """
    Convert images to a numpy array with shape (num_images, num_px_width, num_px_height, num_channel)
    Arguments:
    folder_path -- path to folder that contains images
    resize_shape -- resize images to the desired shape
    n_images -- number of images to be converted
    Returns:
    arr -- numpy array with shape (num_images, num_px_width, num_px_height, num_channel)
    """
    file_names = os.listdir(folder_path)
    img_array = []
    for i in range(len(file_names)):
        file_name = file_names[i]
        # Take only n_images for each type
        if i >= n_images:
            break
        # Convert to RGB so every image has 3 channels and the arrays stack consistently
        img = Image.open(os.path.join(folder_path, file_name)).convert('RGB')
        if resize_shape is not None:
            img = img.resize(resize_shape)
        img_array.append(np.asarray(img))
    # float64 is portable (np.float128 is not available on all platforms) and precise enough here
    return np.asarray(img_array, dtype=np.float64)
# Resize images to the same size (128x128)
resize = (128,128)
# Convert images to numpy array
train_cat_images = img_to_array(train_cats_dir, resize, 250)
train_dog_images = img_to_array(train_dogs_dir, resize, 250)
val_cat_images = img_to_array(validation_cats_dir, resize, 100)
val_dog_images = img_to_array(validation_dogs_dir, resize, 100)
# Create labels for dataset, 1 for cat and 0 for dog
train_cat_labels = np.ones((train_cat_images.shape[0],1))
train_dog_labels = np.zeros((train_dog_images.shape[0],1))
val_cat_labels = np.ones((val_cat_images.shape[0],1))
val_dog_labels = np.zeros((val_dog_images.shape[0],1))
Let’s take a look at our training set to get a better sense of what the dog and cat images look like. We will display 8 images of each class.
# Create figure
fig = plt.figure(figsize=(16,16))
# Configure display parameters
rows = 4
cols = 4
total = rows*cols
for i in range(int(total/2)):
fig.add_subplot(rows, cols, i + 1)
plt.imshow(train_cat_images[i].astype(np.uint8))
plt.title("Cat")
plt.axis('off')
fig.add_subplot(rows, cols, int(total/2) + i + 1)
plt.imshow(train_dog_images[int(total/2) + i].astype(np.uint8))
plt.title("Dog")
plt.axis('off')
Check the shapes of the datasets again.
print("train_cats_images shape", train_cat_images.shape)
print("train_cat_labels shape", train_cat_labels.shape)
print()
print("train_dogs_images shape", train_dog_images.shape)
print("train_dog_labels shape", train_dog_labels.shape)
print()
print("val_cats_images shape", val_cat_images.shape)
print("val_cat_labels shape", val_cat_labels.shape)
print()
print("val_dogs_images shape", val_dog_images.shape)
print("train_dog_labels shape", train_dog_labels.shape)
train_cats_images shape (250, 128, 128, 3)
train_cat_labels shape (250, 1)
train_dogs_images shape (250, 128, 128, 3)
train_dog_labels shape (250, 1)
val_cats_images shape (100, 128, 128, 3)
val_cat_labels shape (100, 1)
val_dogs_images shape (100, 128, 128, 3)
val_dog_labels shape (100, 1)
Next, we put the cat and dog datasets together and shuffle them.
train_set_x = np.concatenate((train_cat_images, train_dog_images))
train_set_y = np.concatenate((train_cat_labels, train_dog_labels))
val_set_x = np.concatenate((val_cat_images, val_dog_images))
val_set_y = np.concatenate((val_cat_labels, val_dog_labels))
from sklearn.utils import shuffle
# Shuffle the joint dataset
train_set_x, train_set_y = shuffle(train_set_x, train_set_y, random_state=0)
val_set_x, val_set_y = shuffle(val_set_x, val_set_y, random_state=0)
Each image in the training and validation datasets has three channels (red, green, and blue). We flatten the images so that each pixel value is a feature; each flattened image is stacked into a single column vector, as shown in the image above.
# Flatten images and set the shape to (n_features, m_set)
X_train = train_set_x.reshape(train_set_x.shape[0], -1).T
Y_train = train_set_y.T
X_val = val_set_x.reshape(val_set_x.shape[0], -1).T
Y_val = val_set_y.T
print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("X_val shape: ", X_val.shape)
print("Y_val shape: ", Y_val.shape)
X_train shape: (49152, 500)
Y_train shape: (1, 500)
X_val shape: (49152, 200)
Y_val shape: (1, 200)
We standardize the datasets before feeding them to the model. For image data, all pixel values are in the range [0, 255], so we only need to divide by 255 (the maximum value of a pixel).
# Standardize by dividing the dataset by 255
X_train = X_train / 255.
X_val = X_val / 255.
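As a quick optional check (a small sketch, not part of the original pipeline), the standardized values should now lie in the range [0, 1]:
# Sanity check: standardized pixel values should fall in [0, 1]
print(X_train.min(), X_train.max())
print(X_val.min(), X_val.max())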
Logistic Regression Model Implementation
Model Architecture
We will build a Logistic Regression model using a Neural Network mindset. This is a simple neural network, shown below.
The figure above shows the mathematical expression of the algorithm for one example $x^{(i)}$. We have
\[\begin{align} z^{(i)}&=w^Tx^{(i)}+b\\ \widehat{y}^{(i)}&=a^{(i)}=\sigma\left(z^{(i)}\right)\qquad(\sigma=\text{sigmoid})\\ \mathcal{L}\left(a^{(i)},y^{(i)}\right)&=-y^{(i)}\log\left(a^{(i)}\right)-\left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right)\\ \end{align}\]with
\[\begin{align} &z^{(i)}&&\text{a linear combination of the input with weights $w$ and bias $b$}\\ &\widehat{y}^{(i)},a^{(i)}&&\text{the result of applying the sigmoid function to $z^{(i)}$ (the estimated value)}\\ &\mathcal{L}\left(a^{(i)},y^{(i)}\right)&&\text{the loss between the estimated value and the ground truth}\\ \end{align}\]The cost $J$ is computed by averaging the losses over the entire training set:
\[J=\frac{1}{m}\sum_{i=1}^m\mathcal{L}\left(a^{(i)},y^{(i)}\right)\]We will discuss the cost function in more detail in the forward propagation section below.
Building Parts of The Algorithm
1. Sigmoid Function
In machine learning, the term sigmoid function normally refers specifically to the logistic function, also called the logistic sigmoid function.
All sigmoid functions map the entire real line into a small range such as [0, 1] or [-1, 1], so one common use of a sigmoid function is to convert a real value into one that can be interpreted as a probability.
\[\sigma(z)=\frac{1}{1 + e^{-z}}\]
# sigmoid
def sigmoid(z):
"""
Compute the sigmoid of z
Arguments:
    z -- A scalar or numpy array
Return:
s -- sigmoid of z
"""
return 1/(1 + np.exp(-z))
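A quick sanity check on illustrative values: the sigmoid of 0 is exactly 0.5, and large positive or negative inputs saturate towards 1 and 0.
# Quick check of the sigmoid on a few illustrative values
print(sigmoid(0))                             # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approximately [0., 0.5, 1.]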
2. Parameters Initialization
We will initialize the weights $w$ as a vector of zeros with shape (number_of_features, 1) and the bias $b$ to zero.
# Initialize parameters
def parameters_initialize(dim):
"""
Create a vector of zeroes with the shape (dim, 1) for parameter w and initializes b to 0
Argument:
dim -- size of the w vector
Returns:
w -- initialized vector of zeroes with the shape (dim, 1)
b -- initialized scalar (bias) with float type
"""
w = np.zeros((dim, 1))
b = 0.0
return w, b
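For example (the dimension here is illustrative), initializing parameters for 4 features gives a (4, 1) vector of zeros and a zero bias:
# Example: initialize parameters for 4 hypothetical features
w_example, b_example = parameters_initialize(4)
print(w_example.shape, b_example)  # (4, 1) 0.0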
3. Forward and Backward Propagation
3.1 Forward Propagation
In forward propagation, we will implement these mathematical expressions:
\[\begin{align} z&=w^TX+b\\ \widehat{y}&=a=\sigma\left(z\right)\\ \end{align}\]In logistic regression, we don’t use the L2 / MSE (Mean Squared Error) loss as in linear regression, because the model’s predictions are non-linear (a sigmoid transformation). If we used MSE, $\mathcal{L}\left(a^{(i)},y^{(i)}\right)=\frac{1}{2}(\widehat{y}-y)^2$, the resulting cost would be a non-convex function with many local minima, and gradient descent might not find the global minimum.
Instead, we use a cost function called Cross-entropy as shown below:
\[\mathcal{L}\left(\widehat{y},y\right)=-\left(y\log\widehat{y}+(1-y)\log(1-\widehat{y})\right)\]We want to make the cost as small as possible so that $\widehat{y}$ and $y$ are close together.
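For intuition, take a single example with $y=1$ (the numbers are illustrative): a confident correct prediction gives a small loss, while a confident wrong prediction is penalized heavily.
\[\begin{align} \widehat{y}=0.9&:\quad\mathcal{L}=-\log(0.9)\approx 0.105\\ \widehat{y}=0.1&:\quad\mathcal{L}=-\log(0.1)\approx 2.303 \end{align}\]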
\[\begin{cases} \begin{align} &y=1&&\mathcal{L}\left(\widehat{y},y\right)=-\log\widehat{y}&&&\leftarrow\widehat{y}\text{ needs to be large}\\ &y=0&&\mathcal{L}\left(\widehat{y},y\right)=-\log(1-\widehat{y})&&&\leftarrow\widehat{y}\text{ needs to be small}\\ \end{align} \end{cases}\]The cost is computed by averaging the losses over the entire training set:
\[\begin{align} J(w,b)&=\frac{1}{m}\sum_{i=1}^m\mathcal{L}\left(\widehat{y}^{(i)},y^{(i)}\right)\\ &=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log\widehat{y}^{(i)}+(1-y^{(i)})\log(1-\widehat{y}^{(i)})\right] \end{align}\]
3.2 Backward Propagation
We have the cost function from forward propagation. To minimize it, we use Gradient Descent to update $w$ and $b$ until we find the global minimum (the minimum cost).
To do backward propagation, we’re going to find the partial derivative of $\mathcal{L}$ w.r.t $w$ and $b$. It means we need to find:
\[\frac{\partial{\mathcal{L}}}{\partial{w}}\quad\text{and}\quad\frac{\partial{\mathcal{L}}}{\partial{b}}\]By the Chain Rule in Calculus
\[\begin{align} \frac{\partial{\mathcal{L}}}{\partial{w}}&=\frac{\partial{\mathcal{L}}}{\partial{a}}\times\frac{\partial{a}}{\partial{z}}\times\frac{\partial{z}}{\partial{w}}\\ \\ \frac{\partial{\mathcal{L}}}{\partial{b}}&=\frac{\partial{\mathcal{L}}}{\partial{a}}\times\frac{\partial{a}}{\partial{z}}\times\frac{\partial{z}}{\partial{b}}\\ \end{align}\]The cost function is
\[\mathcal{L}=-y\log(a)-(1-y)\log(1-a)\]So
\[\begin{align} \frac{\partial{\mathcal{L}}}{\partial{a}}&=-y\cdot\frac{1}{a}-(1-y)\cdot\frac{1}{1-a}\cdot(-1)\\ &=\frac{-y}{a}+\frac{1-y}{1-a}\\ &=\frac{-y(1-a)+a(1-y)}{a(1-a)}\\ &=\frac{a-y}{a(1-a)}\qquad(1) \end{align}\]We know that the derivative of the sigmoid is:
\[\frac{\partial{\sigma(z)}}{\partial{z}}=\sigma(z)(1-\sigma(z))\\ \implies \frac{\partial{a}}{\partial{z}}=a(1-a)\qquad(2)\\\] \[\begin{align} \frac{\partial{z}}{\partial{w}}&=x\qquad(3)\\ \frac{\partial{z}}{\partial{b}}&=1\qquad(4)\\ \end{align}\]Now we can calculate the partial derivative of $\mathcal{L}$ w.r.t $w$ and $b$
\[\begin{align} \color{red}{\frac{\partial{\mathcal{L}}}{\partial{w}}}&=(1)\times(2)\times(3)\\ &=\frac{a-y}{a(1-a)}\times a(1-a)\times x\\ &=\color{red}{(a-y)x}\\ \\ \color{red}{\frac{\partial{\mathcal{L}}}{\partial{b}}}&=(1)\times(2)\times(4)\\ &=\frac{a-y}{a(1-a)}\times a(1-a)\times 1\\ &=\color{red}{(a-y)} \end{align}\]To perform an update for $w$ and $b$ we do the following steps
\[w:=w-\alpha\frac{\partial{\mathcal{L}}}{\partial{w}}\\ b:=b-\alpha\frac{\partial{\mathcal{L}}}{\partial{b}}\\ (\alpha\text{ is learning rate})\]We have all we need to implement forward and backward propagation.
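For the vectorized implementation below, the $m$ training examples are stacked as columns of $X$ and $A=\sigma(w^TX+b)$ is the row vector of activations, so the gradients of the cost $J$ are simply the per-example gradients averaged over the set:
\[\frac{\partial{J}}{\partial{w}}=\frac{1}{m}X(A-Y)^T\qquad\qquad\frac{\partial{J}}{\partial{b}}=\frac{1}{m}\sum_{i=1}^m\left(a^{(i)}-y^{(i)}\right)\]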
# Forward and backward propagation
def propagate(X, Y, w, b):
"""
Implement cost function and its gradient for forward and backward propagation
Arguments:
X -- training data with dim (num_px_width * num_px_height *num_channel, number of examples)
Y -- true label vector (1 if cat, 0 if dog) with dim (1, number of example)
w -- weights, a numpy array with size (num_px_width * num_px_height *num_channel, 1)
b -- bias, a scalar
Return:
cost -- log-likelihood cost
dw -- gradient of the loss w.r.t w (same shape as w)
db -- gradient of the loss w.r.t b (same shape as b)
"""
m = X.shape[1]
# FORWARD PROPAGATION
A = sigmoid(np.dot(w.T, X) + b)
# Add a very small epsilon to avoid log(0)
epsilon = 1e-5
cost = -1/m * np.sum((Y*np.log(A + epsilon) + (1 - Y)*np.log(1 - A + epsilon)))
    # BACKWARD PROPAGATION
# divide by m because the cost J was computed by averaging the entire set.
dw = 1/m * np.dot(X, (A - Y).T)
db = 1/m * np.sum(A - Y)
cost = np.squeeze(np.array(cost))
grads = {"dw":dw, "db": db}
return cost, grads
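Before moving on, here is a quick sanity check of propagate on made-up numbers (a small sketch, not part of the dataset); the finite-difference estimate of the cost’s slope with respect to $b$ should be close to the analytic db returned by the function.
# Tiny synthetic example (hypothetical values) to exercise propagate
w_check = np.array([[1.0], [2.0]])
b_check = 2.0
X_check = np.array([[1.0, 2.0, -1.0], [3.0, 4.0, -3.2]])
Y_check = np.array([[1, 0, 1]])
cost_check, grads_check = propagate(X_check, Y_check, w_check, b_check)
print("cost:", cost_check)
print("dw:", grads_check["dw"])
print("db:", grads_check["db"])
# Finite-difference estimate of dJ/db; it should roughly match grads_check["db"]
eps = 1e-6
cost_plus, _ = propagate(X_check, Y_check, w_check, b_check + eps)
cost_minus, _ = propagate(X_check, Y_check, w_check, b_check - eps)
print("db (finite difference):", (cost_plus - cost_minus) / (2 * eps))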
The propagate function computes the cost and the gradients $\partial{w}$ and $\partial{b}$ for one pass through the entire training set.
To find the global minimum of the cost, we loop for $n$ iterations and update $w$ and $b$ using $\partial{w}$ and $\partial{b}$; the optimize function below runs propagate over the training examples $n$ times.
def optimize(X, Y, w, b, n_iterations = 100, learning_rate = 0.001, print_log=False):
"""
Update w and b using gradient descent
Arguments:
X -- training data with dim (num_px_width * num_px_height *num_channel, number of examples)
Y -- true label vector (1 if cat, 0 if dog) with dim (1, number of example)
w -- weights, a numpy array with size (num_px_width * num_px_height *num_channel, 1)
b -- bias, a scalar
n_iterations -- number of iterations through the data examples
learning_rate -- step size for each iteration
    print_log -- print the cost every 500 iterations
Returns:
params -- dictionary contains w and b
grads -- dictionary contains dw and db w.r.t to the cost function
costs -- list all costs during training
"""
# Make a deep copy before update w and b
w = copy.deepcopy(w)
b = copy.deepcopy(b)
costs = []
for i in range(n_iterations):
cost, grads = propagate(X, Y, w, b)
# Retrieve dw and db
dw = grads['dw']
db = grads['db']
# Update parameters w and b
w = w - learning_rate*dw
b = b - learning_rate*db
        # Record and print the cost every 500 iterations
if i%500 == 0:
costs.append(cost)
if print_log:
print ("Iteration: %i - Cost: %f" %(i, cost))
params = {
'w': w,
'b': b
}
grads = {
'dw': dw,
'db': db,
}
return params, grads, costs
So now we use the optimized $w$ and $b$ to get predictions from the model.
def predict(X, w, b):
"""
    Predict data X with label 1 (cat) or 0 (dog) using learned parameters w and b
Arguments:
X -- data with dim (num_px_width * num_px_height *num_channel, number of examples)
w -- weights, a numpy array with size (num_px_width * num_px_height *num_channel, 1)
b -- bias, a scalar
Return:
Y_predictions -- a numpy array contains predictions for the data X
"""
A = sigmoid(np.dot(w.T, X) + b)
Y_predictions = A > 0.5
return Y_predictions
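Here is a minimal end-to-end sketch on a toy, linearly separable dataset (made-up values, not the cat/dog data) showing optimize and predict working together:
# Toy dataset: 2 features, 4 examples; positive examples have positive feature values
X_toy = np.array([[0.5, -1.5, 1.0, -0.5],
                  [1.0, -2.0, 0.5, -1.0]])
Y_toy = np.array([[1, 0, 1, 0]])
w_toy, b_toy = parameters_initialize(X_toy.shape[0])
params, grads, costs = optimize(X_toy, Y_toy, w_toy, b_toy, n_iterations=2000, learning_rate=0.1)
# Boolean predictions (True corresponds to label 1); the toy labels should be recovered
print(predict(X_toy, params['w'], params['b']))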
Build a Logistic Regression Model by merging all functions
1. Merging all functions
Now that we have the overall structure of the model, it’s time to put all of the functions implemented in the building parts together.
class LogisticRegressionModel():
"""
    Logistic Regression model using forward and backward propagation.
"""
def __init__(self):
# w: weights, a numpy array with size (num_px_width * num_px_height *num_channel, 1)
self.w = None
# b: bias, a scalar
self.b = None
self.costs = []
self.learning_rate = 0.001
self.n_iterations = 1000
self.print_log = False
def parameters_initialize(self, dim):
"""
Create a vector of zeroes with the shape (dim, 1) for parameter w and initializes b to 0
Argument:
dim -- size of the w vector
Returns:
w -- initialized vector of zeroes with the shape (dim, 1)
b -- initialized scalar (bias) with float type
"""
w = np.zeros((dim, 1))
b = 0.0
return w, b
def propagate(self, X, Y):
"""
Implement cost function and its gradient for forward and backward propagation
Arguments:
X -- training data with dim (num_px_width * num_px_height *num_channel, number of examples)
Y -- true label vector (1 if cat, 0 if dog) with dim (1, number of example)
Returns:
cost -- log-likelihood cost
dw -- gradient of the loss w.r.t w (same shape as w)
db -- gradient of the loss w.r.t b (same shape as b)
"""
m = X.shape[1]
# Forward propagation
A = sigmoid(np.dot(self.w.T, X) + self.b)
# Add a very small epsilon to avoid log(0)
epsilon = 1e-5
cost = -1/m * np.sum((Y*np.log(A + epsilon) + (1 - Y)*np.log(1 - A + epsilon)))
# Backward propagation
dw = 1/m * np.dot(X, (A - Y).T)
db = 1/m * np.sum(A - Y)
cost = np.squeeze(np.array(cost))
grads = {"dw":dw, "db": db}
return cost, grads
def optimize(self, X, Y):
"""
Update w and b using gradient descent
Arguments:
X -- training data with dim (num_px_width * num_px_height *num_channel, number of examples)
Y -- true label vector (1 if cat, 0 if dog) with dim (1, number of example)
"""
for i in range(self.n_iterations):
cost, grads = self.propagate(X, Y)
# Retrieve dw and db
dw = grads['dw']
db = grads['db']
# Update parameters w and b
self.w = self.w - self.learning_rate*dw
self.b = self.b - self.learning_rate*db
            # Record and print the cost every 500 iterations
if i%500 == 0:
self.costs.append(cost)
if self.print_log:
print ("Iteration: %i - Cost: %f" %(i, cost))
def predict(self, X):
"""
        Predict data X with label 1 (cat) or 0 (dog) using learned parameters w and b
Arguments:
X -- data with dim (num_px_width * num_px_height *num_channel, number of examples)
Return:
Y_predictions -- a numpy array contains predictions for the data X
"""
A = sigmoid(np.dot(self.w.T, X) + self.b)
Y_predictions = A > 0.5
return Y_predictions
def train(self, X_train, Y_train, n_iterations = 2000, learning_rate = 0.001, print_log=False):
"""
        Build the Logistic Regression model
Arguments:
X_train -- training data with dim (num_px_width * num_px_height *num_channel, number of examples)
Y_train -- true training label vector (1 if cat, 0 if dog) with dim (1, number of example)
n_iterations -- number of iterations through the data examples
learning_rate -- step size for each iteration
print_log -- print the loss every n steps
"""
# Initialize parameter w and b
self.w, self.b = self.parameters_initialize(X_train.shape[0])
self.learning_rate = learning_rate
self.n_iterations = n_iterations
self.print_log = print_log
        # Run forward and backward propagation to optimize w and b
self.optimize(X_train, Y_train)
def evaluate(self, X, Y):
preds = self.predict(X)
result = {"accuracy": (100 - np.mean(np.abs(preds - Y))*100)}
return result
2. Train and Evaluate Model
Let’s train the model on the dataset we processed in the previous section.
model = LogisticRegressionModel()
model.train(X_train, Y_train, n_iterations= 3000, learning_rate=0.001, print_log=True)
Iteration: 0 - Cost: 0.693127
Iteration: 500 - Cost: 0.623720
Iteration: 1000 - Cost: 0.418507
Iteration: 1500 - Cost: 0.346329
Iteration: 2000 - Cost: 0.306877
Iteration: 2500 - Cost: 0.275673
Evaluate the model on both the training set and the validation set.
print(model.evaluate(X_val, Y_val))
print(model.evaluate(X_train, Y_train))
{'accuracy': 81.42857142857143}
{'accuracy': 97.0}
Training accuracy is 97% and validation accuracy is 81%.
- The model is just a simple neural network, but it has enough capacity to fit the training data.
- The model’s accuracy on the training data is close to 100%, which means the model is overfitting. When evaluated on the validation data, the accuracy drops to 81%.
Let’s plot the learning curve using the costs recorded during training.
plt.plot(model.costs)
plt.ylabel('cost')
plt.xlabel('iterations (per 500)')
plt.title("Learning rate =" + str(model.learning_rate))
plt.show()
We can see that the cost is still declining, and we could fit the training data even better by running more iterations. This happens because the model keeps adjusting its parameters to match this specific training set. But when evaluated on unseen data, accuracy drops; this is what we call overfitting.
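One simple way to watch this happen is to track the validation cost alongside the training cost (a minimal sketch that reuses the parameters_initialize and propagate helpers above; the learning rate and iteration count mirror the training run, and the variable names are illustrative). If the validation cost flattens or starts rising while the training cost keeps falling, the model has begun to fit noise in the training set.
# Track training and validation cost side by side
w_lc, b_lc = parameters_initialize(X_train.shape[0])
train_costs, val_costs = [], []
for i in range(3000):
    cost, grads = propagate(X_train, Y_train, w_lc, b_lc)
    # Gradient descent update, same learning rate as the training run above
    w_lc = w_lc - 0.001 * grads['dw']
    b_lc = b_lc - 0.001 * grads['db']
    if i % 500 == 0:
        # Cost on the validation set with the current parameters
        val_cost, _ = propagate(X_val, Y_val, w_lc, b_lc)
        train_costs.append(cost)
        val_costs.append(val_cost)
plt.plot(train_costs, label='training cost')
plt.plot(val_costs, label='validation cost')
plt.xlabel('iterations (per 500)')
plt.ylabel('cost')
plt.legend()
plt.show()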