ML Basics

Basic Theory of ML modules

This notebook contains a roadmap for learning the theoretical basis of the main ML modules and approaches, along with some exercises for evaluating yourself. We hope you enjoy it and learn :)

First of all, please watch the first four sessions of Andrew Ng's famous ML course. Then try to answer the following questions.

Linear Regression

Consider the linear regression $\hat{y}=w^T x$ on $S=\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ with loss function $J(w)=\sum_{i=1}^{m} (y^{(i)}-\hat{y}^{(i)})^2$.

  1. Derive a closed-form expression for $\underset{w}{argmin}$ $J(w)$ by setting the derivative of $J(w)$ w.r.t. $w$ to 0.
  2. When does the formula from the previous part fail? Derive $\underset{w}{argmin}$ $J(w)+\lambda\lVert w \rVert^2$ and describe how this new formula solves the problem (a small NumPy check appears after this list).
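A minimal NumPy sketch for checking both parts numerically (the data, dimensions, and $\lambda$ below are arbitrary illustrative choices, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 3
X = rng.normal(size=(m, d))                  # rows are x^{(i)T}
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Part 1: ordinary least squares, w* = (X^T X)^{-1} X^T y.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Part 2: ridge regression, w* = (X^T X + lambda*I)^{-1} X^T y.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# When X^T X is singular (e.g. a duplicated feature), the part-1 system fails
# or is numerically unstable, while the ridge system stays invertible.
X_sing = np.hstack([X, X[:, :1]])            # last column duplicates the first
w_ridge_sing = np.linalg.solve(X_sing.T @ X_sing + lam * np.eye(d + 1), X_sing.T @ y)

print(w_ols, w_ridge, w_ridge_sing, sep="\n")
```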

Logistic Regression

Answer the following questions:

  1. Prove that $$\frac{d\sigma(a)}{da}=\sigma(a)(1-\sigma(a)),$$ where $\sigma(a)=\frac{1}{1+e^{-a}}$ is the sigmoid function.

  2. In logistic regression, we have $p(C_1|x)=\sigma(w^T x)$. Compute the negative log-likelihood for dataset $\{(x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})\}$.

  3. Show that the gradient of the previous part w.r.t. $w$ is $\sum_{i=1}^{n}(\hat{y}^{(i)}-y^{(i)})x^{(i)}$. Compare this with the MSE regression gradient.

  4. Show that $$\log{\frac{p(C_1|x)}{p(C_0|x)}}=w_1^T x+w_1'.$$ Generalize this to $k$ classes and derive the softmax formula.

  5. (Optional) If $$L=-\sum_i y_i\log{p_i},$$ where $p_i=p(C_i|x)$ is the output of the softmax, show that $\nabla_O L(x)=p-y$, where $y$ is the one-hot label vector for $x$, $p$ is the softmax output vector, and $$o_i=w_i^T x + w_i', \quad 1\leq i\leq k.$$
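A minimal NumPy sketch for sanity-checking the derivations above: it evaluates the negative log-likelihood from question 2 and the analytic gradient from question 3, then compares the gradient against finite differences (the data and dimensions are arbitrary assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (sigmoid(X @ w_true) > 0.5).astype(float)

def nll(w):
    # Negative log-likelihood (question 2).
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_nll(w):
    # Analytic gradient (question 3): sum_i (yhat_i - y_i) x_i.
    p = sigmoid(X @ w)
    return X.T @ (p - y)

# Finite-difference check of the analytic gradient.
w0 = rng.normal(size=d)
eps = 1e-6
num = np.array([(nll(w0 + eps * e) - nll(w0 - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.max(np.abs(num - grad_nll(w0))))  # should be very small (~1e-6 or less)
```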

(optional) In logistic regression for $K$ classes, the posterior probability is computed in the following way $$\begin{equation} \left\{ \begin{array}{@{}ll@{}} P(Y=k|X=x)=\frac{exp(w_k^T x)}{1+\sum_{l=1}^{K-1}exp(w_l^T x)}, & (k=1, ..., K-1) \\ P(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}exp(w_l^T x)} \end{array}\right. \end{equation}$$ For simplicity, consider $w_K=0$.

  1. How many parameters need to be estimated? What are they?

  2. Simplify the following log-likelihood for $n$ training samples $\{(x_1, y_1), ..., (x_n, y_n)\}$ $$L(w_1, ..., w_{K-1})=\sum_{i=1}^{n}\ln{P(Y=y_i|X=x_i)}$$

  3. Compute and simplify the gradient of $L$ w.r.t. each of $w_k$s.

  4. Consider the following objective function. Compute the gradient of $f$ w.r.t. each of $w_k$s.

$$f(w_1, ..., w_{K-1})=L(w_1, ..., w_{K-1})-\frac{\lambda}{2}\sum_{l=1}^{K-1}\lVert w_l\rVert_2^2$$
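The posterior formula above can be checked directly in code. Below is a small NumPy sketch with $w_K=0$ (the dimensions and values are arbitrary, chosen only for illustration):

```python
import numpy as np

def posterior(W, x):
    """P(Y=k | X=x) for k = 1..K, with w_K fixed to 0 (W holds w_1..w_{K-1} as rows)."""
    scores = W @ x                           # shape (K-1,)
    denom = 1.0 + np.sum(np.exp(scores))
    probs = np.append(np.exp(scores), 1.0) / denom
    return probs                             # length K, sums to 1

rng = np.random.default_rng(2)
K, d = 4, 3
W = rng.normal(size=(K - 1, d))
x = rng.normal(size=d)
print(posterior(W, x), posterior(W, x).sum())
```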

Backpropagation

The source for this topic is:

  • SPML course in Fall 2021
    • session 3 from min 100
    • session 4
    • session 5 up to min 45

Questions

Consider the following network: $$z_1=W_1 x^{(i)}+b_1$$ $$a_1=ReLU(z_1)$$ $$z_2=W_1 x'^{(i)}+b_1$$ $$a_2=ReLU(z_2)$$ $$a=a_1-a_2$$ $$z_3=W_2 a+b_2$$ $$\hat{y}^{(i)}=\sigma(z_3)$$ $$L^{(i)}=y^{(i)}\log{\hat{y}^{(i)}}+(1-y^{(i)})\log{(1-\hat{y}^{(i)})}$$ $$J=-\frac{1}{m}\sum_{i=1}^{m}L^{(i)}$$ where the inputs are $x^{(i)}\in\mathbb{R}^{d_x\times1}, x'^{(i)}\in\mathbb{R}^{d_x\times 1}$ and the output is $\hat{y}^{(i)}\in(0,1)$ (the label is $y^{(i)}\in\{0, 1\}$). Also, $a\in\mathbb{R}^{d_a\times 1}$. Compute the following (a numerical sketch for checking your answers appears after the list):

  1. $\frac{\partial J}{\partial z_3}$

  2. $\frac{\partial z_3}{\partial a}$

  3. $\frac{\partial a}{\partial z_1}$ and $\frac{\partial a}{\partial z_2}$

  4. $\frac{\partial z_2}{\partial W_1}$ and $\frac{\partial z_1}{\partial W_1}$

  5. $\frac{\partial J}{\partial W_1}$

  6. Write down the formula for updating all weights based on gradient descent.
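A minimal NumPy sketch of the forward pass above, the chain-rule gradient from questions 1–5, and a finite-difference check of one entry of $\partial J/\partial W_1$ (all dimensions and data are arbitrary assumptions, used only to verify the algebra):

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_a, m = 5, 4, 10
W1 = rng.normal(size=(d_a, d_x)); b1 = rng.normal(size=(d_a, 1))
W2 = rng.normal(size=(1, d_a));   b2 = rng.normal(size=(1, 1))
X  = rng.normal(size=(d_x, m))            # columns are x^{(i)}
Xp = rng.normal(size=(d_x, m))            # columns are x'^{(i)}
Y  = rng.integers(0, 2, size=(1, m)).astype(float)

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2):
    z1 = W1 @ X + b1;  a1 = np.maximum(z1, 0)
    z2 = W1 @ Xp + b1; a2 = np.maximum(z2, 0)
    a  = a1 - a2
    z3 = W2 @ a + b2
    yhat = sigmoid(z3)
    J = -np.mean(Y * np.log(yhat) + (1 - Y) * np.log(1 - yhat))
    return J, (z1, z2, a, z3, yhat)

# Analytic gradient w.r.t. W1, following questions 1-5.
J, (z1, z2, a, z3, yhat) = forward(W1, b1, W2, b2)
dJ_dz3 = (yhat - Y) / m                    # question 1
dJ_da  = W2.T @ dJ_dz3                     # question 2
dJ_dz1 = dJ_da * (z1 > 0)                  # question 3 (ReLU mask)
dJ_dz2 = -dJ_da * (z2 > 0)
dJ_dW1 = dJ_dz1 @ X.T + dJ_dz2 @ Xp.T      # questions 4-5

# Finite-difference check of one entry of dJ/dW1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
num = (forward(W1p, b1, W2, b2)[0] - forward(W1m, b1, W2, b2)[0]) / (2 * eps)
print(num, dJ_dW1[0, 0])                   # should match closely
```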

CNNs

CNNs are perhaps the most important module in image processing. They provide an inductive bias suited to extracting local features.

The source for this topic is:

  • SPML course in Fall 2021
    • session 5 from min 45
    • session 6

Questions

Answer the following questions:

  1. Describe the Sparsity of Connections property of CNNs.
  2. Describe the Parameter Sharing property of CNNs.
  3. Consider an input of shape $63\times 63\times 16$. If stride=2 and padding=0, compute the shape of the output if we have 32 $7\times 7$ kernels.
  4. Name three advantages of using CNN over MLP.
  5. Consider a CNN trained on ImageNet. Is the output probability of the network uniform over all classes if the input is an all-white picture? Why?
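For question 3 and the table below, the standard formulas are $n_{out}=\lfloor (n+2p-k)/s \rfloor + 1$ for the spatial size and $k\cdot k\cdot C_{in}\cdot C_{out}+C_{out}$ parameters per convolution layer. Here is a small helper sketch; the example numbers are arbitrary and not taken from the exercises:

```python
def conv_output_size(n, k, s=1, p=0):
    """Spatial output size for an n x n input, k x k kernel, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

def conv_params(k, c_in, c_out):
    """Weights plus biases of a convolution layer with k x k kernels."""
    return k * k * c_in * c_out + c_out

# Example: a 32x32x8 input through a 3x3 convolution with 16 filters, stride 1, padding 1.
print(conv_output_size(32, 3, s=1, p=1))   # -> 32
print(conv_params(3, 8, 16))               # -> 1168
```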

Complete the following table. Padding and stride are equal to 1 unless explicitly stated.

  • CONVx-N is an N-filter convolution layer with kernel height and width equal to x.
  • POOL-N is a MAX Pooling of the shape $N\times N$, with Stride=$N$, and Padding=$0$.
  • FC-N is a fully-connected layer with N neurons.
| Layer | Output Shape | # of Parameters |
|---|---|---|
| Input | 128$\times$128$\times$3 | 0 |
| CONV-9-32 | | |
| POOL-2 | | |
| CONV-5-64 | | |
| POOL-2 | | |
| CONV-5-64 | | |
| POOL-2 | | |
| FC-3 | | |

How many parameters would be needed if the fourth layer (CONV-5-64) were replaced with a fully-connected layer? What do you conclude?

If we use a GPU with 12 GB of RAM to run the following network, what is the maximum number of pictures we could have in a batch? (You should find the memory bottleneck.)

Input: 256 x 256

  • [64] Conv 3 x 3, s=1, p=1
  • [64] Conv 3 x 3, s=1, p=1
  • Pool 2 x 2, s=2, p=0
  • [128] Conv 3 x 3, s=1, p=1
  • [128] Conv 3 x 3, s=1, p=1
  • Pool 2 x 2, s=2, p=0
  • [256] Conv 3 x 3, s=1, p=1
  • [256] Conv 3 x 3, s=1, p=1
  • Pool 2 x 2, s=2, p=0
  • [512] Conv 3 x 3, s=1, p=1
  • [512] Conv 3 x 3, s=1, p=1
  • Pool 2 x 2, s=2, p=0
  • [512] Conv 3 x 3, s=1, p=1
  • [512] Conv 3 x 3, s=1, p=1
  • Pool 2 x 2, s=2, p=0
  • Flatten
  • FC (4096)
  • FC (4096)
  • FC (2)
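A rough way to locate the bottleneck is to add up the activation memory of each layer. The sketch below assumes float32 activations and a 3-channel input, and counts forward activations only (parameters, gradients, and framework overhead are ignored), so treat it as a lower bound per image:

```python
# Per-image activation memory for the network above (float32, forward pass only).
acts = []
h, c = 256, 3
acts.append(h * h * c)                    # input (3 channels assumed)
for block_channels in [64, 128, 256, 512, 512]:
    for _ in range(2):                    # two 3x3 convs, s=1, p=1: spatial size preserved
        c = block_channels
        acts.append(h * h * c)
    h //= 2                               # 2x2 max-pool, s=2
    acts.append(h * h * c)
acts += [4096, 4096, 2]                   # FC layers
bytes_per_image = 4 * sum(acts)
print(f"{bytes_per_image / 2**20:.1f} MiB per image (activations only)")
```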

Determine the receptive field of the neuron $(i, j)$ in the last convolution layer.
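The receptive field can be accumulated layer by layer with $r_{l}=r_{l-1}+(k_l-1)\cdot j_{l-1}$ and $j_l=j_{l-1}\cdot s_l$, where $j$ is the effective stride ("jump") of layer $l$'s input grid. A small helper sketch; the usage line is an arbitrary illustration, not the full network above:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs from the input to the layer of interest."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump   # each layer widens the field by (k-1) input-grid jumps
        jump *= s             # striding enlarges the jump of every later layer
    return r

# Example: two 3x3 convs (stride 1) followed by a 2x2 pool (stride 2).
print(receptive_field([(3, 1), (3, 1), (2, 2)]))   # -> 6
```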

RNNs

  • Stanford Natural Language Processing with Deep Learning course | Winter 2021
    • sessions 5, 6, and 7
  • AI-Med internship videos
    • sessions 6 and 7

Build a neural network that takes two binary sequences as input and outputs a binary sequence that is their sum (the length of the sequences is not fixed). For example:

| Time | input 1 | input 2 | output |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 |
| 3 | 1 | 1 | 1 |
| 4 | 0 | 0 | 1 |

At time 1 the network receives the least significant bit and at time 4 it receives the most significant bit. Evaluate your NN on long sequences.
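A minimal PyTorch sketch of one way to set this up (PyTorch, the hidden size, and the training schedule are all assumptions, not part of the exercise): a vanilla RNN reads the two addend bits at each step, LSB first, predicts the corresponding sum bit, and is then evaluated on longer sequences than it was trained on.

```python
import torch
import torch.nn as nn

class BinaryAdder(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=2, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, time, 2)
        h, _ = self.rnn(x)
        return self.out(h).squeeze(-1)        # one logit per time step

def make_batch(batch_size=64, n_bits=8):
    a = torch.randint(0, 2, (batch_size, n_bits)).float()
    b = torch.randint(0, 2, (batch_size, n_bits)).float()
    # Bit-by-bit addition, LSB first; the final carry is dropped so the target
    # has the same length as the inputs (i.e. the sum modulo 2**n_bits).
    carry = torch.zeros(batch_size)
    y = torch.zeros(batch_size, n_bits)
    for t in range(n_bits):
        s = a[:, t] + b[:, t] + carry
        y[:, t] = s % 2
        carry = (s >= 2).float()
    return torch.stack([a, b], dim=-1), y

model = BinaryAdder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(2000):
    x, y = make_batch()
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Evaluate on longer sequences than those seen during training.
with torch.no_grad():
    x, y = make_batch(batch_size=256, n_bits=32)
    acc = ((model(x) > 0).float() == y).float().mean()
print(f"bit accuracy on 32-bit sequences: {acc.item():.3f}")
```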

Generative Models

Generative models are deep models that generate fake data (images, text, video, etc.). We want this generated data to look real to humans and even to other intelligent systems. The source for this topic is:

  • AI-Med internship videos
    • session 8
Authors

Mahdi Ghaznavi