ML Basics
Basic Theory of ML modules
This notebook contains a roadmap for learning the theoretical basis of the main ML modules and approaches, along with some exercises for evaluating yourself. We hope you enjoy and learn :)
First of all, please watch the first four sessions of Andrew Ng's famous ML course. Then, try to answer the following questions.
Consider the linear regression $\hat{y}=w^T x$ on $S=\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ with loss function $J(w)=\sum_{i=1}^{m} (y^{(i)}-\hat{y}^{(i)})^2$.
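As a quick reference, here is a minimal numpy sketch of this setup and its gradient (the data and shapes below are illustrative assumptions, not part of the exercise):

```python
import numpy as np

# Illustrative data: m samples with d features (shapes are an assumption)
m, d = 100, 3
X = np.random.randn(m, d)      # row i is x^{(i)T}
y = np.random.randn(m)

def J(w):
    """Sum-of-squares loss J(w) = sum_i (y^(i) - w^T x^(i))^2."""
    return np.sum((y - X @ w) ** 2)

def grad_J(w):
    """Gradient of J w.r.t. w: -2 * sum_i (y^(i) - yhat^(i)) x^(i)."""
    return -2 * X.T @ (y - X @ w)

w = np.zeros(d)
print(J(w), grad_J(w).shape)
```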
Answer the following questions:
Prove that $$\frac{d\sigma(a)}{da}=\sigma(a)(1-\sigma(a)).$$
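If you want to sanity-check your proof numerically, a finite-difference comparison is a quick way (purely illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
numeric  = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
analytic = sigmoid(a) * (1 - sigmoid(a))
print(numeric, analytic)   # the two values should agree to ~10 decimal places
```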
In logistic regression, we have $p(C_1|x)=\sigma(w^T x)$. Compute the negative log-likelihood for the dataset $\{(x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})\}$.
Show that the gradient of the log-likelihood from the previous part w.r.t. $w$ is $\sum_{i=1}^{n}(y^{(i)}-\hat{y}^{(i)})x^{(i)}$ (equivalently, the gradient of the negative log-likelihood is $\sum_{i=1}^{n}(\hat{y}^{(i)}-y^{(i)})x^{(i)}$). Compare this with the MSE regression gradient.
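A small sketch for checking both parts numerically (the data below is random and only for illustration; `nll` is the negative log-likelihood of the previous part and `grad_nll` its gradient):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative data (shapes are an assumption)
n, d = 50, 4
X = np.random.randn(n, d)                      # row i is x^{(i)T}
y = (np.random.rand(n) > 0.5).astype(float)
w = 0.1 * np.random.randn(d)

def nll(w):
    """Negative log-likelihood of the logistic model p(C_1|x) = sigma(w^T x)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_nll(w):
    """Gradient of the negative log-likelihood: sum_i (yhat^(i) - y^(i)) x^(i)."""
    return X.T @ (sigmoid(X @ w) - y)

# Finite-difference check of the first coordinate
eps = 1e-6
e0 = np.zeros(d); e0[0] = 1.0
print((nll(w + eps * e0) - nll(w - eps * e0)) / (2 * eps), grad_nll(w)[0])
```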
Show that $$\log{\frac{p(C_1|x)}{p(C_0|x)}}=w_1^T x+w_1'.$$ Generalize it to $k$ classes to arrive at the softmax formula.
(optional) If $$L=-\sum_i y_i\log{p_i},$$ where $p_i=p(C_i|x)$ is the output of the softmax over the logits $$o_i=w_i^T x + w_i', \quad 1\leq i\leq k,$$ show that $\nabla_o L=p-y$, where $y$ is the one-hot label vector for $x$ and $p=(p_1, ..., p_k)$.
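A numerical check of this identity (illustrative shapes; the finite-difference gradient should agree with $p-y$ to high precision):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

k = 5
o = np.random.randn(k)           # logits o_i = w_i^T x + w_i'
y = np.zeros(k); y[2] = 1.0      # one-hot label

def loss(o):
    return -np.sum(y * np.log(softmax(o)))

analytic = softmax(o) - y        # claimed gradient p - y
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(k)[i]) - loss(o - eps * np.eye(k)[i])) / (2 * eps)
    for i in range(k)
])
print(np.max(np.abs(numeric - analytic)))   # should be tiny (~1e-9)
```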
(optional) In logistic regression for $K$ classes, the posterior probabilities are computed as follows: $$\begin{equation} \left\{ \begin{array}{@{}ll@{}} P(Y=k|X=x)=\frac{\exp(w_k^T x)}{1+\sum_{l=1}^{K-1}\exp(w_l^T x)}, & (k=1, ..., K-1) \\ P(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}\exp(w_l^T x)} \end{array}\right. \end{equation}$$ For simplicity, consider $w_K=0$.
How many parameters should be estimated? What are they?
Simplify the following log-likelihood for $n$ training samples $\{(x_1, y_1), ..., (x_n, y_n)\}$ $$L(w_1, ..., w_{K-1})=\sum_{i=1}^{n}\ln{P(Y=y_i|X=x_i)}$$
Compute and simplify the gradient of $L$ w.r.t. each of the $w_k$.
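A small numpy sketch of this parameterization with $w_K=0$ (sizes and data are illustrative assumptions); it also makes the parameter shapes concrete, since `W` holds exactly the weight vectors $w_1, ..., w_{K-1}$:

```python
import numpy as np

# Illustrative sizes and data (assumptions, not from the exercise)
K, d, n = 4, 3, 20
W = 0.1 * np.random.randn(K - 1, d)       # the parameters: w_1, ..., w_{K-1}
X = np.random.randn(n, d)                 # row i is x_i^T
y = np.random.randint(1, K + 1, size=n)   # labels y_i in {1, ..., K}

def log_likelihood(W):
    """L(w_1, ..., w_{K-1}) = sum_i ln P(Y = y_i | X = x_i), with w_K = 0."""
    scores = X @ W.T                                  # entry (i, k) is w_k^T x_i
    full = np.hstack([scores, np.zeros((n, 1))])      # append w_K^T x_i = 0
    log_Z = np.log(np.exp(full).sum(axis=1))          # ln(1 + sum_{l<K} exp(w_l^T x_i))
    return np.sum(full[np.arange(n), y - 1] - log_Z)

print(log_likelihood(W))
```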
Consider the following objective function. Compute the gradient of $f$ w.r.t. each of the $w_k$.
The source for this topic is:
Consider the following network: $$z_1=W_1 x^{(i)}+b_1$$ $$a_1=ReLU(z_1)$$ $$z_2=W_1 x'^{(i)}+b_1$$ $$a_2=ReLU(z_2)$$ $$a=a_1-a_2$$ $$z_3=W_2 a+b_2$$ $$\hat{y}^{(i)}=\sigma(z_3)$$ $$L^{(i)}=y^{(i)}\log{\hat{y}^{(i)}}+(1-y^{(i)})\log{(1-\hat{y}^{(i)})}$$ $$J=-\frac{1}{m}\sum_{i=1}^{m}L^{(i)}$$ where the inputs are $x^{(i)}\in\mathbb{R}^{d_x\times1}, x'^{(i)}\in\mathbb{R}^{d_x\times 1}$ and the output is $\hat{y}^{(i)}\in(0,1)$ (the label is $y^{(i)}\in\{0, 1\}$). Also, $a\in\mathbb{R}^{d_a\times 1}$. Compute the following:
$\frac{\partial J}{\partial z_3}$
$\frac{\partial z_3}{\partial a}$
$\frac{\partial a}{\partial z_1}$ and $\frac{\partial a}{\partial z_2}$
$\frac{\partial z_2}{\partial W_1}$ and $\frac{\partial z_1}{\partial W_1}$
$\frac{\partial J}{\partial W_1}$
Write down the formula for updating all weights based on gradient descent.
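To check your derivations numerically, here is a minimal sketch of one forward pass of this network (dimensions and random data are assumptions for illustration); you can then finite-difference `J` with respect to any weight and compare against your formulas:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions (assumptions, not from the exercise)
d_x, d_a, m = 5, 4, 8
W1 = 0.1 * np.random.randn(d_a, d_x)
b1 = np.zeros((d_a, 1))
W2 = 0.1 * np.random.randn(1, d_a)
b2 = np.zeros((1, 1))

X  = np.random.randn(d_x, m)                 # column i is x^{(i)}
Xp = np.random.randn(d_x, m)                 # column i is x'^{(i)}
Y  = (np.random.rand(1, m) > 0.5).astype(float)

def forward(W1, b1, W2, b2):
    a1 = relu(W1 @ X  + b1)                  # a_1 = ReLU(W_1 x + b_1)
    a2 = relu(W1 @ Xp + b1)                  # a_2 = ReLU(W_1 x' + b_1), same W_1 and b_1
    a  = a1 - a2
    y_hat = sigmoid(W2 @ a + b2)             # shape (1, m)
    L = Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat)
    return -L.mean()                         # J = -(1/m) sum_i L^{(i)}

print(forward(W1, b1, W2, b2))
```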
CNNs are perhaps the most important module in image processing. Their architecture carries an inductive bias toward extracting local features: each unit looks only at a small spatial neighborhood, and the same weights are shared across all positions.
The source for this topic is:
Answer the following questions:
Complete the following table. Padding and stride are equal to 1, unless explicitly stated otherwise.
Layer | Output Shape | # of Parameters |
---|---|---|
Input | 128$\times$128$\times$3 | 0 |
CONV-9-32 | ||
POOL-2 | ||
CONV-5-64 | ||
POOL-2 | ||
CONV-5-64 | ||
POOL-2 | ||
FC-3 | | |
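The helper below shows the bookkeeping needed to fill the table, under the assumption that CONV-F-N denotes N filters of size F×F and POOL-2 a 2×2 pooling layer (this reading of the notation is my assumption, so double-check it against the course convention):

```python
def conv_out(n, f, stride=1, pad=1):
    """Spatial output size of a convolution: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

def conv_params(f, c_in, c_out):
    """f*f*c_in weights per filter plus one bias, for c_out filters."""
    return (f * f * c_in + 1) * c_out

# Example for the first row (assuming CONV-9-32 = 32 filters of size 9x9):
h = conv_out(128, 9)             # spatial size after CONV-9-32 on a 128x128x3 input
print(h, conv_params(9, 3, 32))  # -> 122, and (9*9*3 + 1)*32 = 7808 parameters
```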
What would the number of parameters be if the fourth layer (CONV-5-64) were replaced with a fully-connected layer? What is your conclusion?
If we run the following network on a GPU with 12 GB of RAM, what is the maximum number of images we can fit in a batch? (You should find the memory bottleneck; a rough bookkeeping sketch is given after the layer list.)
Input: 256 x 256
[64] Conv 3 x 3, s=1, p=1
[64] Conv 3 x 3, s=1, p=1
Pool 2 x 2, s=2, p=0
[128] Conv 3 x 3, s=1, p=1
[128] Conv 3 x 3, s=1, p=1
Pool 2 x 2, s=2, p=0
[256] Conv 3 x 3, s=1, p=1
[256] Conv 3 x 3, s=1, p=1
Pool 2 x 2, s=2, p=0
[512] Conv 3 x 3, s=1, p=1
[512] Conv 3 x 3, s=1, p=1
Pool 2 x 2, s=2, p=0
[512] Conv 3 x 3, s=1, p=1
[512] Conv 3 x 3, s=1, p=1
Pool 2 x 2, s=2, p=0
Flatten
FC (4096)
FC (4096)
FC (2)
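A rough way to locate the memory bottleneck is to count forward activations layer by layer. The sketch below assumes float32 activations and 3 input channels (the channel count is not stated above), and ignores weights and gradients, so treat it as bookkeeping only:

```python
# Rough forward-activation memory per image, assuming float32 (4 bytes) and 3 input channels.
layers = [
    ("input", 256, 3),
    ("conv", 256, 64), ("conv", 256, 64), ("pool", 128, 64),
    ("conv", 128, 128), ("conv", 128, 128), ("pool", 64, 128),
    ("conv", 64, 256), ("conv", 64, 256), ("pool", 32, 256),
    ("conv", 32, 512), ("conv", 32, 512), ("pool", 16, 512),
    ("conv", 16, 512), ("conv", 16, 512), ("pool", 8, 512),
    ("fc", 1, 4096), ("fc", 1, 4096), ("fc", 1, 2),
]

total = 0
for name, size, channels in layers:
    n_activations = size * size * channels
    total += n_activations
    print(f"{name:6s} {size:4d}x{size:<4d}x{channels:<4d} -> {n_activations:>10d} floats")

print(f"~{total * 4 / 2**20:.1f} MB of activations per image (float32)")
```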
Determine the receptive field of the neuron $(i, j)$ in the last convolution layer.
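One common way to verify a receptive-field computation is to propagate a (receptive-field size, jump) pair through the layers; the sketch below applies the standard recursion to the conv/pool stack above, up to and including the last convolution layer:

```python
# (kernel_size, stride) for every layer up to and including the last conv layer
block = [(3, 1), (3, 1), (2, 2)]           # conv, conv, pool
layers = block * 4 + [(3, 1), (3, 1)]      # four conv-conv-pool blocks, then the last two convs

r, j = 1, 1                                # receptive-field size and jump, in input pixels
for k, s in layers:
    r = r + (k - 1) * j                    # a k-wide window over units spaced j pixels apart
    j = j * s
print(r)                                   # side length of the receptive field
```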
Build a neural network that takes two binary sequences as input and outputs the binary sequence that is their sum (the length of the sequences is not fixed). For example:
Time | input 1 | input 2 | output |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 1 | 0 |
3 | 1 | 1 | 1 |
4 | 0 | 0 | 1 |
At time 1 the network receives the least significant bit and at time 4 it receives the most significant bit. Evaluate your NN on long sequences.
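A small helper for generating training pairs in the required LSB-first encoding may be useful (the encoding is the only thing this sketch fixes; the network architecture itself is up to you):

```python
import numpy as np

def make_example(a, b, n_bits):
    """Encode integers a and b as LSB-first bit sequences, with their sum as the target."""
    s = a + b
    x1 = [(a >> t) & 1 for t in range(n_bits)]   # the bit fed at time t+1 (t=0 is the LSB)
    x2 = [(b >> t) & 1 for t in range(n_bits)]
    y  = [(s >> t) & 1 for t in range(n_bits)]
    return np.array(list(zip(x1, x2)), dtype=np.float32), np.array(y, dtype=np.float32)

X, Y = make_example(6, 7, 4)   # reproduces the table above: 6 + 7 = 13
print(X)                        # row t is (input 1, input 2) at time t+1
print(Y)                        # -> [1. 0. 1. 1.]
```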
Generative models are deep models that generate synthetic data (images, text, video, etc.). We want this generated data to look real to humans, and even to other intelligent systems. The source for this topic is: