From Machine Learning to Deep Learning

Logistic classifier (linear classifier)

Wx + b = y (W is the weight matrix; x, b, and y are vectors: the input, the bias, and the scores)

softmax

S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}

turn scores into probabilities

import numpy

def softmax(x):
    return numpy.exp(x) / numpy.sum(numpy.exp(x), axis=0)  # axis=0 also handles a matrix of score columns

Note: in numpy, * and / operate element-wise; use dot (or @) for matrix multiplication.
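For example (the scores are chosen just for illustration):

scores = numpy.array([3.0, 1.0, 0.2])
print(softmax(scores))  # approximately [0.836 0.113 0.051]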

one-hot encoding

L(y_i): 1.0 for the correct label y_i, and 0 for all the others

  1. use embeddings to handle large, sparse matrices
  2. measure how well we are doing by comparing two vectors: the predicted probabilities and the one-hot encoded label (a minimal one-hot sketch follows this list)
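A minimal one-hot encoding sketch; the function name and the integer-label convention are assumptions for illustration:

import numpy as np

def one_hot(labels, num_classes):
    # 1.0 at the position of the correct label, 0 everywhere else
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

print(one_hot([0, 2, 1], 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]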

cross entropy

Measures the distance between the predicted probabilities and the one-hot encoded label vector.

D(S, L) = -\sum_i L_i \log(S_i)

Be careful: cross entropy is not symmetric, D(S, L) != D(L, S).
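A minimal sketch of the cross-entropy distance, with assumed example values:

import numpy as np

def cross_entropy(S, L):
    # D(S, L) = -sum_i L_i * log(S_i)
    return -np.sum(L * np.log(S))

S = np.array([0.7, 0.2, 0.1])   # softmax probabilities (assumed)
L = np.array([1.0, 0.0, 0.0])   # one-hot encoded correct label
print(cross_entropy(S, L))      # ~0.357: low distance, the prediction is good
# note: D(L, S) would place L inside the log, so the distance is not symmetric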

Multinomial Logistic Classification


Aim

  1. have a high cross-entropy distance for incorrect labels, but a low distance for the correct label
  2. to minimize the training loss

Training loss

The average cross-entropy distance over all instances in the training set.

It is a humongous function of the weights and biases.
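Written out with the symbols already defined (N training examples x_i with one-hot labels L_i), the training loss is:

Loss = \frac{1}{N} \sum_{i=1}^{N} D(S(W x_i + b), L_i)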

Normalized Inputs and Initial Weights and Bias

1. Normalized Inputs

To avoid losing numerical precision (badly scaled inputs make the optimization poorly conditioned).

When dealing with images, map each 8-bit channel to roughly zero mean and small variance:

(r - 128)/128, (g - 128)/128, (b - 128)/128
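A small sketch of this normalization; the dummy image array is an assumption standing in for real pixel data:

import numpy as np

image = np.random.randint(0, 256, size=(28, 28, 3)).astype(np.float32)  # dummy 8-bit RGB image
normalized = (image - 128.0) / 128.0   # each channel now lies in [-1, 1), roughly zero mean
print(normalized.mean(), normalized.min(), normalized.max())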

2. A simple way to pick initial weights

To avoid losing precision and to speed up the optimization.

A possible way: gradient descent

Compute the derivatives of the loss and follow them downhill until reaching the minimum.

Questions (the optimizer is still a black box here):

  1. how do you feed the image pixels into the classifier
  2. where do you initialize the optimization

Validation and test set size - rule of thumb: 30

A change that affects at least 30 examples in your validation set, one way or another, is usually statistically significant and can typically be trusted.

By the rule of 30, the accuracy change you can trust depends on the size of your validation set:

when the validation set holds 30000 examples, a change greater than 0.1% (i.e. at least 30 examples) can be trusted

Cross-validation is one way to address this problem, but it is often too slow;

getting more data is usually the right solution.

Optimizing a logistic classifier using gradient descent

The important question is how to scale gradient descent to large amounts of data.

Stochastic Gradient Descent (SGD)

To reduce running time (computing the gradient over the entire training set is expensive), instead of taking one large step based on all the training data, take many small steps, each based on a small random fraction of the training set (an estimate of the actual gradient).
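A minimal SGD sketch for the softmax classifier above; the function name, hyperparameter values, and dummy data are all illustrative assumptions, not part of the course:

import numpy as np

def sgd_softmax(X, y, num_classes, learning_rate=0.1, batch_size=128, steps=1000):
    n, d = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 0.01, size=(num_classes, d))  # random, zero-mean, small-variance weights
    b = np.zeros(num_classes)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)     # a small random fraction of the data
        xb, yb = X[idx], y[idx]
        logits = xb @ W.T + b                         # Wx + b
        logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
        probs[np.arange(batch_size), yb] -= 1.0       # gradient of cross-entropy w.r.t. logits: S - L
        W -= learning_rate * (probs.T @ xb) / batch_size  # small step along the estimated gradient
        b -= learning_rate * probs.mean(axis=0)
    return W, b

# dummy data, just to show the call
X = np.random.default_rng(1).normal(size=(1000, 20))
y = (X[:, 0] > 0).astype(int)
W, b = sgd_softmax(X, y, num_classes=2)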


Helping SGD

Initialization

  1. inputs
    • zero mean
    • equal (small) variance
  2. initial weights (see the sketch after this list)
    • random
    • zero mean
    • equal (small) variance
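A small sketch of such an initialization; sigma and the shapes (10 outputs, 784 inputs) are assumed values:

import numpy as np

sigma = 0.01
W = np.random.normal(loc=0.0, scale=sigma, size=(10, 784))  # random, zero mean, small equal variance
b = np.zeros(10)
print(W.mean(), W.std())  # close to 0 and sigma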

Momentum

Motivation:

  1. each step goes in a somewhat random direction
  2. on aggregate, the steps move toward the minimum loss

Keep a running average of the gradients, M = 0.9 * M + gradient, and use M instead of the raw gradient as the direction of each step.
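A toy momentum sketch on the quadratic loss f(w) = ||w||^2 / 2, whose gradient is simply w; the loss, learning rate, and step count are assumptions chosen for illustration:

import numpy as np

w = np.array([5.0, -3.0])
M = np.zeros_like(w)
learning_rate = 0.1
for _ in range(100):
    gradient = w               # gradient of the toy loss
    M = 0.9 * M + gradient     # running average of the gradients
    w -= learning_rate * M     # step along the averaged direction
print(w)                       # close to the minimum at [0, 0]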


Learning Rate decay

  1. decay the learning rate exponentially over time (a short sketch follows this list)
  2. or lower it whenever the loss reaches a plateau
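A short exponential-decay sketch; the initial rate and decay factor are assumed values:

initial_rate = 0.5
decay = 0.96
for step in range(10):
    learning_rate = initial_rate * decay ** step  # shrink the step size exponentially over time
    print(step, round(learning_rate, 4))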


Parameter Hyperspace


Intro to Deep Neural Networks

  1. Regularization: enables us to increase the scale of datasets and models
  2. If you have N inputs and K outputs, the complexity of the simple linear model above is (N+1)K parameters (see the worked example after this list)
  3. Linear models are limited: they can only represent linear relationships between inputs and outputs
  4. Linear models are cheap (big matrix multiplies are exactly what GPUs are built for) and well behaved: a small change in the input can never yield a big change in the output
  5. The derivatives of a linear model are constant and stable
  6. We'd like to keep our parameters inside big linear functions, but we also want the model as a whole to be nonlinear
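A quick worked example; 28x28 input images and 10 output classes are assumed for illustration:

(N + 1) \times K = (28 \times 28 + 1) \times 10 = 785 \times 10 = 7850 \text{ parameters}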

Rectified Linear Unit (ReLU)

a nonlinear function preferred by lazy engineers

y = x if x > 0 else 0

Insert a ReLU between two linear functions and the model becomes nonlinear; H is the number of ReLU units we inserted.

The first Neural Network

  1. The first layer effectively consists of the set of weights and biases applied to X and passed through ReLUs. The output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer.
  2. The second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities, as sketched below.
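A minimal sketch of this two-layer network; the sizes (784 inputs, H = 256 hidden ReLU units, K = 10 classes), the 0.01 init scale, and the dummy input are assumptions:

import numpy as np

rng = np.random.default_rng(0)
num_features, H, K = 784, 256, 10
W1, b1 = rng.normal(0, 0.01, (H, num_features)), np.zeros(H)  # hidden layer weights and biases
W2, b2 = rng.normal(0, 0.01, (K, H)), np.zeros(K)             # output layer weights and biases

def relu(x):
    return np.maximum(x, 0)          # y = x if x > 0 else 0

def forward(x):
    hidden = relu(W1 @ x + b1)       # hidden layer: linear, then ReLU (not observable outside)
    logits = W2 @ hidden + b2        # second layer: linear
    logits = logits - logits.max()   # numerical stability
    return np.exp(logits) / np.exp(logits).sum()  # softmax probabilities

x = rng.normal(0, 1, num_features)   # a dummy input
print(forward(x).sum())              # the probabilities sum to 1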

Chain Rule

For h(x) = g(f(x)), h'(x) = g'(f(x)) * f'(x) (a worked example follows below):

  1. we can compute the derivative of h by taking the product of the derivatives of its components
  2. it is easy to compute the derivative of the whole function, with lots of data reuse and pipelining
  3. so we can build the model out of many simple functions, and the deep learning framework handles the derivatives for us
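A quick worked example of the rule: for h(x) = (3x + 1)^2, take f(x) = 3x + 1 and g(u) = u^2; then

h'(x) = g'(f(x)) \cdot f'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1)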

Back-propagation

As long as the function is made up of simple blocks with simple derivatives, the learning framework works out the derivative of the whole function for you, block by block. This makes computing complex derivatives very efficient.

Forward prop: run the model forward up to the prediction. Back prop: run the graph backward to compute the derivatives. Back prop typically takes about twice the compute and memory of forward prop, as sketched below.
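A back-prop sketch for the two-layer network, continuing from the sketch above (W1, b1, W2, b2, relu, and x are reused); the label value in the usage line is an assumption:

def backward(x, label):
    # forward prop: run the model up to the prediction, keeping the intermediate values
    z1 = W1 @ x + b1
    hidden = relu(z1)
    logits = W2 @ hidden + b2
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # back prop: apply the chain rule block by block, from the loss backwards
    d_logits = probs.copy()
    d_logits[label] -= 1.0          # derivative of cross-entropy w.r.t. logits: S - L
    d_W2 = np.outer(d_logits, hidden)
    d_b2 = d_logits
    d_hidden = W2.T @ d_logits
    d_z1 = d_hidden * (z1 > 0)      # ReLU passes the gradient only where its input was positive
    d_W1 = np.outer(d_z1, x)
    d_b1 = d_z1
    return d_W1, d_b1, d_W2, d_b2

grads = backward(x, label=3)        # gradients for every weight and bias in the network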