logistic classifier (linear classifier)
Wx + b = y (W is the weight matrix; x, b, and y are vectors)
- just a giant matrix multiply
- x is the input vector
- W is the weight matrix
- b is the bias term
- the scores produced by a logistic classifier are often called logits
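A minimal numpy sketch of the score computation; the sizes (784 pixel inputs, 10 classes) are only assumed for illustration:

```python
import numpy

N, K = 784, 10                         # assumed: N input features, K classes
W = numpy.zeros((K, N))                # weight matrix
b = numpy.zeros(K)                     # bias vector
x = numpy.random.rand(N)               # one input vector (random placeholder)

logits = W.dot(x) + b                  # the scores ("logits"), shape (K,)
```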
softmax
S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
turn scores into probabilities
```python
import numpy

def softmax(x):
    """Turn scores into probabilities (normalized along axis 0)."""
    return numpy.exp(x) / numpy.sum(numpy.exp(x), axis=0)
```
Note: in numpy, * and / are element-wise; use dot for matrix multiplication.
- if you scale the scores up (e.g. multiply them by 10), the softmax probabilities get close to 0 or 1 and the classifier becomes very confident; if you scale them down, the probabilities approach a uniform distribution and it becomes less confident
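A quick check of this with the softmax defined above; the example scores are arbitrary:

```python
scores = numpy.array([3.0, 1.0, 0.2])
print(softmax(scores))        # [0.84 0.11 0.05]  -> moderately confident
print(softmax(scores * 10))   # [1.00 0.00 0.00]  -> scaled up: very confident
print(softmax(scores / 10))   # [0.39 0.32 0.29]  -> scaled down: close to uniform
```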
one-hot encoding
L(y_i) = 1.0 for the correct label y_i and 0.0 for all the others (sketched below)
- use embeddings to handle the large, sparse vectors that one-hot encoding produces when there are many classes
- measure how well we are doing by comparing two vectors: the predicted probabilities and the one-hot encoded labels
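A small sketch of the one-hot encoding described above, assuming integer class labels and 4 classes:

```python
labels = numpy.array([0, 2, 1, 3])          # example integer labels
one_hot = numpy.eye(4)[labels]              # one row per example
print(one_hot[1])                           # label 2 -> [0. 0. 1. 0.]
```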
cross entropy
measures the distance between the predicted probabilities and the one-hot encoded label vector
D(S, L) = -\sum_i L_i \log(S_i)
Be careful: cross entropy is not symmetric, D(S, L) != D(L, S)
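A direct translation of the formula, reusing the softmax defined above; S is the softmax output, L the one-hot labels, and the small epsilon is only there to avoid log(0):

```python
def cross_entropy(S, L, eps=1e-12):
    # D(S, L) = -sum_i L_i * log(S_i)
    return -numpy.sum(L * numpy.log(S + eps))

S = softmax(numpy.array([3.0, 1.0, 0.2]))
print(cross_entropy(S, numpy.array([1.0, 0.0, 0.0])))   # ~0.18: correct label, small distance
print(cross_entropy(S, numpy.array([0.0, 0.0, 1.0])))   # ~2.98: wrong label, large distance
```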
Multinomial Logistic Classification
Aim
- a high cross-entropy distance for incorrect labels, but a low distance for the correct label
- to minimize the training loss
Training loss
the average cross-entropy distance over all instances in the training set
it is a humongous function
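Written out with the notation above (x_i is the i-th training input, L_i its one-hot label, N the number of training examples):

\mathcal{L} = \frac{1}{N} \sum_i D(S(W x_i + b), L_i)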
Normalized Inputs and Initial Weights and Bias
1. Normalized Inputs
To avoid numerical problems (a badly conditioned optimization)
- zero mean: mean(x_i) == 0
- equal, small variance across the x_i
when dealing with images:
(R - 128) / 128, (G - 128) / 128, (B - 128) / 128
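A sketch for an 8-bit RGB image stored as a numpy array with pixel values in 0..255:

```python
def normalize_image(img):
    # map each channel from [0, 255] to roughly [-1, 1]: zero mean, small equal variance
    return (img.astype(numpy.float32) - 128.0) / 128.0
```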
2. A simple way of picking initial weights
To avoid numerical problems and speed up training
- draw the weights randomly from a Gaussian distribution with mean zero and standard deviation sigma
- the value of sigma determines the order of magnitude of your outputs at the initial point
- a large sigma makes the output distribution have large peaks: the classifier starts out very opinionated
- a small sigma makes it less confident
- it is better to begin with an uncertain distribution and let the optimization make it more confident as training progresses (see the sketch below)
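A minimal sketch of this initialization; the sizes and sigma = 0.01 are assumed values:

```python
N, K = 784, 10
sigma = 0.01                                       # small -> uncertain initial distribution
W = numpy.random.normal(0.0, sigma, size=(K, N))
b = numpy.zeros(K)
```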
A possible way: gradient descent
compute the derivatives of the loss and keep stepping against them until you reach the minimum.
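A toy sketch of the idea on a one-dimensional function (not the actual classifier loss); the learning rate is an assumed value:

```python
# minimize f(w) = (w - 3)^2 by gradient descent
w, alpha = 0.0, 0.1              # alpha is the learning rate
for step in range(100):
    grad = 2 * (w - 3)           # derivative of f at w
    w = w - alpha * grad         # step against the gradient
print(w)                         # ~3.0, the minimum
```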
Questions (the optimizer is still a black box here):
- how do you feed the pixels into the classifier
- where do you initialize the optimization
validation / test set size - rule of thumb of 30
a change that affects at least 30 examples in your validation set, one way or another, is usually statistically significant and can typically be trusted
the following picture (from the course) shows which accuracy changes are acceptable under the rule of thumb of 30:
with a validation set of 30000 examples, a change > 0.1% affects more than 30 examples and can be trusted
cross-validation is one way to mitigate this problem, but it is often too slow
getting more data is usually the right solution
Optimizing a logistic classifier using gradient descent
the important question is how to scale gradient descent to large data sets
Stochastic Gradient Descent (SGD)
To reduce running time (computing the gradient over the entire training set is very expensive), instead of taking one large step based on all of the training data, take many small steps, each based on a small random fraction of the training set (an estimate of the actual gradient)
- it is a terrible estimate of the gradient, in fact
- but it works in practice, though it comes with a lot of issues
- the samples need to be picked very randomly (a minimal sketch follows this list)
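A minimal SGD sketch for the softmax classifier on toy random data; all sizes, the learning rate, and the batch size are assumed values. The gradient of the average cross-entropy with respect to the logits works out to (probs - labels) / batch_size, which is used directly here:

```python
import numpy

num_examples, N, K = 1000, 20, 3
X = numpy.random.randn(num_examples, N)                             # toy inputs
labels = numpy.eye(K)[numpy.random.randint(K, size=num_examples)]   # toy one-hot labels

W = numpy.random.normal(0.0, 0.01, size=(N, K))
b = numpy.zeros(K)
alpha, batch_size = 0.1, 32

for step in range(1000):
    idx = numpy.random.choice(num_examples, batch_size)   # small random fraction of the data
    Xb, Lb = X[idx], labels[idx]
    logits = Xb.dot(W) + b
    probs = numpy.exp(logits) / numpy.sum(numpy.exp(logits), axis=1, keepdims=True)
    grad_logits = (probs - Lb) / batch_size                # d(average cross-entropy)/d(logits)
    W -= alpha * Xb.T.dot(grad_logits)                     # small step on an estimated gradient
    b -= alpha * grad_logits.sum(axis=0)
```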
Helping SGD
initialization
- inputs
  - zero mean
  - equal (small) variance
- initial weights
  - random
  - zero mean
  - equal (small) variance
Momentum
Motivation:
- each SGD step goes in a somewhat random direction
- on aggregate, the steps carry us toward the minimum of the loss
keep a running average of the gradients, M = 0.9 M + gradient, and use M instead of the current gradient as the direction of each step
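A toy sketch of momentum on the same one-dimensional function as before; the decay 0.9 matches the rule above, the other values are assumed:

```python
# minimize f(w) = (w - 3)^2 using a running average of gradients
w, M = 0.0, 0.0
alpha, beta = 0.1, 0.9
for step in range(200):
    grad = 2 * (w - 3)
    M = beta * M + grad          # running average of the gradients
    w = w - alpha * M            # step along M instead of the raw gradient
print(w)                         # ~3.0, the minimum
```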
Learning Rate Decay
- make the learning rate smaller as training progresses, e.g. exponentially
- or lower it whenever the loss reaches a plateau
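A sketch of an exponential schedule; the initial rate and the decay factor are assumed values:

```python
alpha0, decay = 0.5, 0.96
for epoch in range(20):
    alpha = alpha0 * (decay ** epoch)     # smaller and smaller steps as training progresses
    print(epoch, alpha)
    # ... run the SGD updates for this epoch with learning rate alpha ...
```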
Parameter Hyperspace
Intro to Deep Neural Networks
- Regularization: makes it possible to increase the scale of the dataset and the model
- if you have N inputs and K outputs, the simple linear model above has (N + 1) * K parameters (e.g. 28 x 28 = 784 pixel inputs and 10 classes give 785 * 10 = 7850 parameters)
- linear models are limited in what they can represent
- linear models are cheap (big matrix multiplies are exactly what GPUs are fast at) and well behaved: a small change in the input can never yield a big change in the output
- the derivative of a linear model is constant and stable
- we would like to keep our parameters inside big linear functions, but we also want the model as a whole to be nonlinear
Rectified Linear Unit (ReLU)
the simplest nonlinear function, and the one lazy engineers prefer
y = max(0, x)
insert a ReLU between two linear functions and the model becomes nonlinear
H is the number of ReLU units we insert (the width of the hidden layer)
The first Neural Network
- The first layer effectively consists of the set of weights and biases applied to X and passed through ReLUs. The output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer.
- The second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities.
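A numpy sketch of this two-layer network; the sizes N, H, K and the initialization scale are assumed:

```python
import numpy

N, H, K = 784, 100, 10                        # inputs, hidden ReLU units, classes

W1 = numpy.random.normal(0.0, 0.01, (N, H))   # first (hidden) layer weights
b1 = numpy.zeros(H)
W2 = numpy.random.normal(0.0, 0.01, (H, K))   # second (output) layer weights
b2 = numpy.zeros(K)

def relu(x):
    return numpy.maximum(0, x)                # y = max(0, x), element-wise

def softmax(z):
    e = numpy.exp(z - numpy.max(z))           # shifted for numerical stability
    return e / e.sum()

def forward(x):
    h = relu(x.dot(W1) + b1)                  # hidden layer: not observable from outside
    logits = h.dot(W2) + b2                   # second layer produces the logits
    return softmax(logits)                    # probabilities over the K classes

x = numpy.random.randn(N)
print(forward(x).sum())                       # ~1.0: a valid probability distribution
```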
Chain Rule
For h(x) = g(f(x)), h'(x) = g'(f(x)) * f'(x)
- we can compute the derivative of h by taking the product of the derivatives of its components
- this makes it easy to compute the derivative of the whole function, with lots of data reuse and pipelining
- so we can build the model out of many simple functions, and the deep learning framework can still handle it
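A tiny numeric check of the chain rule; f and g are arbitrary example functions:

```python
import math

f, df = lambda x: x * x, lambda x: 2 * x          # f(x) = x^2
g, dg = math.sin, math.cos                        # g(u) = sin(u)

x = 1.5
analytic = dg(f(x)) * df(x)                       # h'(x) = g'(f(x)) * f'(x)
numeric = (g(f(x + 1e-6)) - g(f(x))) / 1e-6       # finite-difference check
print(analytic, numeric)                          # both ~ -1.88
```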
Back-propagation
- As long as the function is made up of simple blocks with simple derivatives, the learning framework can work out the derivative of the whole graph block by block
- this makes computing complex derivatives very efficient
- Forward prop: run the model forward up to the prediction
- Back prop: run the graph backwards to compute the derivatives
- as a rule of thumb, back prop takes roughly twice the compute and memory of forward prop
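A sketch of one forward prop / back prop pass through the two-layer network above, worked out by hand with the chain rule; all sizes and values are toy assumptions, and in practice the framework does this for you:

```python
import numpy

N, H, K = 4, 3, 2                                      # tiny assumed sizes
x = numpy.random.randn(N)
L = numpy.array([1.0, 0.0])                            # one-hot label
W1, b1 = numpy.random.randn(N, H) * 0.1, numpy.zeros(H)
W2, b2 = numpy.random.randn(H, K) * 0.1, numpy.zeros(K)

# forward prop: run the model up to the prediction, caching intermediate values
h_pre = x.dot(W1) + b1
h = numpy.maximum(0, h_pre)                            # ReLU
logits = h.dot(W2) + b2
probs = numpy.exp(logits) / numpy.exp(logits).sum()    # softmax
loss = -numpy.sum(L * numpy.log(probs))                # cross entropy

# back prop: walk backwards, multiplying by each block's derivative (chain rule)
d_logits = probs - L                                   # d loss / d logits
dW2 = numpy.outer(h, d_logits)
db2 = d_logits
d_h = W2.dot(d_logits)
d_h_pre = d_h * (h_pre > 0)                            # ReLU derivative: 1 where input > 0
dW1 = numpy.outer(x, d_h_pre)
db1 = d_h_pre
```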