ML Course Notes 01

Week 01 & 02

Note: this is the most classic ML course on Coursera. It is quite basic, but it is a well-organized walkthrough of introductory ML material, with in-lecture and unit quizzes as well as fun Octave programming exercises. My plan is to cover roughly one week of the course per day, so going through the whole course again should take about half a month.

This notes series mixes in my understanding of this course, of the sklearn documentation, and of the watermelon book (Zhou Zhihua's Machine Learning). Since translating everything would be a hassle, I am writing the notes directly in English.

Lecture 01: Introduction

Tom Mitchell (1998) defined ML as follows: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. For example, in spam filtering, T is classifying emails as spam or not spam, E is watching the user label emails, and P is the fraction of emails classified correctly.

pass ;

Lecture 02: Linear regression with one variable

Some terms (their standard definitions are written out after this list):

  • Hypothesis
  • Parameters
  • Cost function (loss vs. cost)
  • Goal
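
For reference, the standard univariate definitions from the course are:

     h_θ(x) = θ_0 + θ_1·x                                        (hypothesis)
     J(θ_0, θ_1) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) - y^(i))^2    (cost function)
     Goal: minimize J(θ_0, θ_1) over θ_0, θ_1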

Gradient descent

  • Outline: start with some initial parameters, then keep changing them to reduce the cost until we hopefully end up at a minimum
  • Algorithm (a NumPy sketch of this follows the list below)
     repeat until convergence {
         for j in {0, 1} {
             temp_j := θ_j - α * ∂J(θ_0, θ_1)/∂θ_j
         }
         simultaneously update θ_0, θ_1 with temp_0, temp_1;
     }
    
    • α: learning rate
    • J: Cost function
    • Simultaneous update is required (all parameters are updated together at the end of each iteration).
    • As we approach a local minimum, gradient descent will automatically take smaller steps, so no need to decrease α over time.
    • Batch: Each step of gradient descent uses all the training samples
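
A minimal NumPy sketch of this batch gradient descent for the univariate case (variable names are my own, not from the course; x and y are 1-D arrays of inputs and targets):

     import numpy as np

     def gradient_descent(x, y, alpha=0.01, num_iters=1500):
         """Batch gradient descent for h(x) = theta0 + theta1 * x."""
         m = len(y)
         theta0, theta1 = 0.0, 0.0
         for _ in range(num_iters):
             h = theta0 + theta1 * x                  # predictions for all m samples
             # partial derivatives of J = (1/2m) * sum((h - y)^2)
             grad0 = (1.0 / m) * np.sum(h - y)
             grad1 = (1.0 / m) * np.sum((h - y) * x)
             # simultaneous update: both gradients were computed from the old thetas
             theta0 -= alpha * grad0
             theta1 -= alpha * grad1
         return theta0, theta1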

 

Lecture 03: Linear Algebra review

pass ;

Lecture 04: Linear Regression with multiple variables

Gradient descent for multiple variables (I think it’s almost the same)

  • Algorithm (a vectorized sketch with feature normalization follows the list below)
     repeat until convergence {
         for each j in 0, 1, ..., n {
             temp_j := θ_j - α * ∂J(θ)/∂θ_j
         }
         simultaneously update all θ_j with temp_j;
     }
    
  • Scaling
    • Idea: Make sure features are on a similar scale.
    • Feature scaling: get every feature (attribute value) into approximately the range [-1, 1].
    • Mean normalization: make every feature (attribute value) have approximately zero mean (by subtracting the feature's mean).
  • Learning rate
    • Making sure gradient descent is working correctly:
      • For sufficiently small α, J(θ) should decrease on every iteration; on the other hand, gradient descent can be slow to converge if α is too small.
      • Example automatic convergence test: declare convergence if J(θ) decreases by less than 10^-3 in one iteration.
    • Summary:
      • If α is too small: slow convergence
      • If α is too large:J(θ) may not decrease on every iteration; may not converge.
      • To start, try α = 0.001, 0.01, 0.1, 1, …
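
A vectorized sketch covering the multivariate update, mean normalization/feature scaling, and a cost history for checking that J(θ) decreases on every iteration (all names are mine; X_raw is an m×n matrix of raw features):

     import numpy as np

     def feature_normalize(X):
         """Mean normalization + scaling: (x - mean) / std, per feature."""
         mu = X.mean(axis=0)
         sigma = X.std(axis=0)
         return (X - mu) / sigma, mu, sigma

     def gradient_descent_multi(X, y, alpha=0.1, num_iters=400):
         """Vectorized batch gradient descent; X must already include a bias column of ones."""
         m = len(y)
         theta = np.zeros(X.shape[1])
         J_history = []
         for _ in range(num_iters):
             h = X @ theta                                    # predictions, shape (m,)
             theta = theta - (alpha / m) * (X.T @ (h - y))    # simultaneous update of all θ_j
             J_history.append((1.0 / (2 * m)) * np.sum((X @ theta - y) ** 2))
         return theta, J_history

     # usage: normalize, prepend the bias column, then plot J_history; if the
     # cost ever increases, α is probably too large
     # X_norm, mu, sigma = feature_normalize(X_raw)
     # X = np.c_[np.ones(len(y)), X_norm]
     # theta, J_history = gradient_descent_multi(X, y, alpha=0.1)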

Features and polynomial regression

  • I think of it as crossing features with themselves (e.g. x, x², x³) to create new features, which might be helpful (a small sketch follows this list).
  • Choosing good features is also important.
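
A small sketch of the polynomial-regression idea: build x, x^2, x^3 from a single feature and reuse the same linear-regression machinery (the helper name is mine; feature scaling matters here because the powers have very different ranges):

     import numpy as np

     def polynomial_features(x, degree=3):
         """Map a single feature x (1-D array) to columns [x, x^2, ..., x^degree]."""
         return np.column_stack([x ** d for d in range(1, degree + 1)])

     # e.g. predicting house price from size:
     # X_poly = polynomial_features(size, degree=3)   # columns: size, size^2, size^3
     # then normalize X_poly and run gradient descent as before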

Normal equation: a method to solve for θ analytically, θ = (X.T*X)^-1 * X.T * y (a pinv-based sketch follows the non-invertibility notes below).

  • No need to choose α
  • No need to iterate
  • Need to compute (X.T*X)^-1
  • Slow if n (the number of features) is very large, since computing (X.T*X)^-1 costs roughly O(n^3)

What if (X.T*X) is non-invertible (singular/degenerate)?

  • Redundant features (linearly dependent)
  • Too many features (e.g. m ≤ n, fewer training examples than features)
    • Remove some of them, or use regularization.
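
A sketch of the normal equation in NumPy (names are mine). If I recall correctly, the course recommends pinv over inv in Octave; the pseudo-inverse likewise copes with a singular X.T*X caused by redundant features:

     import numpy as np

     def normal_equation(X, y):
         """Solve θ = (X.T*X)^-1 * X.T * y; X must include the bias column of ones."""
         # pinv (pseudo-inverse) still gives a sensible answer when X.T @ X is
         # singular/degenerate, e.g. with linearly dependent features
         return np.linalg.pinv(X.T @ X) @ X.T @ y

     # usage (no feature scaling or α needed):
     # X = np.c_[np.ones(len(y)), X_raw]
     # theta = normal_equation(X, y)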

Lecture 05: Octave Tutorial

pass;

Programming Assignment: Linear Regression

Pretty interesting. It was my first time calling the included submit function from Octave to submit an assignment, which is a cool workflow rarely seen at Chinese universities.

According to the Coursera Honor Code, I may not share my solutions to homework, quizzes, or exams with anyone else unless explicitly permitted by the instructor. So the code here has been removed and the GitHub repository has been made private.
