Multilayer Perceptrons
Goal of Deep Learning
Deep neural networks combine compositionality (depth) with expansiveness (width) to learn representations.
- Expansiveness: neural networks can map a simple input data space to a high-dimensional feature space, $x \in \mathbb{R}^n \mapsto \phi(x) \in \mathbb{R}^m$, where typically $m \gg n$.
- Compositionality: neural networks compose maps to make information more accessible and explicit with increasing depth.
Universality
Linear Maps
A linear map can be parameterized simply by a weight matrix: $f(x) = Wx$ with $W \in \mathbb{R}^{m \times n}$.
However, linear maps are closed under composition: stacking two of them gives $W_2(W_1 x) = (W_2 W_1)x$, which is again a single linear map, so depth alone adds no expressive power.
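A minimal NumPy sketch of this closure property (the layer sizes and values here are arbitrary): two stacked weight matrices collapse into a single one.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear map:  R^3 -> R^4
W2 = rng.normal(size=(2, 4))   # second linear map: R^4 -> R^2
x = rng.normal(size=(3,))

# Applying the two maps in sequence ...
two_maps = W2 @ (W1 @ x)
# ... equals applying the single collapsed map (W2 W1).
one_map = (W2 @ W1) @ x
assert np.allclose(two_maps, one_map)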
Ridge Functions
Ridge functions introduce non-linearity by combining a linear map with an activation function: $f(x) = \sigma(w^\top x + b)$, where $\sigma$ is a non-linear activation function, $w$ a weight vector, and $b$ a bias.
Universal Approximation Theorem
Neural networks with one hidden layer and a non-polynomial activation are universal function approximators.
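As a small illustration of this statement (not part of the original code), a single logistic hidden layer with random weights can already fit a smooth 1-D target well; here the output weights are fit by least squares rather than gradient descent just to keep the sketch short, and the target function and layer width are arbitrary choices.

import numpy as np
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)          # 1-D inputs
y = np.sin(x)                               # target to approximate

m = 50                                      # hidden width (arbitrary)
w = rng.normal(size=(m, 1))                 # random hidden weights
b = rng.uniform(-2 * np.pi, 2 * np.pi, m)   # random hidden biases
hid = expit(w @ x[None, :] + b[:, None])    # hidden activations, [m, 200]

# Fit only the output weights by least squares.
beta, *_ = np.linalg.lstsq(hid.T, y, rcond=None)
y_hat = beta @ hid                          # network output, [200]
print("max abs error:", np.max(np.abs(y_hat - y)))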
Activation Functions
- Logistic function (sigmoid): $\sigma(z) = \frac{1}{1 + e^{-z}}$, which maps $\mathbb{R}$ to $(0, 1)$.
- Hyperbolic tangent: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, which maps $\mathbb{R}$ to $(-1, 1)$.
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$.
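A short NumPy sketch of these activations and their derivatives (the function names are my own); the derivative of the logistic function reappears in the backward pass below.

import numpy as np

def logistic(z):
    # sigma(z) = 1 / (1 + exp(-z)); scipy.special.expit is the robust version used below
    return 1.0 / (1.0 + np.exp(-z))

def logistic_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = logistic(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # tanh'(z) = 1 - tanh(z)^2; tanh itself is np.tanh
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0.0, z)

def relu_prime(z):
    # subgradient: 1 for z > 0, 0 otherwise
    return (z > 0).astype(float)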
Classical Multilayer Perceptron
A classical single-hidden-layer MLP can be represented by $f(x) = \beta\,\sigma(\Theta x)$, where $\Theta \in \mathbb{R}^{m \times n}$ holds the hidden-layer weights, $\beta \in \mathbb{R}^{1 \times m}$ the output weights, and the activation $\sigma$ is applied element-wise (bias terms are omitted, matching the code below).
MLP Derivatives
Using the square loss $\ell = \frac{1}{2}\,(f(x) - y)^2$, the derivatives with respect to the parameters are given by
$$\frac{\partial \ell}{\partial \beta} = (f(x) - y)\,\sigma(\Theta x)^\top, \qquad \frac{\partial \ell}{\partial \Theta} = (f(x) - y)\,\bigl(\beta^\top \odot \sigma'(\Theta x)\bigr)\,x^\top,$$
where $\odot$ is the element-wise product and $\sigma'$ is the derivative of the activation (for the logistic function, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$).
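These expressions can be verified numerically. A minimal finite-difference check, assuming a logistic hidden layer and a single training pair (all sizes and values are illustrative):

import numpy as np
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)
n, m = 3, 4
theta = rng.normal(size=(m, n))
beta = rng.normal(size=(1, m))
x = rng.normal(size=(n, 1))
y = 0.7

def loss(theta, beta):
    hid = expit(theta @ x)        # [m, 1]
    out = float(beta @ hid)       # scalar prediction
    return 0.5 * (out - y) ** 2

# Analytic gradients from the formulas above.
hid = expit(theta @ x)
delta = float(beta @ hid) - y
g_beta = delta * hid.T                              # [1, m]
g_theta = (delta * beta.T * hid * (1 - hid)) @ x.T  # [m, n]

# Central finite differences for one entry of each gradient.
eps = 1e-6
e_t = np.zeros_like(theta); e_t[0, 0] = eps
e_b = np.zeros_like(beta); e_b[0, 0] = eps
print(g_theta[0, 0], (loss(theta + e_t, beta) - loss(theta - e_t, beta)) / (2 * eps))
print(g_beta[0, 0], (loss(theta, beta + e_b) - loss(theta, beta - e_b)) / (2 * eps))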
Stochastic Gradient Descent
- Sample a random minibatch $\{(x_i, y_i)\}_{i=1}^{s}$ from the training data at every step $k$.
- Calculate all partial derivatives $\frac{\partial \ell_i}{\partial \beta}$ and $\frac{\partial \ell_i}{\partial \Theta}$ for $i = 1, \dots, s$.
- Perform the update step with step size $\eta$:
  $\beta \leftarrow \beta - \frac{\eta}{s} \sum_{i=1}^{s} \frac{\partial \ell_i}{\partial \beta}, \qquad \Theta \leftarrow \Theta - \frac{\eta}{s} \sum_{i=1}^{s} \frac{\partial \ell_i}{\partial \Theta}$
Python Code
import numpy as np
from scipy.special import expit  # logistic function

class MLP3():
    def __init__(self, n, m):
        # n is the number of input dimensions
        # m is the number of hidden units (the output is 1-dimensional)
        self.theta = np.random.uniform(-1/n, 1/n, (m, n))
        self.beta = np.random.uniform(-1/m, 1/m, (1, m))
        self.n = n
        self.m = m

    def forward(self, x):
        # x is of size [n, s]
        f = {}
        net_in = np.dot(self.theta, x)          # [m, n] * [n, s] -> [m, s]
        f['hid'] = expit(net_in)                # [m, s]
        f['out'] = np.dot(self.beta, f['hid'])  # [1, m] * [m, s] -> [1, s]
        return f

    def gradient(self, f, x, y):
        g = {}
        delta = f['out'] - y                    # [1, s]
        hid = f['hid']
        g['beta'] = np.dot(delta, hid.T)
        # sum over s: [1, s] * [s, m] -> [1, m]
        hid_sv = hid * (1 - hid)                # [m, s], derivative of the logistic function
        outer = np.outer(self.beta.T, delta)    # [m, 1] * [1, s] -> [m, s]
        g['theta'] = np.dot(outer * hid_sv, x.T)
        # [m, s] * [s, n] -> [m, n]
        return g

def train(dataset, n, m, s, steps, eta):
    # s is the batch size
    model = MLP3(n, m)
    for k in range(steps):
        x, y = dataset.sample(s)
        f = model.forward(x)
        g = model.gradient(f, x, y)
        model.beta = model.beta - (eta / s) * g['beta']
        model.theta = model.theta - (eta / s) * g['theta']
    return model
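A hedged usage sketch: train expects a dataset object with a sample(s) method, which is not defined in these notes, so the toy regression dataset below is one possible stand-in with made-up values.

class ToyDataset:
    # Minimal stand-in for the interface assumed by train():
    # sample(s) returns inputs of shape [n, s] and targets of shape [1, s].
    def __init__(self, n):
        self.n = n
        self.w = np.random.uniform(-1, 1, (1, n))   # ground-truth weights

    def sample(self, s):
        x = np.random.uniform(-1, 1, (self.n, s))
        y = np.tanh(self.w @ x) + 0.01 * np.random.randn(1, s)
        return x, y

dataset = ToyDataset(n=5)
model = train(dataset, n=5, m=20, s=32, steps=5000, eta=0.1)
x_test, y_test = dataset.sample(256)
print("test MSE:", np.mean((model.forward(x_test)['out'] - y_test) ** 2))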