Multilayer Perceptrons

Goal of Deep Learning

Deep neural networks combine compositionality (depth) with expansiveness (width) to learn representations.

$$H : \mathbb{R}^n \to \mathbb{R}^p,$$

where $p \gg n$.

Universality

Linear Maps

A linear map can be simply parameterized by a weight matrix:

$$F : \mathbb{R}^n \to \mathbb{R}^m, \qquad F(x; \Theta) = \Theta x.$$

However, linear maps are closed under composition,

$$(F \circ G)(x) = F(G(x)) = \Theta_F \Theta_G x = \Theta x,$$

so stacking linear layers only yields another linear map and adds no expressive power.
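
As a quick numerical sanity check of this closure property, the sketch below (with arbitrary, illustrative shapes) verifies that applying two weight matrices in sequence equals applying their product:

import numpy as np

n, k, m = 4, 3, 2
theta_G = np.random.randn(k, n)  # G: R^n -> R^k
theta_F = np.random.randn(m, k)  # F: R^k -> R^m
x = np.random.randn(n)

composed = np.dot(theta_F, np.dot(theta_G, x))   # (F o G)(x)
collapsed = np.dot(np.dot(theta_F, theta_G), x)  # Theta x with Theta = Theta_F Theta_G
assert np.allclose(composed, collapsed)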

Ridge Functions

Ridge functions introduce non-linearity by combining a linear map with an activation function:

$$H = \Phi \circ F, \qquad H(x; \Theta) = \Phi(\Theta x),$$

where $\Phi$ applies a scalar non-linearity $\phi$ elementwise:

$$\Phi(z) = \begin{pmatrix} \phi(z_1) \\ \vdots \\ \phi(z_m) \end{pmatrix}.$$
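
For concreteness, here is a minimal sketch of a ridge function with the logistic non-linearity (the shapes below are illustrative):

import numpy as np
from scipy.special import expit  # logistic function phi(z) = 1 / (1 + exp(-z))

n, m = 4, 3
theta = np.random.randn(m, n)
x = np.random.randn(n)

z = np.dot(theta, x)  # linear map F(x) = Theta x, shape [m]
h = expit(z)          # Phi applies phi elementwise, shape [m]
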
Universal Approximation Theorem

Neural networks with a single hidden layer and a non-polynomial activation function are universal function approximators: with enough hidden units they can approximate any continuous function on a compact domain to arbitrary accuracy.

Activation Functions

Classical Multilayer Perceptron

A classical single-hidden-layer MLP can be represented by

$$\psi(x; \beta, \Theta) = \beta^\top \sigma(\Theta x) = \sum_{j=1}^{m} \frac{\beta_j}{1 + \exp[-\theta_j^\top x]},$$

where $m$ is the number of hidden units, $\theta_j^\top$ is the $j$-th row of $\Theta$, and $\sigma$ applies the logistic function elementwise.
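
Evaluating this expression directly mirrors the formula (a sketch; the shapes and values are arbitrary):

import numpy as np
from scipy.special import expit

n, m = 4, 8
theta = np.random.randn(m, n)  # rows are theta_j
beta = np.random.randn(m)
x = np.random.randn(n)

psi = np.dot(beta, expit(np.dot(theta, x)))  # beta^T sigma(Theta x), a scalar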

MLP Derivatives

Using the square loss

$$\ell(x, y; \beta, \Theta) = \frac{1}{2}\big(\psi(x) - y\big)^2,$$

the derivatives with respect to the parameters are given by

$$\frac{\partial \ell}{\partial \beta_j} = \frac{\psi(x) - y}{1 + \exp[-\theta_j^\top x]}, \qquad \frac{\partial \ell}{\partial \theta_{ji}} = \frac{\psi(x) - y}{1 + \exp[-\theta_j^\top x]} \cdot \frac{\beta_j\, x_i}{1 + \exp[\theta_j^\top x]}.$$
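
These expressions can be sanity-checked against a finite-difference approximation (a sketch; the helper functions psi and loss below are illustrative, not part of the notes):

import numpy as np
from scipy.special import expit

def psi(x, beta, theta):
	return np.dot(beta, expit(np.dot(theta, x)))

def loss(x, y, beta, theta):
	return 0.5 * (psi(x, beta, theta) - y) ** 2

n, m = 4, 3
theta = np.random.randn(m, n)
beta = np.random.randn(m)
x, y = np.random.randn(n), 0.7

delta = psi(x, beta, theta) - y
hid = expit(np.dot(theta, x))                          # sigma(theta_j^T x)
g_beta = delta * hid                                   # dl/dbeta_j
g_theta = delta * np.outer(beta * hid * (1 - hid), x)  # dl/dtheta_ji

# compare one entry of g_theta with a forward difference
eps = 1e-6
theta_p = theta.copy()
theta_p[0, 0] += eps
approx = (loss(x, y, beta, theta_p) - loss(x, y, beta, theta)) / eps
assert np.isclose(approx, g_theta[0, 0], atol=1e-4)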

Stochastic Gradient Descent

  1. Sample a random minibatch $S_t$ from the training data $S$ at every step $t$.
  2. Calculate all partial derivatives $\frac{\partial \ell(S_t)}{\partial \vartheta} = \sum_{(x,y) \in S_t} \frac{\partial \ell(x,y)}{\partial \vartheta}$ for $\vartheta \in \{\beta_j, \theta_{ji}\}$.
  3. Perform an update step with step size $\eta > 0$:
$$\vartheta_{t+1} \leftarrow \vartheta_t - \eta \, \frac{\partial \ell(S_t)}{\partial \vartheta}.$$

Python Code

import numpy as np
from scipy.special import expit # logistic function

class MLP3():
	def __init__(self, n, m):
		# n is the number of input dimensions
		# m is the number of hidden units (the output is scalar)
		self.theta = np.random.uniform(-1/n, 1/n, (m, n))
		self.beta = np.random.uniform(-1/m, 1/m, (1, m))
		self.n = n
		self.m = m

	def forward(self, x):
		# x is of size [n, s]
		f = {}
		net_in = np.dot(self.theta, x) # [m, n] * [n, s] -> [m, s]
		f['hid'] = expit(net_in)  # [m, s]
		f['out'] = np.dot(self.beta, f['hid']) # [1, m] * [m, s] -> [1, s]
		return f

	def gradient(self, f, x, y):
		g = {}
		delta = f['out'] - y  # [1, s]
		hid = f['hid']  
		g['beta'] = np.dot(delta, hid.T) 
		# sum over s: [1, s] * [s, m] -> [1, m]

		hid_sv = hid * (1 - hid)  # [m, s]
		outer = np.outer(self.beta.T, delta)  # [m, 1] * [1, s] -> [m, s]
		g['theta'] = np.dot(outer * hid_sv, x.T)
		# [m, s] * [s, n] -> [m, n]

		return g

def train(dataset, n, m, s, steps, eta):
	# s is the batch size
	model = MLP3(n, m)
	for k in range(steps):
		x, y = dataset.sample(s)
		f = model.forward(x)
		g = model.gradient(f, x, y)
		model.beta = model.beta - (eta / s) * g['beta']
		model.theta = model.theta - (eta / s) * g['theta']
	return model
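
A minimal usage sketch, assuming a toy regression dataset that exposes the sample interface used by train (the ToyDataset class is illustrative, not part of the original code):

class ToyDataset:
	def __init__(self, n):
		self.n = n
		self.w = np.random.randn(1, n)  # ground-truth weights of a noisy linear target

	def sample(self, s):
		x = np.random.randn(self.n, s)                        # inputs of shape [n, s]
		y = np.dot(self.w, x) + 0.01 * np.random.randn(1, s)  # targets of shape [1, s]
		return x, y

dataset = ToyDataset(n=5)
model = train(dataset, n=5, m=16, s=32, steps=2000, eta=0.1)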