Multilayer Perceptrons

Goal of Deep Learning

Deep neural networks combine compositionality (depth) with expansiveness (width) to learn representations.

$$H : \mathbb{R}^n \to \mathbb{R}^p,$$

where $p \gg n$.

Universality

Linear Maps

A linear map can be simply parameterized by a weight matrix:

$$F : \mathbb{R}^n \to \mathbb{R}^m, \qquad F(x; \Theta) = \Theta x.$$

However, linear maps are closed under composition,

$$(F \circ G)(x) = F(G(x)) = \Theta_F \Theta_G x = \Theta x,$$

so stacking linear layers only yields another linear map and adds no expressive power.
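
As a quick numerical sanity check of this closure property, the sketch below (with arbitrary, illustrative shapes) verifies that applying two weight matrices in sequence equals applying their product:

import numpy as np

n, k, m = 4, 3, 2
theta_G = np.random.randn(k, n)  # G: R^n -> R^k
theta_F = np.random.randn(m, k)  # F: R^k -> R^m
x = np.random.randn(n)

composed = np.dot(theta_F, np.dot(theta_G, x))   # (F o G)(x)
collapsed = np.dot(np.dot(theta_F, theta_G), x)  # Theta x with Theta = Theta_F Theta_G
assert np.allclose(composed, collapsed)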

Ridge Functions

Ridge functions introduce non-linearity by combining a linear map with an activation function:

$$H = \Phi \circ F, \qquad H(x; \Theta) = \Phi(\Theta x),$$

where $\Phi$ applies a scalar non-linearity $\phi$ elementwise:

$$\Phi(z) = \begin{pmatrix} \phi(z_1) \\ \vdots \\ \phi(z_m) \end{pmatrix}.$$
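
For concreteness, here is a minimal sketch of a ridge function with the logistic non-linearity (the shapes below are illustrative):

import numpy as np
from scipy.special import expit  # logistic function phi(z) = 1 / (1 + exp(-z))

n, m = 4, 3
theta = np.random.randn(m, n)
x = np.random.randn(n)

z = np.dot(theta, x)  # linear map F(x) = Theta x, shape [m]
h = expit(z)          # Phi applies phi elementwise, shape [m]
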
Universal Approximation Theorem

Neural networks with a single hidden layer and a non-polynomial activation function are universal function approximators: with enough hidden units they can approximate any continuous function on a compact domain to arbitrary accuracy.

Activation Functions

Classical Multilayer Perceptron

A classical single-hidden-layer MLP can be represented by

$$\psi(x; \beta, \Theta) = \beta^\top \sigma(\Theta x) = \sum_{j=1}^{m} \frac{\beta_j}{1 + \exp[-\theta_j^\top x]},$$

where $m$ is the number of hidden units, $\theta_j^\top$ is the $j$-th row of $\Theta$, and $\sigma$ applies the logistic function elementwise.
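
Evaluating this expression directly mirrors the formula (a sketch; the shapes and values are arbitrary):

import numpy as np
from scipy.special import expit

n, m = 4, 8
theta = np.random.randn(m, n)  # rows are theta_j
beta = np.random.randn(m)
x = np.random.randn(n)

psi = np.dot(beta, expit(np.dot(theta, x)))  # beta^T sigma(Theta x), a scalar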

MLP Derivatives

Using the square loss

$$\ell(x, y; \beta, \Theta) = \frac{1}{2}\big(\psi(x) - y\big)^2,$$

the derivatives with respect to the parameters are given by

$$\frac{\partial \ell}{\partial \beta_j} = \frac{\psi(x) - y}{1 + \exp[-\theta_j^\top x]}, \qquad \frac{\partial \ell}{\partial \theta_{ji}} = \frac{\psi(x) - y}{1 + \exp[-\theta_j^\top x]} \cdot \frac{\beta_j\, x_i}{1 + \exp[\theta_j^\top x]}.$$
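
These expressions can be sanity-checked against a finite-difference approximation (a sketch; the helper functions psi and loss below are illustrative, not part of the notes):

import numpy as np
from scipy.special import expit

def psi(x, beta, theta):
	return np.dot(beta, expit(np.dot(theta, x)))

def loss(x, y, beta, theta):
	return 0.5 * (psi(x, beta, theta) - y) ** 2

n, m = 4, 3
theta = np.random.randn(m, n)
beta = np.random.randn(m)
x, y = np.random.randn(n), 0.7

delta = psi(x, beta, theta) - y
hid = expit(np.dot(theta, x))                          # sigma(theta_j^T x)
g_beta = delta * hid                                   # dl/dbeta_j
g_theta = delta * np.outer(beta * hid * (1 - hid), x)  # dl/dtheta_ji

# compare one entry of g_theta with a forward difference
eps = 1e-6
theta_p = theta.copy()
theta_p[0, 0] += eps
approx = (loss(x, y, beta, theta_p) - loss(x, y, beta, theta)) / eps
assert np.isclose(approx, g_theta[0, 0], atol=1e-4)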

Stochastic Gradient Descent

  1. Sample a random minibatch $S_t$ from the training data $S$ at every step $t$.
  2. Calculate all partial derivatives $\frac{\partial \ell(S_t)}{\partial \vartheta} = \sum_{(x,y) \in S_t} \frac{\partial \ell(x,y)}{\partial \vartheta}$ for $\vartheta \in \{\beta_j, \theta_{ji}\}$.
  3. Perform an update step with step size $\eta > 0$:
$$\vartheta_{t+1} \leftarrow \vartheta_t - \eta \, \frac{\partial \ell(S_t)}{\partial \vartheta}.$$

Python Code

import numpy as np
from scipy.special import expit # logistic function

class MLP3():
	def __init__(self, n, m):
		# n is the number of input dimensions
		# m is the number of hidden units (the output is scalar)
		self.theta = np.random.uniform(-1/n, 1/n, (m, n))
		self.beta = np.random.uniform(-1/m, 1/m, (1, m))
		self.n = n
		self.m = m

	def forward(self, x):
		# x is of size [n, s]
		f = {}
		net_in = np.dot(self.theta, x) # [m, n] * [n, s] -> [m, s]
		f['hid'] = expit(net_in)  # [m, s]
		f['out'] = np.dot(self.beta, f['hid']) # [1, m] * [m, s] -> [1, s]
		return f

	def gradient(self, f, x, y):
		g = {}
		delta = f['out'] - y  # [1, s]
		hid = f['hid']  
		g['beta'] = np.dot(delta, hid.T) 
		# sum over s: [1, s] * [s, m] -> [1, m]

		hid_sv = hid * (1 - hid)  # [m, s]
		outer = np.outer(self.beta.T, delta)  # [m, 1] * [1, s] -> [m, s]
		g['theta'] = np.dot(outer * hid_sv, x.T)
		# [m, s] * [s, n] -> [m, n]

		return g

def train(dataset, n, m, s, steps, eta):
	# s is the batch size
	model = MLP3(n, m)
	for k in range(steps):
		x, y = dataset.sample(s)
		f = model.forward(x)
		g = model.gradient(f, x, y)
		model.beta = model.beta - (eta / s) * g['beta']
		model.theta = model.theta - (eta / s) * g['theta']
	return model
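
A minimal usage sketch, assuming a toy regression dataset that exposes the sample interface used by train (the ToyDataset class is illustrative, not part of the original code):

class ToyDataset:
	def __init__(self, n):
		self.n = n
		self.w = np.random.randn(1, n)  # ground-truth weights of a noisy linear target

	def sample(self, s):
		x = np.random.randn(self.n, s)                        # inputs of shape [n, s]
		y = np.dot(self.w, x) + 0.01 * np.random.randn(1, s)  # targets of shape [1, s]
		return x, y

dataset = ToyDataset(n=5)
model = train(dataset, n=5, m=16, s=32, steps=2000, eta=0.1)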