Mathematical Awakening: Connecting the Equations of Nature and Intelligence · Chapter 4 · 10 min read · code · math

Chapter 4: Multivariable Calculus & Gradients


Why This Chapter Matters

In Chapters 1-3, we explored calculus for functions with one input and one output — like temperature changing with time, or position changing with time. But the real world is far more complex!

Consider these scenarios:

  • Weather: Temperature depends on both your location (latitude, longitude) and time
  • Machine Learning: Your model's performance depends on thousands of parameters simultaneously
  • Physics: The electric field depends on your position in three-dimensional space
  • Medicine: Drug effectiveness depends on dosage, patient weight, age, genetics, and more

When we have multiple inputs affecting an output, we need multivariable calculus. This chapter teaches you how to understand and optimize systems where many things are changing at once — the foundation of modern machine learning, physics simulations, and engineering optimization.

What you'll master:

  • How to measure sensitivity when multiple factors are changing
  • How to find the steepest direction to climb a mountain (or minimize a loss function)
  • How gradient descent powers machine learning
  • How force fields work in physics
  • How to optimize complex systems with many variables

Functions of Multiple Variables: The Real World is Multi-Dimensional

🌡️ Temperature Example: Why One Variable Isn't Enough

Imagine you're a meteorologist trying to predict temperature. In our previous single-variable world, you might have said:

T(t) = 20 + 5\sin(t)

"Temperature depends only on time of day." But that's obviously incomplete! Temperature also depends on:

  • Location: It's colder at the North Pole than in Hawaii
  • Elevation: It's colder on top of a mountain
  • Season: January vs July makes a huge difference

So really, temperature is a function of multiple variables:

T(x, y, z, t) = \text{Temperature at position } (x, y, z) \text{ and time } t

📐 Mathematical Representation

A multivariable function takes multiple inputs and produces an output:

f(x, y, z, \ldots) = \text{some expression involving } x, y, z, \ldots

Examples:

Simple quadratic: f(x, y) = x^2 + y^2

  • Takes two inputs (x, y)
  • Outputs one number
  • Geometrically, this describes a paraboloid (like a bowl)

Distance function: d(x, y) = \sqrt{x^2 + y^2}

  • Distance from origin to point (x, y)
  • Always positive
  • Creates concentric circles of constant distance

Machine learning loss: J(\theta_1, \theta_2, \ldots, \theta_n) = \frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i)^2

  • Takes model parameters \theta_1, \theta_2, \ldots, \theta_n as inputs
  • Outputs how "wrong" the model is
  • We want to minimize this function
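The mean-squared-error loss above is easy to evaluate directly. A minimal sketch, using made-up predictions and targets (the arrays are illustrative, not from a real model):

```python
import numpy as np

# Hypothetical predictions and true values (illustrative data)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
y = np.array([3.0, -0.5, 2.0, 7.0])

# Mean squared error: J = (1/m) * sum((y_hat - y)^2)
m = len(y)
J = np.sum((y_hat - y) ** 2) / m
print(J)  # 0.375
```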

🎯 Visualizing Multivariable Functions

For functions of two variables, we can visualize them as 3D surfaces:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create a grid of x and y values
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)

# Define different functions
Z1 = X**2 + Y**2  # Paraboloid (bowl shape)
Z2 = np.sin(X) * np.cos(Y)  # Wavy surface
Z3 = X**2 - Y**2  # Saddle point

# Create subplot with three surfaces
fig = plt.figure(figsize=(15, 5))

# Paraboloid
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z1, cmap='viridis', alpha=0.7)
ax1.set_title('f(x,y) = x² + y²\n(Paraboloid)')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')

# Wavy surface
ax2 = fig.add_subplot(132, projection='3d')
ax2.plot_surface(X, Y, Z2, cmap='plasma', alpha=0.7)
ax2.set_title('f(x,y) = sin(x)cos(y)\n(Wavy Surface)')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')

# Saddle point
ax3 = fig.add_subplot(133, projection='3d')
ax3.plot_surface(X, Y, Z3, cmap='coolwarm', alpha=0.7)
ax3.set_title('f(x,y) = x² - y²\n(Saddle Point)')
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.set_zlabel('f(x,y)')

plt.tight_layout()
plt.show()

🔍 Understanding the Shapes

Paraboloid (f(x,y) = x^2 + y^2):

  • Bowl shape - has a clear minimum at (0, 0)
  • As you move away from center in any direction, the function value increases
  • This is like a "loss function" in ML - we want to find the bottom!

Saddle Point (f(x,y) = x^2 - y^2):

  • Horse saddle shape - goes up in the x-direction, down in the y-direction
  • At (0, 0), it's a minimum in one direction but a maximum in another
  • These are critical points that are neither minima nor maxima

Wavy Surface (f(x,y) = \sin(x)\cos(y)):

  • Complex landscape with many hills and valleys
  • Shows how functions can have multiple local minima and maxima
  • Common in real-world optimization problems

🎯 Why This Matters for Applications

Machine Learning: Your loss function might depend on thousands of parameters. Understanding the "shape" of this high-dimensional landscape helps you:

  • Find good minima (train better models)
  • Avoid getting stuck in bad local minima
  • Choose appropriate optimization algorithms

Physics: Force fields, electric fields, gravitational fields - all depend on position in 3D space

Engineering: Optimizing designs often involves many variables simultaneously - material properties, dimensions, costs, performance metrics

Medicine: Drug interactions depend on multiple factors - dosages of different medications, patient characteristics, timing


Partial Derivatives: Measuring Change While Holding Things Constant

🏔️ The Mountain Hiking Analogy

Imagine you're standing on a mountainside. The elevation depends on both your east-west position (x) and your north-south position (y):

h(x, y) = \text{elevation at position } (x, y)

Now, suppose you want to know: "If I take a small step east, how much will my elevation change?"

To answer this, you need to:

  1. Hold your north-south position fixed (don't move north or south)
  2. Take a tiny step east and see how elevation changes
  3. Measure the rate of change in that direction only

This is exactly what a partial derivative does!

🧮 Mathematical Definition

The partial derivative of f(x, y) with respect to x is:

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}

Key insight: Notice that y stays the same in both f(x+h, y) and f(x, y). We're only varying x.

🎯 Intuitive Understanding

Partial derivative with respect to x: \frac{\partial f}{\partial x}

  • "How fast does f change as I increase x, while keeping y fixed?"
  • It's like taking a regular derivative, but treating y as a constant

Partial derivative with respect to y: \frac{\partial f}{\partial y}

  • "How fast does f change as I increase y, while keeping x fixed?"
  • Treat x as a constant and take the derivative with respect to y

🧪 Step-by-Step Example

Let's compute partial derivatives for: f(x, y) = x^2 y + 3xy^2

Finding \frac{\partial f}{\partial x}:

Step 1: Treat y as a constant (like the number 5 or \pi)

Step 2: Differentiate with respect to x:

  • \frac{\partial}{\partial x}[x^2 y] = y \cdot \frac{\partial}{\partial x}[x^2] = y \cdot 2x = 2xy
  • \frac{\partial}{\partial x}[3xy^2] = 3y^2 \cdot \frac{\partial}{\partial x}[x] = 3y^2 \cdot 1 = 3y^2

Result: \frac{\partial f}{\partial x} = 2xy + 3y^2

Finding \frac{\partial f}{\partial y}:

Step 1: Treat x as a constant

Step 2: Differentiate with respect to y:

  • \frac{\partial}{\partial y}[x^2 y] = x^2 \cdot \frac{\partial}{\partial y}[y] = x^2 \cdot 1 = x^2
  • \frac{\partial}{\partial y}[3xy^2] = 3x \cdot \frac{\partial}{\partial y}[y^2] = 3x \cdot 2y = 6xy

Result: \frac{\partial f}{\partial y} = x^2 + 6xy
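Hand-computed partials like these can always be sanity-checked against central finite differences. A small sketch (the test point is arbitrary):

```python
# Numerically verify the hand-computed partials of f(x,y) = x²y + 3xy²
# using central finite differences.
def f(x, y):
    return x**2 * y + 3 * x * y**2

def df_dx(x, y):   # hand-derived: 2xy + 3y²
    return 2 * x * y + 3 * y**2

def df_dy(x, y):   # hand-derived: x² + 6xy
    return x**2 + 6 * x * y

h = 1e-6
x0, y0 = 1.3, -0.7

# Vary one variable at a time, holding the other fixed
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(abs(num_dx - df_dx(x0, y0)) < 1e-6)  # True
print(abs(num_dy - df_dy(x0, y0)) < 1e-6)  # True
```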

🎮 Interactive Understanding

Let's visualize how partial derivatives work:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define our function f(x,y) = x²y + 3xy²
def f(x, y):
    return x**2 * y + 3 * x * y**2

# Define partial derivatives
def df_dx(x, y):
    return 2*x*y + 3*y**2

def df_dy(x, y):
    return x**2 + 6*x*y

# Create a grid
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Pick a specific point
x0, y0 = 1, 0.5
z0 = f(x0, y0)

# Create the visualization
fig = plt.figure(figsize=(15, 5))

# Main 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, alpha=0.6, cmap='viridis')
ax1.scatter([x0], [y0], [z0], color='red', s=100, label=f'Point ({x0}, {y0})')
ax1.set_title('f(x,y) = x²y + 3xy²')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')

# Slice holding y constant (showing ∂f/∂x)
ax2 = fig.add_subplot(132)
x_slice = np.linspace(-2, 2, 100)
y_fixed = y0
z_slice = f(x_slice, y_fixed)
ax2.plot(x_slice, z_slice, 'b-', linewidth=2, label=f'f(x, {y_fixed})')
ax2.scatter([x0], [z0], color='red', s=100, zorder=5)

# Draw tangent line at the point
slope_x = df_dx(x0, y0)
tangent_x = z0 + slope_x * (x_slice - x0)
ax2.plot(x_slice, tangent_x, 'r--', alpha=0.7,
         label=f'Tangent (slope = ∂f/∂x = {slope_x:.2f})')

ax2.set_title(f'Cross-section: y = {y_fixed} (constant)')
ax2.set_xlabel('x')
ax2.set_ylabel('f(x,y)')
ax2.legend()
ax2.grid(True)

# Slice holding x constant (showing ∂f/∂y)
ax3 = fig.add_subplot(133)
y_slice = np.linspace(-2, 2, 100)
x_fixed = x0
z_slice = f(x_fixed, y_slice)
ax3.plot(y_slice, z_slice, 'g-', linewidth=2, label=f'f({x_fixed}, y)')
ax3.scatter([y0], [z0], color='red', s=100, zorder=5)

# Draw tangent line at the point
slope_y = df_dy(x0, y0)
tangent_y = z0 + slope_y * (y_slice - y0)
ax3.plot(y_slice, tangent_y, 'r--', alpha=0.7,
         label=f'Tangent (slope = ∂f/∂y = {slope_y:.2f})')

ax3.set_title(f'Cross-section: x = {x_fixed} (constant)')
ax3.set_xlabel('y')
ax3.set_ylabel('f(x,y)')
ax3.legend()
ax3.grid(True)

plt.tight_layout()
plt.show()

print(f"At point ({x0}, {y0}):")
print(f"∂f/∂x = {df_dx(x0, y0)} (slope in x-direction)")
print(f"∂f/∂y = {df_dy(x0, y0)} (slope in y-direction)")

🧠 Conceptual Insight

Why partial derivatives matter:

  1. Sensitivity analysis: Which variables have the biggest impact on your function?
  2. Optimization: Which direction should you move to increase/decrease the function?
  3. Approximation: How does the function behave near a specific point?

🔬 Real-World Applications

Machine Learning:

  • If J(\theta_1, \theta_2) is your loss function, then:
    • \frac{\partial J}{\partial \theta_1} tells you how to adjust parameter \theta_1
    • \frac{\partial J}{\partial \theta_2} tells you how to adjust parameter \theta_2

Physics:

  • Electric field: \mathbf{E} = -\nabla V, where V(x, y, z) is the electric potential
  • Each component E_x = -\frac{\partial V}{\partial x} gives the field in that direction

Economics:

  • Production function P(L, K) depends on labor and capital
  • \frac{\partial P}{\partial L} = marginal productivity of labor
  • \frac{\partial P}{\partial K} = marginal productivity of capital

Medicine:

  • Drug effectiveness E(d_1, d_2, w, a) depends on the two doses, patient weight, and age
  • \frac{\partial E}{\partial d_1} shows how sensitive effectiveness is to the first drug's dosage

The Gradient: The "Steepest Uphill" Vector

🧭 The Mountain Climbing Analogy

You're standing on a mountainside in dense fog. You can't see very far, but you have a magical compass that always points in the direction you should walk to climb upward as quickly as possible.

This magical compass is the gradient!

Here's what it tells you:

  1. Direction: Which way to face to climb most steeply upward

  2. Magnitude: How steep the climb is in that direction

    • Large gradient = very steep terrain
    • Small gradient = gentle slope
    • Zero gradient = you're at a flat point (a peak, a valley, or a saddle)

📊 Mathematical Definition

The gradient of a function is the vector of all its partial derivatives.

For 2D functions: \nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)

For 3D functions: \nabla f(x, y, z) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)

🎯 Why Gradients Point "Uphill"

Let's understand this intuitively. Suppose you're at point (x_0, y_0) and you want to move a small distance in the direction (\cos\theta, \sin\theta).

The directional derivative (rate of change in that direction) is:

D_\theta f = \frac{\partial f}{\partial x}\cos\theta + \frac{\partial f}{\partial y}\sin\theta = \nabla f \cdot (\cos\theta, \sin\theta)

This is the dot product of the gradient with your direction vector!

Key insight: The dot product is maximized when the two vectors point in the same direction. So \nabla f points in the direction of maximum increase.
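This claim can be checked numerically: sample many unit directions for f(x, y) = x^2 + y^2 at a test point and see which one gives the largest directional derivative. A small sketch (the test point is chosen arbitrarily):

```python
import numpy as np

# Directional derivative of f(x,y) = x² + y² at a point, for many angles.
x0, y0 = 1.0, 2.0
grad = np.array([2 * x0, 2 * y0])            # ∇f = (2x, 2y)

thetas = np.linspace(0, 2 * np.pi, 3600)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # unit vectors
D = dirs @ grad                              # D_θ f = ∇f · (cosθ, sinθ)

best = dirs[np.argmax(D)]                    # direction of largest increase
grad_unit = grad / np.linalg.norm(grad)
print(np.allclose(best, grad_unit, atol=1e-2))  # True: best ≈ ∇f / |∇f|
```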

🧪 Step-by-Step Example

Let's compute the gradient of f(x, y) = x^2 + y^2:

Step 1: Find partial derivatives

  • \frac{\partial f}{\partial x} = 2x
  • \frac{\partial f}{\partial y} = 2y

Step 2: Combine into the gradient vector: \nabla f = (2x, 2y)

Step 3: Interpret at specific points

  • At (1, 1): \nabla f = (2, 2) → points in the (1, 1) direction, away from the origin
  • At (-1, 2): \nabla f = (-2, 4) → points in the (-1, 2) direction, again away from the origin
  • At (0, 0): \nabla f = (0, 0) → no preferred direction (we're at the minimum!)

🎨 Visualizing Gradient Fields

Let's create beautiful visualizations to understand gradients:

import numpy as np
import matplotlib.pyplot as plt

# Create a grid of points
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)

# Function: f(x,y) = x² + y²
Z = X**2 + Y**2

# Gradient components
dX = 2 * X  # ∂f/∂x = 2x
dY = 2 * Y  # ∂f/∂y = 2y

# Create the visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Contour plot with gradient vectors
ax1 = axes[0, 0]
contour = ax1.contour(X, Y, Z, levels=10, colors='gray', alpha=0.5)
ax1.clabel(contour, inline=True, fontsize=8)
ax1.quiver(X, Y, dX, dY, color='red', alpha=0.8, scale=50)
ax1.set_title('Gradients on Contour Plot\nf(x,y) = x² + y²')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.grid(True, alpha=0.3)
ax1.set_aspect('equal')

# 2. Gradient magnitude
ax2 = axes[0, 1]
magnitude = np.sqrt(dX**2 + dY**2)
im = ax2.imshow(magnitude, extent=[-3, 3, -3, 3], origin='lower', cmap='hot')
ax2.contour(X, Y, magnitude, colors='white', alpha=0.5)
plt.colorbar(im, ax=ax2, label='|∇f|')
ax2.set_title('Gradient Magnitude\n|∇f| = √((2x)² + (2y)²)')
ax2.set_xlabel('x')
ax2.set_ylabel('y')

# 3. Different function: f(x,y) = x² - y² (saddle point)
Z2 = X**2 - Y**2
dX2 = 2 * X
dY2 = -2 * Y

ax3 = axes[1, 0]
contour2 = ax3.contour(X, Y, Z2, levels=15, colors='gray', alpha=0.5)
ax3.quiver(X, Y, dX2, dY2, color='blue', alpha=0.8, scale=50)
ax3.set_title('Saddle Point Function\nf(x,y) = x² - y²')
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.grid(True, alpha=0.3)
ax3.set_aspect('equal')

# 4. Wavy function: f(x,y) = sin(x)cos(y)
Z3 = np.sin(X) * np.cos(Y)
dX3 = np.cos(X) * np.cos(Y)
dY3 = -np.sin(X) * np.sin(Y)

ax4 = axes[1, 1]
contour3 = ax4.contour(X, Y, Z3, levels=10, colors='gray', alpha=0.5)
ax4.quiver(X, Y, dX3, dY3, color='green', alpha=0.8, scale=20)
ax4.set_title('Complex Landscape\nf(x,y) = sin(x)cos(y)')
ax4.set_xlabel('x')
ax4.set_ylabel('y')
ax4.grid(True, alpha=0.3)
ax4.set_aspect('equal')

plt.tight_layout()
plt.show()

🔍 Key Insights from the Visualizations

Paraboloid (f(x,y) = x^2 + y^2):

  • Gradients always point away from the center (0,0)
  • Magnitude increases as you move away from center
  • This creates a "flow field" toward the minimum

Saddle Point (f(x,y) = x^2 - y^2):

  • Complex gradient pattern
  • Some directions go "uphill", others "downhill"
  • Center point (0,0) has zero gradient but is neither min nor max

Wavy Surface (f(x,y) = \sin(x)\cos(y)):

  • Multiple local maxima and minima
  • Gradients point toward nearby peaks
  • Shows why optimization can be challenging

🎯 Gradient Properties

  1. Direction: Always points toward steepest increase
  2. Magnitude: Tells you how steep the increase is
  3. Zero Gradient: Critical points (peaks, valleys, saddle points)
  4. Perpendicular to Contours: Gradients always cross level curves at right angles

🧠 Intuitive Understanding: Why Perpendicular to Contours?

Think about contour lines on a topographic map:

  • Contour lines connect points of equal elevation
  • If you walk along a contour line, your elevation doesn't change
  • The steepest uphill direction must be perpendicular to the contour

This is exactly what gradients do — they point perpendicular to contours, in the direction of steepest ascent!
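For f(x, y) = x^2 + y^2 the contours are circles, and the tangent to the circle at (x, y) points along (-y, x). If the gradient really is perpendicular to the contour, its dot product with that tangent must be zero. A quick check:

```python
# Gradient ⊥ contour check for f(x,y) = x² + y².
# On the level curve x² + y² = c, the tangent direction at (x, y) is (-y, x).
x0, y0 = 1.2, 0.8
grad = (2 * x0, 2 * y0)          # ∇f = (2x, 2y)
tangent = (-y0, x0)              # tangent to the circular contour

dot = grad[0] * tangent[0] + grad[1] * tangent[1]
print(dot)  # 0.0 — the gradient is perpendicular to the level curve
```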

🔬 Real-World Applications

Machine Learning - Gradient Descent:

  • Loss function J(\theta_1, \theta_2, \ldots) depends on the model parameters
  • Gradient \nabla J points toward increasing loss (the bad direction)
  • Move in the opposite direction: \theta \leftarrow \theta - \alpha \nabla J
  • Each step reduces the loss and improves the model

Physics - Force Fields:

  • Force is the negative gradient of potential energy: \mathbf{F} = -\nabla V
  • Particles naturally move toward lower potential energy
  • Examples: gravity, electric fields, magnetic fields

Engineering - Heat Flow:

  • Heat flows from hot to cold regions
  • Temperature gradient \nabla T points toward increasing temperature
  • Heat flow is proportional to -\nabla T (Fourier's law)

Computer Graphics:

  • Gradients compute surface normals for lighting calculations
  • Edge detection uses gradients to find rapid changes in image intensity

Gradients in Physics: Force Fields and Natural Laws

⚡ Forces from Potential Energy

One of the most beautiful applications of gradients is in physics, where forces are related to potential energy through:

\mathbf{F} = -\nabla V(x, y, z)

Why the negative sign?

  • Gradient points toward increasing potential energy
  • Forces point toward decreasing potential energy (systems naturally move to lower energy states)
  • Hence the negative sign

🌍 Gravitational Example

Gravitational potential energy near Earth's surface: V(h) = mgh

Gravitational force: F = -\frac{dV}{dh} = -mg

The negative sign indicates the force points downward (toward decreasing potential energy).

In 3D space, the gravitational potential energy of a mass m in the field of a mass M is: V(x, y, z) = -\frac{GMm}{\sqrt{x^2 + y^2 + z^2}}

The gravitational force is: \mathbf{F} = -\nabla V = -GMm\,\frac{(x, y, z)}{(x^2 + y^2 + z^2)^{3/2}}

This points toward the center of mass — exactly what we expect!
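The 3D formula can be sanity-checked with a finite-difference gradient of V. A sketch with G, M, m all set to 1 for simplicity (not a physics library, just a numeric check):

```python
import numpy as np

# Check F = -∇V for V = -GMm / r numerically (with GMm = 1 here).
def V(p):
    return -1.0 / np.linalg.norm(p)

def F_formula(p):
    r = np.linalg.norm(p)
    return -p / r**3                 # -GMm (x,y,z) / r³ with GMm = 1

p = np.array([1.0, 2.0, 2.0])        # test point with r = 3
h = 1e-6

# Central-difference gradient of V, one coordinate at a time, then negate
grad_V = np.array([
    (V(p + h * e) - V(p - h * e)) / (2 * h)
    for e in np.eye(3)
])
F_numeric = -grad_V

print(np.allclose(F_numeric, F_formula(p), atol=1e-8))  # True
```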

🔌 Electric Fields

Electric potential V(x, y, z) creates an electric field: \mathbf{E} = -\nabla V

Example: Point charge Q at the origin

  • Potential: V(x, y, z) = \frac{kQ}{\sqrt{x^2 + y^2 + z^2}}
  • Electric field: \mathbf{E} = \frac{kQ}{r^3}(x, y, z), where r = \sqrt{x^2 + y^2 + z^2} (points radially outward for positive Q)

🌡️ Heat Flow

Fourier's Law of heat conduction: \mathbf{q} = -k\nabla T

Where:

  • \mathbf{q} = heat flux (energy per unit area per unit time)
  • k = thermal conductivity
  • \nabla T = temperature gradient

Physical meaning: Heat flows from hot to cold regions, proportional to the temperature gradient.


Gradients in Machine Learning: The Engine of AI

🎯 The Optimization Problem

Machine learning is fundamentally an optimization problem:

  1. Define a loss function J(\theta_1, \theta_2, \ldots, \theta_n) that measures how "wrong" your model is
  2. Find parameter values \theta that minimize this loss
  3. Use gradients to guide your search for the minimum

🚀 Gradient Descent: Following the Steepest Path Downhill

Basic idea: If gradients point "uphill", then negative gradients point "downhill" toward minima.

Update rule: \theta \leftarrow \theta - \alpha\nabla J(\theta)

Where:

  • \alpha = learning rate (how big a step to take)
  • \nabla J(\theta) = gradient of the loss function
  • Minus sign = move in the opposite direction of the gradient (downhill)
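In one dimension, applied to J(\theta) = \theta^2, the update rule reduces to a few lines. A minimal sketch (learning rate and starting point chosen for illustration):

```python
# Minimal gradient descent on J(θ) = θ², whose gradient is J'(θ) = 2θ.
theta = 5.0       # arbitrary starting point
alpha = 0.1       # learning rate

for _ in range(100):
    grad = 2 * theta               # ∇J(θ)
    theta = theta - alpha * grad   # θ ← θ - α∇J(θ)

print(abs(theta) < 1e-6)  # True: θ has converged to the minimum at 0
```

Each iteration multiplies θ by (1 - 2α) = 0.8, so the parameter shrinks geometrically toward the minimum.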

🔍 Interactive Gradient Descent Visualization

Let's create a comprehensive visualization showing how gradient descent works:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def gradient_descent_visualization():
    # Define different loss functions to explore
    def rosenbrock(x, y):
        """Rosenbrock function - challenging optimization landscape"""
        return (1 - x)**2 + 100 * (y - x**2)**2

    def rosenbrock_grad(x, y):
        dx = -2*(1 - x) - 400*x*(y - x**2)
        dy = 200*(y - x**2)
        return dx, dy

    def simple_quadratic(x, y):
        """Simple bowl-shaped function"""
        return x**2 + y**2

    def simple_grad(x, y):
        return 2*x, 2*y

    def saddle_point(x, y):
        """Saddle point function"""
        return x**2 - y**2

    def saddle_grad(x, y):
        return 2*x, -2*y

    # Choose function to optimize
    func = simple_quadratic
    grad_func = simple_grad
    x_range, y_range = (-3, 3), (-3, 3)

    # Create grid for contour plot
    x = np.linspace(x_range[0], x_range[1], 100)
    y = np.linspace(y_range[0], y_range[1], 100)
    X, Y = np.meshgrid(x, y)
    Z = func(X, Y)

    # Gradient descent with different learning rates
    def run_gradient_descent(start_point, lr, steps):
        path = [start_point]
        point = np.array(start_point, dtype=float)

        for _ in range(steps):
            grad = np.array(grad_func(point[0], point[1]))
            point = point - lr * grad
            path.append(point.copy())

            # Stop if gradient is very small (near minimum)
            if np.linalg.norm(grad) < 1e-6:
                break

        return np.array(path)

    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Different learning rates and starting points
    scenarios = [
        {'lr': 0.01, 'start': [2.5, 2.5], 'color': 'red', 'title': 'Small LR (0.01)'},
        {'lr': 0.1, 'start': [2.5, 2.5], 'color': 'blue', 'title': 'Good LR (0.1)'},
        {'lr': 0.5, 'start': [2.5, 2.5], 'color': 'green', 'title': 'Large LR (0.5)'},
        {'lr': 0.1, 'start': [-2, 1.5], 'color': 'purple', 'title': 'Different Start'}
    ]

    for i, scenario in enumerate(scenarios):
        ax = axes[i//2, i%2]

        # Plot contour
        contour = ax.contour(X, Y, Z, levels=20, colors='gray', alpha=0.4)
        ax.clabel(contour, inline=True, fontsize=8, fmt='%.1f')

        # Run gradient descent
        path = run_gradient_descent(scenario['start'], scenario['lr'], 100)

        # Plot path
        ax.plot(path[:, 0], path[:, 1], 'o-', color=scenario['color'],
                linewidth=2, markersize=4, alpha=0.8, label='GD Path')
        ax.plot(path[0, 0], path[0, 1], 'o', color=scenario['color'],
                markersize=10, label='Start')
        ax.plot(path[-1, 0], path[-1, 1], 's', color=scenario['color'],
                markersize=10, label='End')

        # Add gradient arrows at a few points
        if len(path) > 5:
            for j in range(0, min(len(path)-1, 20), 5):
                x_pt, y_pt = path[j]
                dx, dy = grad_func(x_pt, y_pt)
                # Normalize for visualization
                norm = np.sqrt(dx**2 + dy**2)
                if norm > 1e-6:
                    dx, dy = dx/norm * 0.2, dy/norm * 0.2
                    ax.arrow(x_pt, y_pt, -dx, -dy, head_width=0.1,
                            head_length=0.05, fc='black', ec='black', alpha=0.6)

        ax.set_xlim(x_range)
        ax.set_ylim(y_range)
        ax.set_xlabel('θ₁')
        ax.set_ylabel('θ₂')
        ax.set_title(f'{scenario["title"]}\nSteps: {len(path)-1}, Final loss: {func(path[-1, 0], path[-1, 1]):.3f}')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8)
        ax.set_aspect('equal')

    plt.tight_layout()
    plt.show()

    # Print analysis
    print("🎯 Gradient Descent Analysis:")
    print("=" * 50)
    for i, scenario in enumerate(scenarios):
        path = run_gradient_descent(scenario['start'], scenario['lr'], 100)
        print(f"{scenario['title']}: {len(path)-1} steps, final loss = {func(path[-1, 0], path[-1, 1]):.6f}")

gradient_descent_visualization()

🧠 Key Insights from the Visualization

Learning Rate Effects:

  • Too small (0.01): Very slow convergence, many steps needed
  • Just right (0.1): Efficient convergence in reasonable steps
  • Too large (0.5): May overshoot or oscillate

Starting Point: Different initial values can lead to different local minima in complex landscapes

🔬 Real ML Applications

Linear Regression:

  • Loss: J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
  • Gradients tell us how to adjust intercept and slope

Neural Networks:

  • Backpropagation computes gradients with respect to all weights and biases
  • Chain rule connects output error to input layer gradients

Logistic Regression:

  • Loss: J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]
  • Gradients guide classification boundary optimization

🎯 Advanced Optimization Algorithms

Momentum: v \leftarrow \beta v + (1-\beta)\nabla J, \qquad \theta \leftarrow \theta - \alpha v

Adam: Combines momentum with adaptive learning rates

All based on gradients — they just use gradient information more cleverly!
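The momentum update above, again on the toy loss J(\theta) = \theta^2, can be sketched in a few lines (\alpha and \beta are illustrative choices, not tuned values):

```python
# Momentum gradient descent on J(θ) = θ²:
#   v ← βv + (1-β)∇J(θ),   θ ← θ - αv
theta, v = 5.0, 0.0
alpha, beta = 0.1, 0.9

for _ in range(500):
    grad = 2 * theta
    v = beta * v + (1 - beta) * grad   # running average of gradients
    theta = theta - alpha * v          # step along the averaged direction

print(abs(theta) < 1e-6)  # converged near the minimum at 0
```

The velocity v smooths out the gradient history, which damps oscillations on badly scaled loss surfaces.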


The Jacobian: When Outputs Are Vectors Too

🔄 From Single Output to Multiple Outputs

So far, we've studied functions with multiple inputs and a single output: f(x, y, z) \rightarrow \text{a single number}

But what about functions with multiple inputs AND multiple outputs?

\mathbf{F}(x, y) = \begin{bmatrix} f_1(x, y) \\ f_2(x, y) \end{bmatrix} \rightarrow \text{a vector of numbers}

Examples:

  • Coordinate transformations: (x, y) \rightarrow (r, \theta) (Cartesian to polar)
  • Neural network layers: Input vector → Output vector
  • Physics: Position (x, y, z) → Velocity vector (v_x, v_y, v_z)

🧮 Mathematical Definition

For a vector-valued function \mathbf{F}: \mathbb{R}^n \rightarrow \mathbb{R}^m:

\mathbf{F}(x_1, x_2, \ldots, x_n) = \begin{bmatrix} f_1(x_1, x_2, \ldots, x_n) \\ f_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_m(x_1, x_2, \ldots, x_n) \end{bmatrix}

The Jacobian matrix is:

J(\mathbf{F}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the gradient of one output function.

🎯 Concrete Example: Coordinate Transformation

Cartesian to polar coordinates: \mathbf{F}(x, y) = \begin{bmatrix} r(x, y) \\ \theta(x, y) \end{bmatrix} = \begin{bmatrix} \sqrt{x^2 + y^2} \\ \arctan(y/x) \end{bmatrix}

Step 1: Find the partial derivatives of r(x, y) = \sqrt{x^2 + y^2}

  • \frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}}
  • \frac{\partial r}{\partial y} = \frac{y}{\sqrt{x^2 + y^2}}

Step 2: Find the partial derivatives of \theta(x, y) = \arctan(y/x)

  • \frac{\partial \theta}{\partial x} = \frac{-y}{x^2 + y^2}
  • \frac{\partial \theta}{\partial y} = \frac{x}{x^2 + y^2}

Step 3: Assemble the Jacobian

J(\mathbf{F}) = \begin{bmatrix} \frac{x}{\sqrt{x^2 + y^2}} & \frac{y}{\sqrt{x^2 + y^2}} \\ \frac{-y}{x^2 + y^2} & \frac{x}{x^2 + y^2} \end{bmatrix}
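The Step 3 result can be verified against finite differences. A sanity-check sketch at an arbitrary point in the right half-plane, using np.arctan2 (which agrees with \arctan(y/x) there):

```python
import numpy as np

# Verify the polar-coordinate Jacobian at a test point via finite differences.
def F(x, y):
    return np.array([np.hypot(x, y), np.arctan2(y, x)])  # (r, θ)

def J_formula(x, y):
    r = np.hypot(x, y)
    return np.array([[x / r,      y / r],
                     [-y / r**2,  x / r**2]])

x0, y0 = 2.0, 1.0
h = 1e-6

# Each column of the Jacobian is a finite-difference derivative of F
J_numeric = np.column_stack([
    (F(x0 + h, y0) - F(x0 - h, y0)) / (2 * h),   # ∂F/∂x column
    (F(x0, y0 + h) - F(x0, y0 - h)) / (2 * h),   # ∂F/∂y column
])

print(np.allclose(J_numeric, J_formula(x0, y0), atol=1e-8))  # True
```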

🔍 What Does the Jacobian Tell Us?

Linear approximation: Near a point (x_0, y_0), the function behaves like: \mathbf{F}(x_0 + \Delta x, y_0 + \Delta y) \approx \mathbf{F}(x_0, y_0) + J(\mathbf{F})\big|_{(x_0, y_0)} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}

Geometric interpretation: The Jacobian tells us how small regions get stretched, rotated, and skewed by the transformation.

🎨 Visualizing Jacobian Transformations

import numpy as np
import matplotlib.pyplot as plt

def visualize_jacobian_transformation():
    # Define a transformation: (x,y) -> (x + y, x - y)
    def transform(x, y):
        return x + y, x - y

    def jacobian_transform(x, y):
        # J = [[1, 1], [1, -1]]
        return np.array([[1, 1], [1, -1]])

    # Create a grid of points (unit square)
    x = np.array([0, 1, 1, 0, 0])  # Square vertices + closing
    y = np.array([0, 0, 1, 1, 0])

    # Transform the points
    x_new, y_new = transform(x, y)

    # Create visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Original space
    ax1 = axes[0]
    ax1.plot(x, y, 'b-o', linewidth=2, markersize=8, label='Original Square')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(-0.5, 2.5)
    ax1.set_ylim(-0.5, 2.5)
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.set_title('Original Space')
    ax1.legend()
    ax1.set_aspect('equal')

    # Transformed space
    ax2 = axes[1]
    ax2.plot(x_new, y_new, 'r-o', linewidth=2, markersize=8, label='Transformed')
    ax2.grid(True, alpha=0.3)
    ax2.set_xlim(-0.5, 2.5)
    ax2.set_ylim(-1.5, 1.5)
    ax2.set_xlabel('u = x + y')
    ax2.set_ylabel('v = x - y')
    ax2.set_title('Transformed Space')
    ax2.legend()
    ax2.set_aspect('equal')

    # Show both together
    ax3 = axes[2]
    ax3.plot(x, y, 'b-o', linewidth=2, markersize=8, label='Original', alpha=0.7)
    ax3.plot(x_new, y_new, 'r-o', linewidth=2, markersize=8, label='Transformed', alpha=0.7)

    # Draw transformation arrows
    for i in range(len(x)-1):  # Skip the last point (closing the square)
        ax3.arrow(x[i], y[i], x_new[i]-x[i], y_new[i]-y[i],
                 head_width=0.1, head_length=0.05, fc='green', ec='green', alpha=0.6)

    ax3.grid(True, alpha=0.3)
    ax3.set_xlim(-0.5, 2.5)
    ax3.set_ylim(-1.5, 2.5)
    ax3.set_xlabel('x / u')
    ax3.set_ylabel('y / v')
    ax3.set_title('Transformation Visualization')
    ax3.legend()

    plt.tight_layout()
    plt.show()

    # Print the Jacobian
    J = jacobian_transform(0, 0)  # Constant in this case
    print("Jacobian Matrix:")
    print(J)
    print(f"Determinant: {np.linalg.det(J)}")
    print("This transformation has area scaling factor of", abs(np.linalg.det(J)))

visualize_jacobian_transformation()

🔬 Applications in Machine Learning

Neural Networks:

  • Each layer is a function \mathbf{F}: \mathbb{R}^n \rightarrow \mathbb{R}^m
  • Backpropagation uses the chain rule with Jacobians
  • Gradients flow backwards through network via Jacobian matrices

Generative Models:

  • Transform simple noise \mathbf{z} into complex data \mathbf{x} = \mathbf{F}(\mathbf{z})
  • Jacobian determinant appears in probability calculations

Optimization:

  • Newton's method uses Jacobian for faster convergence
  • Constrained optimization uses Jacobians of constraint functions

Chapter 4 Summary

🎯 Key Concepts Mastered

1. Multivariable Functions

  • Why multiple variables: Real-world depends on many factors simultaneously
  • Visualization: 3D surfaces, contour plots, complex landscapes
  • Applications: Temperature fields, loss functions, force fields

2. Partial Derivatives

  • Core idea: Rate of change while holding other variables constant
  • Mountain analogy: Slope in one direction while staying on the same latitude/longitude
  • Computation: Treat other variables as constants, differentiate normally

3. Gradients - The Steepest Direction

  • Vector of partial derivatives: \nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)
  • Geometric meaning: Points toward steepest increase, perpendicular to contours
  • Magnitude: How steep the steepest direction is

4. Physics Applications

  • Force fields: \mathbf{F} = -\nabla V (forces from potential energy)
  • Heat flow: \mathbf{q} = -k\nabla T (heat flows down temperature gradients)
  • Electric fields: \mathbf{E} = -\nabla V (electric field from potential)

5. Machine Learning Applications

  • Gradient descent: \theta \leftarrow \theta - \alpha\nabla J(\theta)
  • Optimization: Following negative gradients to minimize loss
  • Learning rates: Balance between speed and stability

6. Jacobian Matrices

  • Multiple outputs: When functions return vectors, not just scalars
  • Linear approximation: How transformations behave locally
  • Applications: Neural networks, coordinate transformations, physics

🔗 Connections to Previous Chapters

  • Chapter 1: Exponential/logarithmic functions appear in multivariable contexts
  • Chapter 2: Partial derivatives extend single-variable derivative rules
  • Chapter 3: Multiple integrals (coming in advanced topics) use gradients

🎯 Applications Preview

Coming in later chapters:

  • Linear Algebra: Vectors and matrices provide the language for gradients and Jacobians
  • Optimization: Advanced algorithms beyond basic gradient descent
  • Machine Learning: Backpropagation, neural networks, deep learning
  • Statistics: Maximum likelihood estimation uses gradients

🧮 Key Formulas to Remember

\begin{aligned} \text{Partial derivative: } &\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h} \\ \text{Gradient: } &\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right) \\ \text{Gradient descent: } &\theta \leftarrow \theta - \alpha\nabla J(\theta) \\ \text{Physics force: } &\mathbf{F} = -\nabla V \\ \text{Jacobian: } &J_{ij} = \frac{\partial f_i}{\partial x_j} \end{aligned}

You now have the mathematical tools to understand and optimize complex systems where many variables interact — the foundation of modern AI and scientific computing! 🚀

