Mathematical Awakening: Connecting the Equations of Nature and Intelligence · Chapter 4 · 10 min read · code · math

Chapter 4: Multivariable Calculus & Gradients


Why This Chapter Matters

In Chapters 1-3, we explored calculus for functions with one input and one output — like temperature changing with time, or position changing with time. But the real world is far more complex!

Consider these scenarios:

  • Weather: Temperature depends on both your location (latitude, longitude) and time
  • Machine Learning: Your model's performance depends on thousands of parameters simultaneously
  • Physics: The electric field depends on your position in three-dimensional space
  • Medicine: Drug effectiveness depends on dosage, patient weight, age, genetics, and more

When we have multiple inputs affecting an output, we need multivariable calculus. This chapter teaches you how to understand and optimize systems where many things are changing at once — the foundation of modern machine learning, physics simulations, and engineering optimization.

What you'll master:

  • How to measure sensitivity when multiple factors are changing
  • How to find the steepest direction to climb a mountain (or minimize a loss function)
  • How gradient descent powers machine learning
  • How force fields work in physics
  • How to optimize complex systems with many variables

Functions of Multiple Variables: The Real World is Multi-Dimensional

🌡️ Temperature Example: Why One Variable Isn't Enough

Imagine you're a meteorologist trying to predict temperature. In our previous single-variable world, you might have said:

T(t) = 20 + 5\sin(t)

"Temperature depends only on time of day." But that's obviously incomplete! Temperature also depends on:

  • Location: It's colder at the North Pole than in Hawaii
  • Elevation: It's colder on top of a mountain
  • Season: January vs July makes a huge difference

So really, temperature is a function of multiple variables:

T(x, y, z, t) = \text{Temperature at position } (x, y, z) \text{ and time } t

📐 Mathematical Representation

A multivariable function takes multiple inputs and produces an output:

f(x, y, z, \ldots) = \text{some expression involving } x, y, z, \ldots

Examples:

Simple quadratic: f(x, y) = x^2 + y^2

  • Takes two inputs (x, y)
  • Outputs one number
  • Geometrically, this describes a paraboloid (like a bowl)

Distance function: d(x, y) = \sqrt{x^2 + y^2}

  • Distance from origin to point (x, y)
  • Always positive
  • Creates concentric circles of constant distance

Machine learning loss: J(\theta_1, \theta_2, \ldots, \theta_n) = \frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i)^2

  • Takes model parameters \theta_1, \theta_2, \ldots, \theta_n as inputs
  • Outputs how "wrong" the model is
  • We want to minimize this function
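The mean-squared-error loss above is easy to evaluate directly. A minimal sketch, using made-up predictions and targets (the arrays are illustrative, not from a real model):

```python
import numpy as np

# Hypothetical predictions and true values (illustrative data)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
y = np.array([3.0, -0.5, 2.0, 7.0])

# Mean squared error: J = (1/m) * sum((y_hat - y)^2)
m = len(y)
J = np.sum((y_hat - y) ** 2) / m
print(J)  # 0.375
```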

🎯 Visualizing Multivariable Functions

For functions of two variables, we can visualize them as 3D surfaces:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create a grid of x and y values
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)

# Define different functions
Z1 = X**2 + Y**2  # Paraboloid (bowl shape)
Z2 = np.sin(X) * np.cos(Y)  # Wavy surface
Z3 = X**2 - Y**2  # Saddle point

# Create subplot with three surfaces
fig = plt.figure(figsize=(15, 5))

# Paraboloid
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z1, cmap='viridis', alpha=0.7)
ax1.set_title('f(x,y) = x² + y²\n(Paraboloid)')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')

# Wavy surface
ax2 = fig.add_subplot(132, projection='3d')
ax2.plot_surface(X, Y, Z2, cmap='plasma', alpha=0.7)
ax2.set_title('f(x,y) = sin(x)cos(y)\n(Wavy Surface)')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.set_zlabel('f(x,y)')

# Saddle point
ax3 = fig.add_subplot(133, projection='3d')
ax3.plot_surface(X, Y, Z3, cmap='coolwarm', alpha=0.7)
ax3.set_title('f(x,y) = x² - y²\n(Saddle Point)')
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.set_zlabel('f(x,y)')

plt.tight_layout()
plt.show()

🔍 Understanding the Shapes

Paraboloid (f(x,y) = x^2 + y^2):

  • Bowl shape - has a clear minimum at (0, 0)
  • As you move away from center in any direction, the function value increases
  • This is like a "loss function" in ML - we want to find the bottom!

Saddle Point (f(x,y) = x^2 - y^2):

  • Horse saddle shape - goes up in the x-direction, down in the y-direction
  • At (0, 0), it's a minimum in one direction but a maximum in another
  • These are critical points that are neither minima nor maxima

Wavy Surface (f(x,y) = \sin(x)\cos(y)):

  • Complex landscape with many hills and valleys
  • Shows how functions can have multiple local minima and maxima
  • Common in real-world optimization problems

🎯 Why This Matters for Applications

Machine Learning: Your loss function might depend on thousands of parameters. Understanding the "shape" of this high-dimensional landscape helps you:

  • Find good minima (train better models)
  • Avoid getting stuck in bad local minima
  • Choose appropriate optimization algorithms

Physics: Force fields, electric fields, gravitational fields - all depend on position in 3D space

Engineering: Optimizing designs often involves many variables simultaneously - material properties, dimensions, costs, performance metrics

Medicine: Drug interactions depend on multiple factors - dosages of different medications, patient characteristics, timing


Partial Derivatives: Measuring Change While Holding Things Constant

🏔️ The Mountain Hiking Analogy

Imagine you're standing on a mountainside. The elevation depends on both your east-west position (x) and your north-south position (y):

h(x, y) = \text{elevation at position } (x, y)

Now, suppose you want to know: "If I take a small step east, how much will my elevation change?"

To answer this, you need to:

  1. Hold your north-south position fixed (don't move north or south)
  2. Take a tiny step east and see how elevation changes
  3. Measure the rate of change in that direction only

This is exactly what a partial derivative does!

🧮 Mathematical Definition

The partial derivative of f(x, y) with respect to x is:

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}

Key insight: Notice that y stays the same in both f(x+h, y) and f(x, y). We're only varying x.

🎯 Intuitive Understanding

Partial derivative with respect to x: \frac{\partial f}{\partial x}

  • "How fast does f change as I increase x, while keeping y fixed?"
  • It's like taking a regular derivative, but treating y as a constant

Partial derivative with respect to y: \frac{\partial f}{\partial y}

  • "How fast does f change as I increase y, while keeping x fixed?"
  • Treat x as a constant and take the derivative with respect to y

🧪 Step-by-Step Example

Let's compute partial derivatives for: f(x, y) = x^2 y + 3xy^2

Finding \frac{\partial f}{\partial x}:

Step 1: Treat y as a constant (like the number 5 or \pi)

Step 2: Differentiate with respect to x:

  • \frac{\partial}{\partial x}[x^2 y] = y \cdot \frac{\partial}{\partial x}[x^2] = y \cdot 2x = 2xy
  • \frac{\partial}{\partial x}[3xy^2] = 3y^2 \cdot \frac{\partial}{\partial x}[x] = 3y^2 \cdot 1 = 3y^2

Result: \frac{\partial f}{\partial x} = 2xy + 3y^2

Finding \frac{\partial f}{\partial y}:

Step 1: Treat x as a constant

Step 2: Differentiate with respect to y:

  • \frac{\partial}{\partial y}[x^2 y] = x^2 \cdot \frac{\partial}{\partial y}[y] = x^2 \cdot 1 = x^2
  • \frac{\partial}{\partial y}[3xy^2] = 3x \cdot \frac{\partial}{\partial y}[y^2] = 3x \cdot 2y = 6xy

Result: \frac{\partial f}{\partial y} = x^2 + 6xy
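Hand-computed partials like these can always be sanity-checked against central finite differences. A small sketch (the test point is arbitrary):

```python
# Numerically verify the hand-computed partials of f(x,y) = x²y + 3xy²
# using central finite differences.
def f(x, y):
    return x**2 * y + 3 * x * y**2

def df_dx(x, y):   # hand-derived: 2xy + 3y²
    return 2 * x * y + 3 * y**2

def df_dy(x, y):   # hand-derived: x² + 6xy
    return x**2 + 6 * x * y

h = 1e-6
x0, y0 = 1.3, -0.7

# Vary one variable at a time, holding the other fixed
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(abs(num_dx - df_dx(x0, y0)) < 1e-6)  # True
print(abs(num_dy - df_dy(x0, y0)) < 1e-6)  # True
```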

🎮 Interactive Understanding

Let's visualize how partial derivatives work:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define our function f(x,y) = x²y + 3xy²
def f(x, y):
    return x**2 * y + 3 * x * y**2

# Define partial derivatives
def df_dx(x, y):
    return 2*x*y + 3*y**2

def df_dy(x, y):
    return x**2 + 6*x*y

# Create a grid
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Pick a specific point
x0, y0 = 1, 0.5
z0 = f(x0, y0)

# Create the visualization
fig = plt.figure(figsize=(15, 5))

# Main 3D surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(X, Y, Z, alpha=0.6, cmap='viridis')
ax1.scatter([x0], [y0], [z0], color='red', s=100, label=f'Point ({x0}, {y0})')
ax1.set_title('f(x,y) = x²y + 3xy²')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_zlabel('f(x,y)')

# Slice holding y constant (showing ∂f/∂x)
ax2 = fig.add_subplot(132)
x_slice = np.linspace(-2, 2, 100)
y_fixed = y0
z_slice = f(x_slice, y_fixed)
ax2.plot(x_slice, z_slice, 'b-', linewidth=2, label=f'f(x, {y_fixed})')
ax2.scatter([x0], [z0], color='red', s=100, zorder=5)

# Draw tangent line at the point
slope_x = df_dx(x0, y0)
tangent_x = z0 + slope_x * (x_slice - x0)
ax2.plot(x_slice, tangent_x, 'r--', alpha=0.7,
         label=f'Tangent (slope = ∂f/∂x = {slope_x:.2f})')

ax2.set_title(f'Cross-section: y = {y_fixed} (constant)')
ax2.set_xlabel('x')
ax2.set_ylabel('f(x,y)')
ax2.legend()
ax2.grid(True)

# Slice holding x constant (showing ∂f/∂y)
ax3 = fig.add_subplot(133)
y_slice = np.linspace(-2, 2, 100)
x_fixed = x0
z_slice = f(x_fixed, y_slice)
ax3.plot(y_slice, z_slice, 'g-', linewidth=2, label=f'f({x_fixed}, y)')
ax3.scatter([y0], [z0], color='red', s=100, zorder=5)

# Draw tangent line at the point
slope_y = df_dy(x0, y0)
tangent_y = z0 + slope_y * (y_slice - y0)
ax3.plot(y_slice, tangent_y, 'r--', alpha=0.7,
         label=f'Tangent (slope = ∂f/∂y = {slope_y:.2f})')

ax3.set_title(f'Cross-section: x = {x_fixed} (constant)')
ax3.set_xlabel('y')
ax3.set_ylabel('f(x,y)')
ax3.legend()
ax3.grid(True)

plt.tight_layout()
plt.show()

print(f"At point ({x0}, {y0}):")
print(f"∂f/∂x = {df_dx(x0, y0)} (slope in x-direction)")
print(f"∂f/∂y = {df_dy(x0, y0)} (slope in y-direction)")

🧠 Conceptual Insight

Why partial derivatives matter:

  1. Sensitivity analysis: Which variables have the biggest impact on your function?
  2. Optimization: Which direction should you move to increase/decrease the function?
  3. Approximation: How does the function behave near a specific point?

🔬 Real-World Applications

Machine Learning:

  • If J(\theta_1, \theta_2) is your loss function, then:
    • \frac{\partial J}{\partial \theta_1} tells you how to adjust parameter \theta_1
    • \frac{\partial J}{\partial \theta_2} tells you how to adjust parameter \theta_2

Physics:

  • Electric field: \mathbf{E} = -\nabla V, where V(x, y, z) is the electric potential
  • Each component E_x = -\frac{\partial V}{\partial x} gives the field in that direction

Economics:

  • Production function P(L, K) depends on labor and capital
  • \frac{\partial P}{\partial L} = marginal productivity of labor
  • \frac{\partial P}{\partial K} = marginal productivity of capital

Medicine:

  • Drug effectiveness E(d_1, d_2, w, a) depends on the two doses, patient weight, and age
  • \frac{\partial E}{\partial d_1} shows how sensitive effectiveness is to the first drug's dosage

The Gradient: The "Steepest Uphill" Vector

🧭 The Mountain Climbing Analogy

You're standing on a mountainside in dense fog. You can't see very far, but you have a magical compass that always points in the direction you should walk to climb upward as quickly as possible.

This magical compass is the gradient!

Here's what it tells you:

  1. Direction: Which way to face to climb most steeply upward

  2. Magnitude: How steep the climb is in that direction

    • Large gradient = very steep terrain
    • Small gradient = gentle slope
    • Zero gradient = you're at a flat point (a peak, a valley, or a saddle)

📊 Mathematical Definition

The gradient of a function is the vector of all its partial derivatives.

For 2D functions: \nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)

For 3D functions: \nabla f(x, y, z) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)

🎯 Why Gradients Point "Uphill"

Let's understand this intuitively. Suppose you're at point (x_0, y_0) and you want to move a small distance in the direction (\cos\theta, \sin\theta).

The directional derivative (rate of change in that direction) is:

D_\theta f = \frac{\partial f}{\partial x}\cos\theta + \frac{\partial f}{\partial y}\sin\theta = \nabla f \cdot (\cos\theta, \sin\theta)

This is the dot product of the gradient with your direction vector!

Key insight: The dot product is maximized when the two vectors point in the same direction. So \nabla f points in the direction of maximum increase.
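This claim can be checked numerically: sample many unit directions for f(x, y) = x^2 + y^2 at a test point and see which one gives the largest directional derivative. A small sketch (the test point is chosen arbitrarily):

```python
import numpy as np

# Directional derivative of f(x,y) = x² + y² at a point, for many angles.
x0, y0 = 1.0, 2.0
grad = np.array([2 * x0, 2 * y0])            # ∇f = (2x, 2y)

thetas = np.linspace(0, 2 * np.pi, 3600)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # unit vectors
D = dirs @ grad                              # D_θ f = ∇f · (cosθ, sinθ)

best = dirs[np.argmax(D)]                    # direction of largest increase
grad_unit = grad / np.linalg.norm(grad)
print(np.allclose(best, grad_unit, atol=1e-2))  # True: best ≈ ∇f / |∇f|
```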

🧪 Step-by-Step Example

Let's compute the gradient of f(x, y) = x^2 + y^2:

Step 1: Find partial derivatives

  • \frac{\partial f}{\partial x} = 2x
  • \frac{\partial f}{\partial y} = 2y

Step 2: Combine into the gradient vector: \nabla f = (2x, 2y)

Step 3: Interpret at specific points

  • At (1, 1): \nabla f = (2, 2) → points in the (1, 1) direction, away from the origin
  • At (-1, 2): \nabla f = (-2, 4) → points in the (-1, 2) direction, again away from the origin
  • At (0, 0): \nabla f = (0, 0) → no preferred direction (we're at the minimum!)

🎨 Visualizing Gradient Fields

Let's create beautiful visualizations to understand gradients:

import numpy as np
import matplotlib.pyplot as plt

# Create a grid of points
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)

# Function: f(x,y) = x² + y²
Z = X**2 + Y**2

# Gradient components
dX = 2 * X  # ∂f/∂x = 2x
dY = 2 * Y  # ∂f/∂y = 2y

# Create the visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Contour plot with gradient vectors
ax1 = axes[0, 0]
contour = ax1.contour(X, Y, Z, levels=10, colors='gray', alpha=0.5)
ax1.clabel(contour, inline=True, fontsize=8)
ax1.quiver(X, Y, dX, dY, color='red', alpha=0.8, scale=50)
ax1.set_title('Gradients on Contour Plot\nf(x,y) = x² + y²')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.grid(True, alpha=0.3)
ax1.set_aspect('equal')

# 2. Gradient magnitude
ax2 = axes[0, 1]
magnitude = np.sqrt(dX**2 + dY**2)
im = ax2.imshow(magnitude, extent=[-3, 3, -3, 3], origin='lower', cmap='hot')
ax2.contour(X, Y, magnitude, colors='white', alpha=0.5)
plt.colorbar(im, ax=ax2, label='|∇f|')
ax2.set_title('Gradient Magnitude\n|∇f| = √((2x)² + (2y)²)')
ax2.set_xlabel('x')
ax2.set_ylabel('y')

# 3. Different function: f(x,y) = x² - y² (saddle point)
Z2 = X**2 - Y**2
dX2 = 2 * X
dY2 = -2 * Y

ax3 = axes[1, 0]
contour2 = ax3.contour(X, Y, Z2, levels=15, colors='gray', alpha=0.5)
ax3.quiver(X, Y, dX2, dY2, color='blue', alpha=0.8, scale=50)
ax3.set_title('Saddle Point Function\nf(x,y) = x² - y²')
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.grid(True, alpha=0.3)
ax3.set_aspect('equal')

# 4. Wavy function: f(x,y) = sin(x)cos(y)
Z3 = np.sin(X) * np.cos(Y)
dX3 = np.cos(X) * np.cos(Y)
dY3 = -np.sin(X) * np.sin(Y)

ax4 = axes[1, 1]
contour3 = ax4.contour(X, Y, Z3, levels=10, colors='gray', alpha=0.5)
ax4.quiver(X, Y, dX3, dY3, color='green', alpha=0.8, scale=20)
ax4.set_title('Complex Landscape\nf(x,y) = sin(x)cos(y)')
ax4.set_xlabel('x')
ax4.set_ylabel('y')
ax4.grid(True, alpha=0.3)
ax4.set_aspect('equal')

plt.tight_layout()
plt.show()

🔍 Key Insights from the Visualizations

Paraboloid (f(x,y) = x^2 + y^2):

  • Gradients always point away from the center (0,0)
  • Magnitude increases as you move away from center
  • This creates a "flow field" toward the minimum

Saddle Point (f(x,y) = x^2 - y^2):

  • Complex gradient pattern
  • Some directions go "uphill", others "downhill"
  • Center point (0,0) has zero gradient but is neither min nor max

Wavy Surface (f(x,y) = \sin(x)\cos(y)):

  • Multiple local maxima and minima
  • Gradients point toward nearby peaks
  • Shows why optimization can be challenging

🎯 Gradient Properties

  1. Direction: Always points toward steepest increase
  2. Magnitude: Tells you how steep the increase is
  3. Zero Gradient: Critical points (peaks, valleys, saddle points)
  4. Perpendicular to Contours: Gradients always cross level curves at right angles

🧠 Intuitive Understanding: Why Perpendicular to Contours?

Think about contour lines on a topographic map:

  • Contour lines connect points of equal elevation
  • If you walk along a contour line, your elevation doesn't change
  • The steepest uphill direction must be perpendicular to the contour

This is exactly what gradients do — they point perpendicular to contours, in the direction of steepest ascent!
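For f(x, y) = x^2 + y^2 the contours are circles, and the tangent to the circle at (x, y) points along (-y, x). If the gradient really is perpendicular to the contour, its dot product with that tangent must be zero. A quick check:

```python
# Gradient ⊥ contour check for f(x,y) = x² + y².
# On the level curve x² + y² = c, the tangent direction at (x, y) is (-y, x).
x0, y0 = 1.2, 0.8
grad = (2 * x0, 2 * y0)          # ∇f = (2x, 2y)
tangent = (-y0, x0)              # tangent to the circular contour

dot = grad[0] * tangent[0] + grad[1] * tangent[1]
print(dot)  # 0.0 — the gradient is perpendicular to the level curve
```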

🔬 Real-World Applications

Machine Learning - Gradient Descent:

  • Loss function J(\theta_1, \theta_2, \ldots) depends on the model parameters
  • Gradient \nabla J points toward increasing loss (the bad direction)
  • Move in the opposite direction: \theta \leftarrow \theta - \alpha \nabla J
  • Each step reduces the loss and improves the model

Physics - Force Fields:

  • Force is the negative gradient of potential energy: \mathbf{F} = -\nabla V
  • Particles naturally move toward lower potential energy
  • Examples: gravity, electric fields, magnetic fields

Engineering - Heat Flow:

  • Heat flows from hot to cold regions
  • Temperature gradient \nabla T points toward increasing temperature
  • Heat flow is proportional to -\nabla T (Fourier's law)

Computer Graphics:

  • Gradients compute surface normals for lighting calculations
  • Edge detection uses gradients to find rapid changes in image intensity

Gradients in Physics: Force Fields and Natural Laws

⚡ Forces from Potential Energy

One of the most beautiful applications of gradients is in physics, where forces are related to potential energy through:

\mathbf{F} = -\nabla V(x, y, z)

Why the negative sign?

  • Gradient points toward increasing potential energy
  • Forces point toward decreasing potential energy (systems naturally move to lower energy states)
  • Hence the negative sign

🌍 Gravitational Example

Gravitational potential energy near Earth's surface: V(h) = mgh

Gravitational force: F = -\frac{dV}{dh} = -mg

The negative sign indicates the force points downward (toward decreasing potential energy).

In 3D space, the gravitational potential energy of a mass m in the field of a mass M is: V(x, y, z) = -\frac{GMm}{\sqrt{x^2 + y^2 + z^2}}

The gravitational force is: \mathbf{F} = -\nabla V = -GMm\,\frac{(x, y, z)}{(x^2 + y^2 + z^2)^{3/2}}

This points toward the center of mass — exactly what we expect!
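The 3D formula can be sanity-checked with a finite-difference gradient of V. A sketch with G, M, m all set to 1 for simplicity (not a physics library, just a numeric check):

```python
import numpy as np

# Check F = -∇V for V = -GMm / r numerically (with GMm = 1 here).
def V(p):
    return -1.0 / np.linalg.norm(p)

def F_formula(p):
    r = np.linalg.norm(p)
    return -p / r**3                 # -GMm (x,y,z) / r³ with GMm = 1

p = np.array([1.0, 2.0, 2.0])        # test point with r = 3
h = 1e-6

# Central-difference gradient of V, one coordinate at a time, then negate
grad_V = np.array([
    (V(p + h * e) - V(p - h * e)) / (2 * h)
    for e in np.eye(3)
])
F_numeric = -grad_V

print(np.allclose(F_numeric, F_formula(p), atol=1e-8))  # True
```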

🔌 Electric Fields

Electric potential V(x, y, z) creates an electric field: \mathbf{E} = -\nabla V

Example: Point charge Q at the origin

  • Potential: V(x, y, z) = \frac{kQ}{\sqrt{x^2 + y^2 + z^2}}
  • Electric field: \mathbf{E} = \frac{kQ}{r^3}(x, y, z), where r = \sqrt{x^2 + y^2 + z^2} (points radially outward for positive Q)

🌡️ Heat Flow

Fourier's Law of heat conduction: \mathbf{q} = -k\nabla T

Where:

  • \mathbf{q} = heat flux (energy per unit area per unit time)
  • k = thermal conductivity
  • \nabla T = temperature gradient

Physical meaning: Heat flows from hot to cold regions, proportional to the temperature gradient.


Gradients in Machine Learning: The Engine of AI

🎯 The Optimization Problem

Machine learning is fundamentally an optimization problem:

  1. Define a loss function J(\theta_1, \theta_2, \ldots, \theta_n) that measures how "wrong" your model is
  2. Find parameter values \theta that minimize this loss
  3. Use gradients to guide your search for the minimum

🚀 Gradient Descent: Following the Steepest Path Downhill

Basic idea: If gradients point "uphill", then negative gradients point "downhill" toward minima.

Update rule: \theta \leftarrow \theta - \alpha\nabla J(\theta)

Where:

  • \alpha = learning rate (how big a step to take)
  • \nabla J(\theta) = gradient of the loss function
  • Minus sign = move in the opposite direction of the gradient (downhill)
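In one dimension, applied to J(\theta) = \theta^2, the update rule reduces to a few lines. A minimal sketch (learning rate and starting point chosen for illustration):

```python
# Minimal gradient descent on J(θ) = θ², whose gradient is J'(θ) = 2θ.
theta = 5.0       # arbitrary starting point
alpha = 0.1       # learning rate

for _ in range(100):
    grad = 2 * theta               # ∇J(θ)
    theta = theta - alpha * grad   # θ ← θ - α∇J(θ)

print(abs(theta) < 1e-6)  # True: θ has converged to the minimum at 0
```

Each iteration multiplies θ by (1 - 2α) = 0.8, so the parameter shrinks geometrically toward the minimum.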

🔍 Interactive Gradient Descent Visualization

Let's create a comprehensive visualization showing how gradient descent works:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def gradient_descent_visualization():
    # Define different loss functions to explore
    def rosenbrock(x, y):
        """Rosenbrock function - challenging optimization landscape"""
        return (1 - x)**2 + 100 * (y - x**2)**2

    def rosenbrock_grad(x, y):
        dx = -2*(1 - x) - 400*x*(y - x**2)
        dy = 200*(y - x**2)
        return dx, dy

    def simple_quadratic(x, y):
        """Simple bowl-shaped function"""
        return x**2 + y**2

    def simple_grad(x, y):
        return 2*x, 2*y

    def saddle_point(x, y):
        """Saddle point function"""
        return x**2 - y**2

    def saddle_grad(x, y):
        return 2*x, -2*y

    # Choose function to optimize
    func = simple_quadratic
    grad_func = simple_grad
    x_range, y_range = (-3, 3), (-3, 3)

    # Create grid for contour plot
    x = np.linspace(x_range[0], x_range[1], 100)
    y = np.linspace(y_range[0], y_range[1], 100)
    X, Y = np.meshgrid(x, y)
    Z = func(X, Y)

    # Gradient descent with different learning rates
    def run_gradient_descent(start_point, lr, steps):
        path = [start_point]
        point = np.array(start_point, dtype=float)

        for _ in range(steps):
            grad = np.array(grad_func(point[0], point[1]))
            point = point - lr * grad
            path.append(point.copy())

            # Stop if gradient is very small (near minimum)
            if np.linalg.norm(grad) < 1e-6:
                break

        return np.array(path)

    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Different learning rates and starting points
    scenarios = [
        {'lr': 0.01, 'start': [2.5, 2.5], 'color': 'red', 'title': 'Small LR (0.01)'},
        {'lr': 0.1, 'start': [2.5, 2.5], 'color': 'blue', 'title': 'Good LR (0.1)'},
        {'lr': 0.5, 'start': [2.5, 2.5], 'color': 'green', 'title': 'Large LR (0.5)'},
        {'lr': 0.1, 'start': [-2, 1.5], 'color': 'purple', 'title': 'Different Start'}
    ]

    for i, scenario in enumerate(scenarios):
        ax = axes[i//2, i%2]

        # Plot contour
        contour = ax.contour(X, Y, Z, levels=20, colors='gray', alpha=0.4)
        ax.clabel(contour, inline=True, fontsize=8, fmt='%.1f')

        # Run gradient descent
        path = run_gradient_descent(scenario['start'], scenario['lr'], 100)

        # Plot path
        ax.plot(path[:, 0], path[:, 1], 'o-', color=scenario['color'],
                linewidth=2, markersize=4, alpha=0.8, label='GD Path')
        ax.plot(path[0, 0], path[0, 1], 'o', color=scenario['color'],
                markersize=10, label='Start')
        ax.plot(path[-1, 0], path[-1, 1], 's', color=scenario['color'],
                markersize=10, label='End')

        # Add gradient arrows at a few points
        if len(path) > 5:
            for j in range(0, min(len(path)-1, 20), 5):
                x_pt, y_pt = path[j]
                dx, dy = grad_func(x_pt, y_pt)
                # Normalize for visualization
                norm = np.sqrt(dx**2 + dy**2)
                if norm > 1e-6:
                    dx, dy = dx/norm * 0.2, dy/norm * 0.2
                    ax.arrow(x_pt, y_pt, -dx, -dy, head_width=0.1,
                            head_length=0.05, fc='black', ec='black', alpha=0.6)

        ax.set_xlim(x_range)
        ax.set_ylim(y_range)
        ax.set_xlabel('θ₁')
        ax.set_ylabel('θ₂')
        ax.set_title(f'{scenario["title"]}\nSteps: {len(path)-1}, Final loss: {func(path[-1, 0], path[-1, 1]):.3f}')
        ax.grid(True, alpha=0.3)
        ax.legend(fontsize=8)
        ax.set_aspect('equal')

    plt.tight_layout()
    plt.show()

    # Print analysis
    print("🎯 Gradient Descent Analysis:")
    print("=" * 50)
    for i, scenario in enumerate(scenarios):
        path = run_gradient_descent(scenario['start'], scenario['lr'], 100)
        print(f"{scenario['title']}: {len(path)-1} steps, final loss = {func(path[-1, 0], path[-1, 1]):.6f}")

gradient_descent_visualization()

🧠 Key Insights from the Visualization

Learning Rate Effects:

  • Too small (0.01): Very slow convergence, many steps needed
  • Just right (0.1): Efficient convergence in reasonable steps
  • Too large (0.5): May overshoot or oscillate

Starting Point: Different initial values can lead to different local minima in complex landscapes

🔬 Real ML Applications

Linear Regression:

  • Loss: J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
  • Gradients tell us how to adjust intercept and slope

Neural Networks:

  • Backpropagation computes gradients with respect to all weights and biases
  • Chain rule connects output error to input layer gradients

Logistic Regression:

  • Loss: J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]
  • Gradients guide classification boundary optimization

🎯 Advanced Optimization Algorithms

Momentum: v \leftarrow \beta v + (1-\beta)\nabla J, \qquad \theta \leftarrow \theta - \alpha v

Adam: Combines momentum with adaptive learning rates

All based on gradients — they just use gradient information more cleverly!
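The momentum update above, again on the toy loss J(\theta) = \theta^2, can be sketched in a few lines (\alpha and \beta are illustrative choices, not tuned values):

```python
# Momentum gradient descent on J(θ) = θ²:
#   v ← βv + (1-β)∇J(θ),   θ ← θ - αv
theta, v = 5.0, 0.0
alpha, beta = 0.1, 0.9

for _ in range(500):
    grad = 2 * theta
    v = beta * v + (1 - beta) * grad   # running average of gradients
    theta = theta - alpha * v          # step along the averaged direction

print(abs(theta) < 1e-6)  # converged near the minimum at 0
```

The velocity v smooths out the gradient history, which damps oscillations on badly scaled loss surfaces.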


The Jacobian: When Outputs Are Vectors Too

🔄 From Single Output to Multiple Outputs

So far, we've studied functions with multiple inputs and a single output: f(x, y, z) \rightarrow \text{a single number}

But what about functions with multiple inputs AND multiple outputs?

\mathbf{F}(x, y) = \begin{bmatrix} f_1(x, y) \\ f_2(x, y) \end{bmatrix} \rightarrow \text{a vector of numbers}

Examples:

  • Coordinate transformations: (x, y) \rightarrow (r, \theta) (Cartesian to polar)
  • Neural network layers: Input vector → Output vector
  • Physics: Position (x, y, z) → Velocity vector (v_x, v_y, v_z)

🧮 Mathematical Definition

For a vector-valued function \mathbf{F}: \mathbb{R}^n \rightarrow \mathbb{R}^m:

\mathbf{F}(x_1, x_2, \ldots, x_n) = \begin{bmatrix} f_1(x_1, x_2, \ldots, x_n) \\ f_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_m(x_1, x_2, \ldots, x_n) \end{bmatrix}

The Jacobian matrix is:

J(\mathbf{F}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

Each row is the gradient of one output function.

🎯 Concrete Example: Coordinate Transformation

Cartesian to polar coordinates: \mathbf{F}(x, y) = \begin{bmatrix} r(x, y) \\ \theta(x, y) \end{bmatrix} = \begin{bmatrix} \sqrt{x^2 + y^2} \\ \arctan(y/x) \end{bmatrix}

Step 1: Find the partial derivatives of r(x, y) = \sqrt{x^2 + y^2}

  • \frac{\partial r}{\partial x} = \frac{x}{\sqrt{x^2 + y^2}}
  • \frac{\partial r}{\partial y} = \frac{y}{\sqrt{x^2 + y^2}}

Step 2: Find the partial derivatives of \theta(x, y) = \arctan(y/x)

  • \frac{\partial \theta}{\partial x} = \frac{-y}{x^2 + y^2}
  • \frac{\partial \theta}{\partial y} = \frac{x}{x^2 + y^2}

Step 3: Assemble the Jacobian

J(\mathbf{F}) = \begin{bmatrix} \frac{x}{\sqrt{x^2 + y^2}} & \frac{y}{\sqrt{x^2 + y^2}} \\ \frac{-y}{x^2 + y^2} & \frac{x}{x^2 + y^2} \end{bmatrix}
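The Step 3 result can be verified against finite differences. A sanity-check sketch at an arbitrary point in the right half-plane, using np.arctan2 (which agrees with \arctan(y/x) there):

```python
import numpy as np

# Verify the polar-coordinate Jacobian at a test point via finite differences.
def F(x, y):
    return np.array([np.hypot(x, y), np.arctan2(y, x)])  # (r, θ)

def J_formula(x, y):
    r = np.hypot(x, y)
    return np.array([[x / r,      y / r],
                     [-y / r**2,  x / r**2]])

x0, y0 = 2.0, 1.0
h = 1e-6

# Each column of the Jacobian is a finite-difference derivative of F
J_numeric = np.column_stack([
    (F(x0 + h, y0) - F(x0 - h, y0)) / (2 * h),   # ∂F/∂x column
    (F(x0, y0 + h) - F(x0, y0 - h)) / (2 * h),   # ∂F/∂y column
])

print(np.allclose(J_numeric, J_formula(x0, y0), atol=1e-8))  # True
```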

🔍 What Does the Jacobian Tell Us?

Linear approximation: Near a point (x_0, y_0), the function behaves like: \mathbf{F}(x_0 + \Delta x, y_0 + \Delta y) \approx \mathbf{F}(x_0, y_0) + J(\mathbf{F})\big|_{(x_0, y_0)} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}

Geometric interpretation: The Jacobian tells us how small regions get stretched, rotated, and skewed by the transformation.

🎨 Visualizing Jacobian Transformations

import numpy as np
import matplotlib.pyplot as plt

def visualize_jacobian_transformation():
    # Define a transformation: (x,y) -> (x + y, x - y)
    def transform(x, y):
        return x + y, x - y

    def jacobian_transform(x, y):
        # J = [[1, 1], [1, -1]]
        return np.array([[1, 1], [1, -1]])

    # Create a grid of points (unit square)
    x = np.array([0, 1, 1, 0, 0])  # Square vertices + closing
    y = np.array([0, 0, 1, 1, 0])

    # Transform the points
    x_new, y_new = transform(x, y)

    # Create visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Original space
    ax1 = axes[0]
    ax1.plot(x, y, 'b-o', linewidth=2, markersize=8, label='Original Square')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(-0.5, 2.5)
    ax1.set_ylim(-0.5, 2.5)
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.set_title('Original Space')
    ax1.legend()
    ax1.set_aspect('equal')

    # Transformed space
    ax2 = axes[1]
    ax2.plot(x_new, y_new, 'r-o', linewidth=2, markersize=8, label='Transformed')
    ax2.grid(True, alpha=0.3)
    ax2.set_xlim(-0.5, 2.5)
    ax2.set_ylim(-1.5, 1.5)
    ax2.set_xlabel('u = x + y')
    ax2.set_ylabel('v = x - y')
    ax2.set_title('Transformed Space')
    ax2.legend()
    ax2.set_aspect('equal')

    # Show both together
    ax3 = axes[2]
    ax3.plot(x, y, 'b-o', linewidth=2, markersize=8, label='Original', alpha=0.7)
    ax3.plot(x_new, y_new, 'r-o', linewidth=2, markersize=8, label='Transformed', alpha=0.7)

    # Draw transformation arrows
    for i in range(len(x)-1):  # Skip the last point (closing the square)
        ax3.arrow(x[i], y[i], x_new[i]-x[i], y_new[i]-y[i],
                 head_width=0.1, head_length=0.05, fc='green', ec='green', alpha=0.6)

    ax3.grid(True, alpha=0.3)
    ax3.set_xlim(-0.5, 2.5)
    ax3.set_ylim(-1.5, 2.5)
    ax3.set_xlabel('x / u')
    ax3.set_ylabel('y / v')
    ax3.set_title('Transformation Visualization')
    ax3.legend()

    plt.tight_layout()
    plt.show()

    # Print the Jacobian
    J = jacobian_transform(0, 0)  # Constant in this case
    print("Jacobian Matrix:")
    print(J)
    print(f"Determinant: {np.linalg.det(J)}")
    print("This transformation has area scaling factor of", abs(np.linalg.det(J)))

visualize_jacobian_transformation()

🔬 Applications in Machine Learning

Neural Networks:

  • Each layer is a function \mathbf{F}: \mathbb{R}^n \rightarrow \mathbb{R}^m
  • Backpropagation uses the chain rule with Jacobians
  • Gradients flow backwards through network via Jacobian matrices

Generative Models:

  • Transform simple noise \mathbf{z} into complex data \mathbf{x} = \mathbf{F}(\mathbf{z})
  • Jacobian determinant appears in probability calculations

Optimization:

  • Newton's method uses Jacobian for faster convergence
  • Constrained optimization uses Jacobians of constraint functions

Chapter 4 Summary

🎯 Key Concepts Mastered

1. Multivariable Functions

  • Why multiple variables: Real-world depends on many factors simultaneously
  • Visualization: 3D surfaces, contour plots, complex landscapes
  • Applications: Temperature fields, loss functions, force fields

2. Partial Derivatives

  • Core idea: Rate of change while holding other variables constant
  • Mountain analogy: Slope in one direction while staying on the same latitude/longitude
  • Computation: Treat other variables as constants, differentiate normally

3. Gradients - The Steepest Direction

  • Vector of partial derivatives: \nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)
  • Geometric meaning: Points toward steepest increase, perpendicular to contours
  • Magnitude: How steep the steepest direction is

4. Physics Applications

  • Force fields: \mathbf{F} = -\nabla V (forces from potential energy)
  • Heat flow: \mathbf{q} = -k\nabla T (heat flows down temperature gradients)
  • Electric fields: \mathbf{E} = -\nabla V (electric field from potential)

5. Machine Learning Applications

  • Gradient descent: \theta \leftarrow \theta - \alpha\nabla J(\theta)
  • Optimization: Following negative gradients to minimize loss
  • Learning rates: Balance between speed and stability

6. Jacobian Matrices

  • Multiple outputs: When functions return vectors, not just scalars
  • Linear approximation: How transformations behave locally
  • Applications: Neural networks, coordinate transformations, physics

🔗 Connections to Previous Chapters

  • Chapter 1: Exponential/logarithmic functions appear in multivariable contexts
  • Chapter 2: Partial derivatives extend single-variable derivative rules
  • Chapter 3: Multiple integrals (coming in advanced topics) use gradients

🎯 Applications Preview

Coming in later chapters:

  • Linear Algebra: Vectors and matrices provide the language for gradients and Jacobians
  • Optimization: Advanced algorithms beyond basic gradient descent
  • Machine Learning: Backpropagation, neural networks, deep learning
  • Statistics: Maximum likelihood estimation uses gradients

🧮 Key Formulas to Remember

\begin{aligned} \text{Partial derivative: } &\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h} \\ \text{Gradient: } &\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right) \\ \text{Gradient descent: } &\theta \leftarrow \theta - \alpha\nabla J(\theta) \\ \text{Physics force: } &\mathbf{F} = -\nabla V \\ \text{Jacobian: } &J_{ij} = \frac{\partial f_i}{\partial x_j} \end{aligned}

You now have the mathematical tools to understand and optimize complex systems where many variables interact — the foundation of modern AI and scientific computing! 🚀

