Announcements & Reminders
Quizzes
All grading is done for Quiz 1
Quiz 2 is next Monday, March 25, 8:00 to 23:59, same setup
The midterm is scheduled for 18:30 to 19:45 (1 hour + 15 min reading time) on Monday, April 15 in Melville Hall
Multivariate calculus#
Today we continue to get comfortable with the multidimensional space of real numbers \(\mathbb{R}^N\)
Main goal: learn and practice multidimensional differentiation
But first
Refresh definition of derivative and differentiability
Revision of univariate differentiation#
Derivatives#
Definition
Let \(f \colon X \to \mathbb{R}\)
Let \(H\) be the set of all convergent sequences \(\{h_n\}\) such that \(h_n \ne 0\) and \(h_n \to 0\)
If there exists a constant \(f'(x)\) such that

\[
\lim_{n \to \infty} \frac{f(x+h_n) - f(x)}{h_n} = f'(x)
\]

for every \(\{h_n\} \in H\), then
\(f\) is said to be differentiable at \(x\)
\(f'(x)\) is called the derivative of \(f\) at \(x\)
We will use the following notations for the derivative interchangeably: \(f'(x)\) and \(\frac{d f}{d x}(x)\)
Definition
A function \(f \colon X \to \mathbb{R}\) is called differentiable (on \(X\)) if it is differentiable at all points of \(X\)
Example
Let \(f \colon \mathbb{R} \to \mathbb{R}\) be defined by \(f(x) = x^2\)
Fix any \(x \in \mathbb{R}\) and any \(h_n \to 0\) as \(n \to \infty\)
We have

\[
\frac{f(x+h_n) - f(x)}{h_n} = \frac{(x+h_n)^2 - x^2}{h_n} = \frac{2x h_n + h_n^2}{h_n} = 2x + h_n \to 2x,
\]

so \(f'(x) = 2x\)
Example
Let \(f \colon \mathbb{R} \to \mathbb{R}\) be defined by \(f(x) = |x|\)
This function is not differentiable at \(x=0\)
Indeed, if \(h_n = 1/n\), then

\[
\frac{|0 + h_n| - |0|}{h_n} = \frac{1/n}{1/n} = 1 \to 1
\]

On the other hand, if \(h_n = -1/n\), then

\[
\frac{|0 + h_n| - |0|}{h_n} = \frac{1/n}{-1/n} = -1 \to -1
\]
Therefore the definition of the derivative is not satisfied
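As a quick numerical illustration of these two examples, here is a minimal NumPy sketch (the step sizes and evaluation points are chosen purely for illustration): the difference quotient of \(x^2\) settles down to \(2x\), while for \(|x|\) at \(x=0\) the quotient depends on the sign of \(h\), so no single limit exists.

```python
import numpy as np

def diff_quotient(f, x, h):
    """Difference quotient (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# f(x) = x^2: the quotient approaches 2x as h -> 0
x = 1.5
for h in [0.1, 0.01, 0.001, 1e-6]:
    print(h, diff_quotient(lambda t: t ** 2, x, h))   # tends to 2 * 1.5 = 3.0

# f(x) = |x| at x = 0: the limit depends on the sign of h, so f'(0) does not exist
for h in [1e-3, -1e-3]:
    print(h, diff_quotient(abs, 0.0, h))              # gives +1.0 and -1.0
```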
Heine and Cauchy definitions of the limit for functions#
As a side note, let's mention that there are two ways to define the limit of a function \(f\).
In the last lecture, when talking about continuity, and just above, in the definition of the derivative, we implicitly used the definition of the function limit through limits of sequences, due to Eduard Heine
Definition (Limit of a function due to Heine)
Given the function \(f: A \subset \mathbb{R}^N \to \mathbb{R}\), \(a \in \mathbb{R}\) is a limit of \(f(x)\) as \(x \to x_0 \in A\) if

\[
f(x_n) \to a \text{ for every sequence } \{x_n\} \subset A \setminus \{x_0\} \text{ such that } x_n \to x_0
\]
The alternative definition of the limit is due to Augustin-Louis Cauchy, also known as \(\epsilon\)-\(\delta\) definition
Definition (Limit of a function due to Cauchy)
Given the function \(f: A \subset \mathbb{R}^N \to \mathbb{R}\), \(a \in \mathbb{R}\) is a limit of \(f(x)\) as \(x \to x_0 \in A\) if

\[
\forall \epsilon > 0 \;\; \exists \delta > 0 \colon \quad 0 < \|x - x_0\| < \delta \implies |f(x) - a| < \epsilon
\]
note the similarity to the definition of the limit of a sequence
note that in the \(\delta\)-ball around \(x_0\) the point \(x_0\) itself is excluded
Fact
Cauchy and Heine definitions of the function limit are equivalent
Therefore we can use the same notation \(\lim_{x \to x_0} f(x) = a\) for the limit of a function, regardless of which definition is used
Differentiation rules#
Fact
For functions \(f(x)\), \(g(x)\) and \(h(x)\), and constant \(c\), we have
Scalar Multiplication Rule: \(\big(c f(x)\big)' = c f'(x)\)
Summation Rule: \(\big(f(x) + g(x)\big)' = f'(x) + g'(x)\)
Product Rule: \(\big(f(x) g(x)\big)' = f'(x) g(x) + f(x) g'(x)\)
Quotient Rule: \(\left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x) g(x) - f(x) g'(x)}{g(x)^2}\) (provided \(g(x) \ne 0\))
Chain Rule: if \(h(x) = f\big(g(x)\big)\), then \(h'(x) = f'\big(g(x)\big)\, g'(x)\)
The Inverse Function Rule: \(\big(f^{-1}\big)'(y) = \frac{1}{f'\big(f^{-1}(y)\big)}\) (provided \(f'\big(f^{-1}(y)\big) \ne 0\))
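These rules can be verified symbolically; below is a small SymPy sketch (the particular \(f\) and \(g\) are illustrative choices, not taken from the lecture) checking the product, quotient and chain rules.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)          # illustrative choice of f
g = x ** 2 + 1         # illustrative choice of g

# product rule: (f g)' = f' g + f g'
assert sp.simplify(sp.diff(f * g, x) - (sp.diff(f, x) * g + f * sp.diff(g, x))) == 0

# quotient rule: (f / g)' = (f' g - f g') / g^2
assert sp.simplify(sp.diff(f / g, x) - (sp.diff(f, x) * g - f * sp.diff(g, x)) / g ** 2) == 0

# chain rule: (f(g(x)))' = f'(g(x)) * g'(x)
assert sp.simplify(sp.diff(f.subs(x, g), x) - sp.diff(f, x).subs(x, g) * sp.diff(g, x)) == 0
```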
Differentiability and continuity#
Just a reminder: by definition a function \(f: X \to \mathbb{R}\) is continuous at \(a\) if \(f(x) \to f(a)\) as \(x \to a\)
Fact
A differentiable function \(f \colon X \to \mathbb{R}\) is continuous on \(X\).
The converse is not true in general.
Proof
Consider a function \(f: X \longrightarrow \mathbb{R}\) where \(X \subseteq \mathbb{R}\). Suppose that

\[
f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}
\]

exists.
We want to show that this implies that \(f(x)\) is continuous at the point \(a \in X\). The following proof of this proposition is drawn from [Ayres Jr and Mendelson, 2013] (Chapter 8, Solved Problem 2).
First, note that

\[
f(x) - f(a) = \frac{f(x) - f(a)}{x - a}\,(x - a) \quad \text{for } x \ne a
\]

Thus we have

\[
\lim_{x \to a} \big(f(x) - f(a)\big) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a} \cdot \lim_{x \to a} (x - a) = f'(a) \cdot \lim_{x \to a} (x - a)
\]

Now note that

\[
\lim_{x \to a} (x - a) = 0
\]

Upon combining these two results, we obtain

\[
\lim_{x \to a} \big(f(x) - f(a)\big) = f'(a) \cdot 0 = 0
\]

Finally, note that

\[
\lim_{x \to a} f(x) = \lim_{x \to a} \big(f(x) - f(a)\big) + f(a)
\]

Thus we have

\[
\lim_{x \to a} f(x) = f(a)
\]
This means that \(f(x)\) is continuous at the point \(x=a\).
Consider the function
(There is no problem with this double definition at the point \(x=1\) because the two parts of the function are equal at that point.)
This function is continuous at \(x=1\) because
and
However, this function is not differentiable at \(x=1\) because
and
Since
we know that
does not exist.
\(\blacksquare\)
Higher order derivatives#
Consider a derivative \(f'(x)\) of a function \(f(x)\) as a new function \(g(x)\). We can then take the derivative of \(g(x)\) to obtain a second order derivative of \(f(x)\), denoted by \(f''(x)\).
Definition
Let \(f \colon X \to \mathbb{R}\) and let \(x \in (a, b) \subseteq X\)
A second order derivative of \(f\) at \(x\) denoted \(f''(x)\) or \(\frac{d^2 f}{dxdx}\) is the (first order) derivative of the function \(f'(x)\), if it exists
More generally, the \(k\)-th order derivative of \(f\) at \(x\), denoted \(f^{(k)}(x)\) or \(\frac{d^k f}{dx \dots dx}\), is the derivative of the function \(\frac{d^{k-1} f}{dx \dots dx}\), if it exists
Definition
The function \(f: X \to \mathbb{R}\) is said to be of differentiability class \(C^m\) if derivatives \(f'\), \(f''\), \(\dots\), \(f^{(m)}\) exist and are continuous on \(X\)
Taylor series#
Fact
Consider \(f: X \to \mathbb{R}\) and let \(f\) be a \(C^m\) function. Assume also that \(f^{(m+1)}\) exists, although it may not necessarily be continuous.
For any \(x, a \in X\) there exists \(z\) between \(a\) and \(x\) such that

\[
f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(m)}(a)}{m!}(x-a)^m + R_{m+1}(x)
\]

where the remainder is given by

\[
R_{m+1}(x) = \frac{f^{(m+1)}(z)}{(m+1)!}(x-a)^{m+1}
\]
Definition
Little-o notation is used to describe functions that approach zero faster than a given function
Loosely speaking, if \(f \colon \mathbb{R} \to \mathbb{R}\) is suitably differentiable at \(a\), then

\[
f(x) = f(a) + f'(a)(x-a) + o(|x-a|)
\]

for \(x\) very close to \(a\),

\[
f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2 + o(|x-a|^2)
\]

on a slightly wider interval, etc.
These are the 1st and 2nd order Taylor series approximations to \(f\) at \(a\) respectively
As the order increases, we get a better approximation near \(a\)
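A small numerical sketch of this idea (assuming an illustrative function \(f(x) = e^x\) and point \(a = 0\), not taken from the lecture): as \(x\) approaches \(a\), the second-order approximation error shrinks much faster than the first-order one.

```python
import numpy as np

# Taylor approximations of f(x) = exp(x) around a = 0, where f(a) = f'(a) = f''(a) = 1
a = 0.0
for x in [0.5, 0.1, 0.01]:
    t1 = 1.0 + (x - a)                   # 1st order approximation
    t2 = t1 + 0.5 * (x - a) ** 2         # 2nd order approximation
    print(x, abs(np.exp(x) - t1), abs(np.exp(x) - t2))   # 2nd order error is much smaller
```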
Multivariate differentiation#
Functions from \(\mathbb{R}^N\) to \(\mathbb{R}\)#
Extending the one-dimensional definition of derivative to multi-variate case is tricky!
With multiple variables, we have to decide which one to increment with the elements of the sequence \(\{h_n\}\)
We use the \(i\)-th unit vector of the form \((0, \dots, 0, 1, 0, \dots, 0)\), where \(1\) is in the \(i\)-th position, to single out one particular variable
Definition
Let \(f \colon A \to \mathbb{R}\) and let \(x \in A \subset \mathbb{R}^N\). Denote \(e_i\) the \(i\)-th unit vector in \(\mathbb{R}^N\), i.e. \(e_i = (0, \dots, 0, 1, 0, \dots, 0)\) where \(1\) is in the \(i\)-th position.
If there exists a constant \(a \in \mathbb{R}\) such that

\[
\lim_{n \to \infty} \frac{f(x + h_n e_i) - f(x)}{h_n} = a
\]

for every sequence \(\{h_n\}\), \(h_n \in \mathbb{R}\), such that \(h_n \ne 0\) and \(h_n \to 0\) as \(n \to \infty\), then \(a\) is called a partial derivative of \(f\) with respect to \(x_i\) at \(x\), and is denoted \(\frac{\partial f}{\partial x_i}(x)\)
Note the slight difference in notation \(\frac{d f}{d x}(x)\) of univariate derivative and \(\frac{\partial f}{\partial x_i}(x)\) of partial derivative
Partial derivatives behave just like derivatives of univariate functions, with all variables except one treated as constants
Usual rules of differentiation apply
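As a sketch of the definition, partial derivatives can be approximated numerically by perturbing one coordinate at a time along the unit vector \(e_i\) (the function below is an illustrative choice, not the lecture's example):

```python
import numpy as np

def partial(f, x, i, h=1e-6):
    """Finite-difference approximation of the i-th partial derivative of f at x."""
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    return (f(x + h * e_i) - f(x)) / h

# illustrative function f(x1, x2, x3) = x1^2 * x2 + x3
f = lambda v: v[0] ** 2 * v[1] + v[2]
x = np.array([1.0, 2.0, 3.0])

# compare with the exact partials 2*x1*x2 = 4, x1^2 = 1, and 1
print([partial(f, x, i) for i in range(3)])
```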
Example
Partial derivatives are
However, partial derivatives alone do not take us all the way:
a full multivariate analogue of the univariate derivative should allow for all sequences converging to zero in \(\mathbb{R}^N\), not only those along the coordinate axes
what is the analogue of differentiability in the multivariate case, with the same implication for continuity?
Definition
Let
\(A \subset \mathbb{R}^N\) be an open set in \(N\)-dimensional space
\(f \colon A \to \mathbb{R}\) be a multivariate function
\(x \in A\), and so \(\exists \epsilon \colon B_{\epsilon}(x) \subset A\)
Then if there exists a vector \(g \in \mathbb{R}^N\) such that

\[
\lim_{n \to \infty} \frac{f(x+h_n) - f(x) - g \cdot h_n}{\|h_n\|} = 0
\]

for all sequences \(\{h_n\}\) converging to zero, \(h_n \in B_{\epsilon}(0) \setminus \{0\}\), \(h_n \to 0\), then
\(f\) is said to be differentiable at \(x\)
vector \(g\) is called the gradient of \(f\) at \(x\), and is denoted by \(\nabla f(x)\) or simply \(f'(x)\)
note the dot product of \(g\) and \(h_n\) in the definition
the gradient is a vector of partial derivatives
Fact
If a function \(f \colon \mathbb{R}^N \to \mathbb{R}\) is differentiable at \(x\), then all partial derivatives at \(x\) exist and the gradient is given by

\[
\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \frac{\partial f}{\partial x_2}(x), \dots, \frac{\partial f}{\partial x_N}(x) \right)
\]
Proof
The idea of the proof: consider all convergent sequences in \(\mathbb{R}^N\) and the corresponding sequences in each dimension, component-wise.
Example
Continuing the previous example, the gradient is
Now, evaluating at particular points (bold \({\bf 1}\) and \({\bf 0}\) denote the unit vector and the zero vector in \(\mathbb{R}^3\))
Fact
Sufficient conditions for differentiability of a function \(f \colon \mathbb{R}^N \to \mathbb{R}\) at \(x\) are that all partial derivatives exist and are continuous in \(x\)
Proof
The idea of the proof: consider all converging sequences in \(\mathbb{R}^N\) as composed component-wise from the converging sequences in each dimension, and verify that due to continuity, the vector of partial derivatives forms the vector \(g\) required by the definition of differentiability.
Directional derivatives#
To understand the role of the gradient of an \(\mathbb{R}^N \to \mathbb{R}\) function, we need to introduce the concept of the directional derivative
A direction in \(\mathbb{R}^N\) can be given by any vector; after all, a vector has
absolute value (length)
direction
But we should focus on unit vectors, i.e. vectors of length 1
Definition
Let
\(A \subset \mathbb{R}^N\) be an open set in \(N\)-dimensional space
\(f \colon A \to \mathbb{R}\) be a multivariate function
\(x \in A\), and so \(\exists \epsilon \colon B_{\epsilon}(x) \subset A\)
\(v \in \mathbb{R}^N\) is a unit vector, i.e. \(\|v\| = 1\)
Then if there exists a constant \(D_v f(x) \in \mathbb{R}\) such that

\[
\lim_{h \to 0} \frac{f(x + hv) - f(x)}{h} = D_v f(x)
\]
then it is called a directional derivative of \(f\) at \(x\) in the direction of \(v\)
note that the step size \(h\) here is scalar
the directional derivative is a scalar as well
the whole expression \(\frac{f(x + hv) - f(x)}{h}\) can be thought of as a univariate function of \(h\)
the directional derivative is a simple (univariate) derivative of this function
Example
Let \(v = (\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}},0)\), that is we are interested in the derivative along the \(45^\circ\) line in the \(x\)-\(y\) plane. Note that \(\|v\|=1\).
After substitution of \(x+hv\) into \(f\), we get the function
Differentiating \(g(x,h)\) with respect to \(h\) gives
Finally, because we are interested in the derivative at the point \(x\) corresponding to \(h=0\), we have
Fact
The directional derivative of a differentiable function \(f \colon \mathbb{R}^N \to \mathbb{R}\) at \(x\) in the direction \(v\) is given by

\[
D_v f(x) = \nabla f(x) \cdot v
\]
Proof
Main idea: carefully convert the limit from the definition of differentiability of \(f\) to the univariate limit in the definition of the directional derivative. The latter happens to equal the dot product of the gradient and the direction vector.
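A hedged numerical check of this fact (with an illustrative function and point, not the lecture's example): the univariate difference quotient in \(h\) agrees with the dot product \(\nabla f(x) \cdot v\).

```python
import numpy as np

# illustrative f(x1, x2, x3) = x1^2 * x2 + x3 with gradient (2*x1*x2, x1^2, 1)
f = lambda v: v[0] ** 2 * v[1] + v[2]
grad_f = lambda v: np.array([2 * v[0] * v[1], v[0] ** 2, 1.0])

x = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)   # unit direction vector, ||v|| = 1

h = 1e-6
numeric = (f(x + h * v) - f(x)) / h          # difference quotient in the scalar step h
formula = grad_f(x) @ v                      # D_v f(x) = grad f(x) . v

print(numeric, formula)                      # both approximately 3.5355
```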
Example
Continuing the previous example, we have \(v = (\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}},0)\) and
Therefore, applying the formula \(D_v f(x) = \nabla f(x) \cdot v\) we immediately have
Fact
The direction of the fastest ascent of a differentiable function \(f \colon \mathbb{R}^N \to \mathbb{R}\) at \(x\) is given by the direction of its gradient \(\nabla f(x)\).
Proof
First, the Cauchy–Bunyakovsky–Schwarz inequality for the Euclidean space \(\mathbb{R}^N\) states that for any two vectors \(u, v \in \mathbb{R}^N\) we have

\[
|u \cdot v| \le \|u\| \cdot \|v\|
\]

where "\(\cdot\)" is the dot product and \(\|x\| = \sqrt{x \cdot x}\) is the standard Euclidean norm.
Consider the case when \(\nabla f(x) \ne 0\); otherwise \(D_v f(x) = 0\) in every direction, implying that the function has no direction of increase at \(x\).
For any unit direction vector \(v\) (in the same general direction as \(\nabla f(x)\) so that \(\nabla f(x) \cdot v > 0\)), we have

\[
D_v f(x) = \nabla f(x) \cdot v \le \|\nabla f(x)\| \cdot \|v\| = \|\nabla f(x)\|
\]

The inequality is due to Cauchy–Bunyakovsky–Schwarz.
Now consider \(v = \frac{\nabla f(x)}{\|\nabla f(x)\|}\). For this direction vector we have

\[
D_v f(x) = \nabla f(x) \cdot \frac{\nabla f(x)}{\|\nabla f(x)\|} = \frac{\|\nabla f(x)\|^2}{\|\nabla f(x)\|} = \|\nabla f(x)\|
\]
Thus, the upper bound on \(D_v f(x)\) is achieved when vector \(v\) has the same direction as gradient vector \(\nabla f(x)\). \(\blacksquare\)
The steepest descent algorithm moves in the direction opposite to the gradient
Stochastic gradient descent is based on the same idea (ML and AI applications!)
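A minimal gradient-descent sketch (the quadratic objective and step size below are illustrative assumptions, not part of the lecture): each step moves against the gradient, the direction of fastest descent.

```python
import numpy as np

# illustrative convex objective f(x, y) = (x - 1)^2 + 2 * (y + 0.5)^2
grad_f = lambda v: np.array([2 * (v[0] - 1.0), 4 * (v[1] + 0.5)])

x = np.array([5.0, 5.0])        # starting point
step = 0.1                      # fixed step size (learning rate)

for _ in range(100):
    x = x - step * grad_f(x)    # move opposite to the gradient

print(x)                        # approaches the minimiser (1, -0.5)
```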
Functions from \(\mathbb{R}^N\) to \(\mathbb{R}^K\)#
Let's now consider a more complex case: functions from a multi-dimensional space to a multi-dimensional space
Definition
The vector-valued function \(f \colon \mathbb{R}^N \to \mathbb{R}^K\) is a tuple of \(K\) functions \(f_j \colon \mathbb{R}^N \to \mathbb{R}\), \(j \in \{1, \dots, K\}\)
We shall typically assume that the vector-valued functions return a column-vector
Definition
Let
\(A \subset \mathbb{R}^N\) be an open set in \(N\)-dimensional space.
\(f \colon A \to \mathbb{R}^K\) be a \(K\)-dimensional vector-valued function
\(x \in A\), and so \(\exists \epsilon \colon B_{\epsilon}(x) \subset A\)
Then if there exists a \((K \times N)\) matrix \(J\) such that

\[
\lim_{n \to \infty} \frac{\|f(x+h_n) - f(x) - J h_n\|}{\|h_n\|} = 0
\]

for all sequences \(\{h_n\}\) converging to zero, \(h_n \in B_{\epsilon}(0) \setminus \{0\}\), \(h_n \to 0\), then
\(f\) is said to be differentiable at \(x\)
matrix \(J\) is called the total derivative (or Jacobian matrix) of \(f\) at \(x\), and is denoted by \(Df(x)\) or simply \(f'(x)\)
in this definition \(\{h_n\}\) is again a sequence in \(\mathbb{R}^N\) converging to zero
the difference is that in the numerator we have to use the norm to measure distances in the \(\mathbb{R}^K\) space
instead of the gradient vector \(\nabla f\) we have a \((K \times N)\) Jacobian matrix \(J\)
Like the gradient, the Jacobian is composed of partial derivatives
Fact
If a vector-valued function \(f \colon \mathbb{R}^N \to \mathbb{R}^K\) is differentiable at \(x\), then all partial derivatives at \(x\) exist and the total derivative (Jacobian) is given by

\[
Df(x) =
\begin{pmatrix}
\frac{\partial f_1}{\partial x_1}(x) & \frac{\partial f_1}{\partial x_2}(x) & \dots & \frac{\partial f_1}{\partial x_N}(x) \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_K}{\partial x_1}(x) & \frac{\partial f_K}{\partial x_2}(x) & \dots & \frac{\partial f_K}{\partial x_N}(x)
\end{pmatrix}
\]
note that partial derivatives of a vector-valued function \(f\) are indexed with two subscripts: \(i\) for \(f\) and \(j\) for \(x\)
index \(i \in \{1,\dots,K\}\) indexes functions and rows of the Jacobian matrix
index \(j \in \{1,\dots,N\}\) indexes variables and columns of the Jacobian matrix
Jacobian matrix is \((K \times N)\), i.e. number of functions in the tuple by the number of variables
the gradient is a special case of Jacobian with \(K=1\), as expected!
The sufficient conditions of differentiability are similar for vector-valued functions
Fact
Sufficient conditions for differentiability of a function \(f \colon \mathbb{R}^N \to \mathbb{R}^K\) at \(x\) are that all partial derivatives exist and are continuous in \(x\)
The total derivative of \(f \colon \mathbb{R}^N \to \mathbb{R}^K\) defines a linear map \(J \colon \mathbb{R}^N \to \mathbb{R}^K\) given by the \((K \times N)\) matrix \(Df(x)\), which yields an affine approximation of the function \(f\) by a tangent hyperplane at the point \(x\). This is similar to the tangent line to a one-dimensional function determined by its derivative at a given point.
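This tangent-hyperplane interpretation can be illustrated numerically: for a small step \(h\), \(f(x+h) \approx f(x) + Df(x)\,h\). The function below and its hand-written Jacobian are illustrative assumptions, not the lecture's example.

```python
import numpy as np

# illustrative f : R^3 -> R^2, f(x) = (x1 * x2, x2 + x3^2), with its (2 x 3) Jacobian
f = lambda v: np.array([v[0] * v[1], v[1] + v[2] ** 2])
Df = lambda v: np.array([[v[1], v[0], 0.0],
                         [0.0,  1.0, 2.0 * v[2]]])

x = np.array([1.0, 2.0, 3.0])
h = np.array([1e-3, -2e-3, 1e-3])            # a small perturbation

exact = f(x + h)
affine = f(x) + Df(x) @ h                    # affine (tangent-hyperplane) approximation

print(np.max(np.abs(exact - affine)))        # error is of order ||h||^2
```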
Example
Partial derivatives are
Jacobian matrix is
Now, evaluating at particular points (bold \({\bf 1}\) and \({\bf 0}\) denote the unit vector and the zero vector in \(\mathbb{R}^3\))
Example
Exercise: Derive the Jacobian matrix of \(f\) and compute the total derivative of \(f\) at \(x = (1, 0, 2)\)
Rules of differentiation#
Fact: Sum and scaling
Let \(A\) denote an open set in \(\mathbb{R}^N\) and let \(f, g \colon A \to \mathbb{R}^K\) be differentiable functions at \(x \in A\).
\(f+g\) is differentiable at \(x\) and \(D(f+g)(x) = Df(x) + Dg(x)\)
\(cf\) is differentiable at \(x\) and \(D(cf)(x) = c Df(x)\) for any scalar \(c\)
Example
Jacobian matrix is
Consider the function \(g=f+f \colon \mathbb{R}^3 \to \mathbb{R}^2\). It is given by
Partial derivatives are
Jacobian matrix is
Clearly,
When it comes to the product rule, we have to specify which product is meant:
scalar product: multiplication of vector-valued function with a scalar function
dot product: multiplication of two vector-valued functions resulting in a scalar function
outer product: multiplication of a column-vector-valued function and a row-vector-valued function, resulting in a matrix-valued function. The derivative of a matrix-valued function is a 3D array of partial derivatives, so we will skip this case
Definition
The outer product of two vectors \(x \in \mathbb{R}^N\) and \(y \in \mathbb{R}^K\) is an \(N \times K\) matrix given by

\[
x y^T =
\begin{pmatrix}
x_1 y_1 & x_1 y_2 & \dots & x_1 y_K \\
\vdots & \vdots & & \vdots \\
x_N y_1 & x_N y_2 & \dots & x_N y_K
\end{pmatrix}
\]
Note
Potential confusion with products continues: in the current context, "scalar product" means multiplication by a scalar-valued function (and not a product that produces a scalar, as in the dot product of vectors above)
Fact: Scalar product rule
Let
\(A\) denote an open set in \(\mathbb{R}^N\)
\(f \colon A \to \mathbb{R}\), differentiable at \(x \in A\)
\(g \colon A \to \mathbb{R}^K\), differentiable at \(x \in A\)
Then the product function \(fg \colon A \to \mathbb{R}^K\) is differentiable at \(x\) and

\[
D(fg)(x) = g(x) \nabla f(x) + f(x) Dg(x)
\]
Observe that
\(\nabla f(x)\) is a \((1 \times N)\) vector
\(g(x)\) is a \((K \times 1)\) vector
therefore, the outer product \(g(x) \nabla f(x)\) is a \((K \times N)\) matrix
\(f(x)\) is scalar
\(Dg(x)\) is a \((K \times N)\) matrix
therefore \(f(x) Dg(x)\) is a \((K \times N)\) matrix
Example
Observe first that
Find all partials
Then the right hand side of the rule is
Exercise: continue to verify that the LHS of the rule gives identical expressions
Fact: Dot product rule
Let \(A\) denote an open set in \(\mathbb{R}^N\) and let \(f,g \colon A \to \mathbb{R}^K\) be both differentiable functions at \(x \in A\). Assume that both functions output column vectors.
Then the dot product function \(f \cdot g \colon A \to \mathbb{R}\) is differentiable at \(x\) and

\[
D(f \cdot g)(x) = f(x)^T Dg(x) + g(x)^T Df(x)
\]
where \(\square^{T}\) denotes transposition of a vector.
Observe that
both \(Dg(x)\) and \(Df(x)\) are \((K \times N)\) matrices
both \(f(x)^T\) and \(g(x)^T\) are \((1 \times K)\) row-vectors
therefore, both \(f(x)^T Dg(x)\) and \(g(x)^T Df(x)\) are \((1 \times N)\) row-vectors
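A hedged numerical sketch of the dot product rule (with illustrative \(f, g \colon \mathbb{R}^2 \to \mathbb{R}^2\), not the lecture's example): the finite-difference gradient of \(x \mapsto f(x) \cdot g(x)\) matches \(f(x)^T Dg(x) + g(x)^T Df(x)\).

```python
import numpy as np

# illustrative functions and their Jacobians, written out by hand
f  = lambda v: np.array([v[0] ** 2, v[1]])
g  = lambda v: np.array([v[1], v[0] * v[1]])
Df = lambda v: np.array([[2 * v[0], 0.0], [0.0, 1.0]])
Dg = lambda v: np.array([[0.0, 1.0], [v[1], v[0]]])

x = np.array([1.5, -2.0])

# right-hand side of the rule: f(x)^T Dg(x) + g(x)^T Df(x), a (1 x N) row vector
rhs = f(x) @ Dg(x) + g(x) @ Df(x)

# left-hand side: numerical gradient of the scalar function x -> f(x) . g(x)
dot = lambda v: f(v) @ g(v)
h = 1e-6
lhs = np.array([(dot(x + h * e) - dot(x)) / h for e in np.eye(2)])

print(lhs, rhs)   # agree up to finite-difference error
```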
Example
The dot product of the functions is given by
All the partials are
Now, to verify the rule,
Compositions of multivariate functions#
With multivariate functions, to correctly specify a composition, we have to ensure that
domains and co-domains correspond properly (as in the one-dimensional case)
dimensions of the functions correspond properly
Definition
Let \(A\) denote an open set in \(\mathbb{R}^N\) and \(B\) denote an open set in \(\mathbb{R}^K\)
Let \(f \colon A \to B\), and \(g \colon B \to \mathbb{R}^L\)
Then \(g \circ f\) is a well-defined composition of functions \(g\) and \(f\), and is given by

\[
(g \circ f)(x) = g\big(f(x)\big), \quad x \in A
\]
this definition works for any \(N\), \(K\), and \(L\)
including \(L=1\), corresponding to a scalar-valued composition that contains a vector-valued component
Fact: Chain rule
Let
\(A\) denote an open set in \(\mathbb{R}^N\)
\(B\) denote an open set in \(\mathbb{R}^K\)
\(f \colon A \to B\) be differentiable at \(x \in A\)
\(g \colon B \to \mathbb{R}^L\) be differentiable at \(f(x) \in B\)
Then the composition \(g \circ f\) is differentiable at \(x\) and its Jacobian is given by

\[
D(g \circ f)(x) = Dg\big(f(x)\big) \, Df(x)
\]
Observe that
\(Dg\big(f(x)\big)\) is an \((L \times K)\) matrix
\(Df(x)\) is a \((K \times N)\) matrix
therefore the matrices are conformable for multiplication
the result is an \((L \times N)\) matrix, as expected
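A hedged numerical check of the chain rule (illustrative \(f \colon \mathbb{R}^2 \to \mathbb{R}^2\) and \(g \colon \mathbb{R}^2 \to \mathbb{R}\); the Jacobians are written by hand): the product \(Dg(f(x))\,Df(x)\) matches the finite-difference Jacobian of the composition.

```python
import numpy as np

# illustrative inner and outer functions with hand-written Jacobians
f  = lambda v: np.array([v[0] + v[1], v[0] * v[1]])
Df = lambda v: np.array([[1.0, 1.0], [v[1], v[0]]])       # (2 x 2)

g  = lambda w: w[0] ** 2 + np.sin(w[1])
Dg = lambda w: np.array([[2 * w[0], np.cos(w[1])]])       # (1 x 2)

x = np.array([0.5, 1.5])

chain = Dg(f(x)) @ Df(x)                                  # chain rule: (1 x 2) matrix

# finite-difference Jacobian of the composition for comparison
comp = lambda v: g(f(v))
h = 1e-6
numeric = np.array([[(comp(x + h * e) - comp(x)) / h for e in np.eye(2)]])

print(chain, numeric)                                     # should agree closely
```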
Example
Applying the chain rule
Now we can evaluate \(D(g \circ f)(x)\) at particular points in \(\mathbb{R}^3\)
Example
Exercise: Using the chain rule, find the total derivative of the composition \(g \circ f\).
The chain rule is a very powerful tool for computing complex derivatives (think of backpropagation in deep neural networks)
Higher order derivatives#
The higher order derivatives of a function in one dimension generalize naturally to the multivariate case, but the complexity of the needed matrices and multidimensional arrays (tensors) grows fast.
Definition
Let \(A\) denote an open set in \(\mathbb{R}^N\), and let \(f \colon A \to \mathbb{R}\). Assume that \(f\) is twice differentiable at \(x \in A\).
The total derivative of the gradient \(\nabla f(x)\) of the function \(f\) at the point \(x\) is called the Hessian matrix of \(f\), denoted by \(Hf\) or \(\nabla^2 f\), and is given by the \(N \times N\) matrix

\[
\nabla^2 f(x) =
\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1 \partial x_1}(x) & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_N}(x) \\
\vdots & & \vdots \\
\frac{\partial^2 f}{\partial x_N \partial x_1}(x) & \dots & \frac{\partial^2 f}{\partial x_N \partial x_N}(x)
\end{pmatrix}
\]
Fact (Young's theorem)
For every \(x \in A \subset \mathbb{R}^N\) where \(A\) is open and \(f \colon A \to \mathbb{R}\) is twice continuously differentiable, the Hessian matrix \(\nabla^2 f(x)\) is symmetric
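Young's theorem can be illustrated numerically: the finite-difference Hessian of a smooth function is (approximately) symmetric. The function below is an illustrative assumption, not the lecture's example.

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Finite-difference Hessian of a scalar-valued f at the point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = h * np.eye(n)[i], h * np.eye(n)[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h ** 2
    return H

# illustrative smooth function f(x1, x2, x3) = x1^2 * x2 + exp(x1 * x3)
f = lambda v: v[0] ** 2 * v[1] + np.exp(v[0] * v[2])
H = hessian(f, np.array([0.5, 1.0, -1.0]))

print(np.max(np.abs(H - H.T)))   # close to zero: mixed partials coincide
```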
Example
Exercise: check
Example
Exercise: check
For the vector-valued function \(f \colon \mathbb{R}^N \to \mathbb{R}^K\) the first derivative is given by a \((K \times N)\) Jacobian matrix, and the second derivative is given by a \((K \times N \times N)\) tensor, containing \(K\) Hessian matrices, one for each component \(f_j\). (This course only mentions this fact without going further.)
Multivariate Taylor series#
Consider a function \(f\colon A \subset\mathbb{R}^N \to \mathbb{R}\) differentiable at \(x \in A\)
Introduce multi-index notation: given a vector of non-negative integers \(\alpha \in \mathbb{N}_0^N\) and a vector \(x \in \mathbb{R}^N\)
\(|\alpha| \equiv \alpha_1 + \alpha_2 + \dots + \alpha_N\)
\(\alpha! \equiv \alpha_1! \cdot \alpha_2! \cdot \dots \cdot \alpha_N!\)
\(x^{\alpha} \equiv x_1^{\alpha_1} x_2^{\alpha_2} \dots x_N^{\alpha_N}\)
Fact (Clairaut theorem)
If all the \(k\)-th order partial derivatives of \(f\) exist and are continuous in a neighborhood of \(x\), then the order of differentiation does not matter, and the mixed partial derivatives are equal
Therefore the following notation for partial derivatives of order up to \(k\) is well defined

\[
\partial^\alpha f \equiv \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \partial x_2^{\alpha_2} \dots \partial x_N^{\alpha_N}}, \quad |\alpha| \le k
\]
In these circumstances (when the Clairaut/Young theorem applies for all orders \(1, 2, \dots, k\)) we say that \(f\) is \(k\)-times differentiable at \(x\)
Fact
Let \(f\colon A \subset\mathbb{R}^N \to \mathbb{R}\) be a \(k\)-times continuously differentiable function at a point \(a \in \mathbb{R}^N\). Then there exist functions \(h_\alpha \colon \mathbb{R}^N \to \mathbb{R}\), where \(|\alpha|=k\), such that

\[
f(x) = \sum_{|\alpha| \le k} \frac{\partial^\alpha f(a)}{\alpha!} (x-a)^\alpha + \sum_{|\alpha| = k} h_\alpha(x) (x-a)^\alpha
\]

where \(\lim_{x \to a}h_\alpha(x) = 0.\)
Example
The second-order Taylor polynomial of a smooth function \(f:\mathbb R^N\to\mathbb R\) around the point \(\boldsymbol{a}\) is

\[
f(\boldsymbol{x}) \approx f(\boldsymbol{a}) + \nabla f(\boldsymbol{a}) \cdot (\boldsymbol{x}-\boldsymbol{a}) + \frac{1}{2} (\boldsymbol{x}-\boldsymbol{a})^T \nabla^2 f(\boldsymbol{a}) (\boldsymbol{x}-\boldsymbol{a})
\]

Let \(\boldsymbol{x}-\boldsymbol{a}=\boldsymbol{v}\), then

\[
f(\boldsymbol{a}+\boldsymbol{v}) \approx f(\boldsymbol{a}) + \nabla f(\boldsymbol{a}) \cdot \boldsymbol{v} + \frac{1}{2} \boldsymbol{v}^T \nabla^2 f(\boldsymbol{a}) \boldsymbol{v}
\]
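A hedged numerical check of the second-order polynomial (with an illustrative \(f(x,y) = e^x \cos y\) expanded around \(\boldsymbol{a} = (0,0)\); the gradient and Hessian are written out by hand):

```python
import numpy as np

# illustrative f(x, y) = exp(x) * cos(y) with hand-written gradient and Hessian
f    = lambda v: np.exp(v[0]) * np.cos(v[1])
grad = lambda v: np.array([np.exp(v[0]) * np.cos(v[1]), -np.exp(v[0]) * np.sin(v[1])])
hess = lambda v: np.array([[ np.exp(v[0]) * np.cos(v[1]), -np.exp(v[0]) * np.sin(v[1])],
                           [-np.exp(v[0]) * np.sin(v[1]), -np.exp(v[0]) * np.cos(v[1])]])

a = np.array([0.0, 0.0])
v = np.array([0.1, -0.2])                             # v = x - a

t2 = f(a) + grad(a) @ v + 0.5 * v @ hess(a) @ v       # second-order Taylor polynomial
print(abs(f(a + v) - t2))                             # error is of order ||v||^3
```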
Example
The third-order Taylor polynomial of a smooth function \(f:\mathbb R^2\to\mathbb R\)
Let \(\boldsymbol{x}-\boldsymbol{a}=\boldsymbol{v}\)
Source: Wiki
Notes from the lecture#
References and further reading#
References
[Simon and Blume, 1994]: Chapter 8