notes on the essence of linear algebra

The almost genius playlist from 3blue1brown.

vector

$v \in R^{n}$ has a mathematically rigorous definition, but for computer science, treat it as a column or a row of numbers. Each number represents a scalar along a dimension, ordered by your basis.

So in $R^{2}$, $\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2 \hat{i} + 3 \hat{j}$.

The origin is fixed at $(0, 0)$, of course, and this holds for any coordinate system as long as we do not translate the origin.

basis

What is the direction and scale of “1 unit”, as I choose to define it? Default is just to use orthogonal $\hat{i}$ , $\hat{j}$ , and $\hat{k}$ in the default grid sense.

linearity

addition

$\begin{bmatrix} a \\ b \end{bmatrix} = a \hat{i} + b \hat{j}$ means $a$ steps in the $\hat{i}$ direction and then $b$ steps in the $\hat{j}$ direction.

$c = a + b \Rightarrow c_{x} = a_{x} + b_{x}, c_{y} = a_{y} + b_{y}$ .

scalar

$v \mapsto k v$ means multiply the length by $|k|$ along the same line, flipping the direction if $k < 0$.

These two conditions make linearity. Formally, for a transformation $L$ :

$L (x + y) = L (x) + L (y)$
$L (k x) = k L (x)$

A visual test is that an evenly spaced line of points remains evenly spaced after $L$ .

Note that the origin remains fixed: $L (0) = 0$.
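A minimal sketch of checking the two conditions numerically. The helper names (`apply`, `add`, `scale`, `translate`) and the example matrix are made up for illustration; a shift is included as a counterexample because it moves the origin.

```python
# Hypothetical helpers for 2D vectors stored as tuples.
def apply(A, v):
    # 2x2 matrix (list of rows) times a 2D vector
    return (A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1])

def add(x, y):
    return (x[0] + y[0], x[1] + y[1])

def scale(k, x):
    return (k * x[0], k * x[1])

A = [[2, 1],
     [0, 3]]              # arbitrary example matrix
x, y, k = (1, 2), (3, -1), 4

additive = apply(A, add(x, y)) == add(apply(A, x), apply(A, y))
homogeneous = apply(A, scale(k, x)) == scale(k, apply(A, x))
origin_fixed = apply(A, (0, 0)) == (0, 0)

# A translation is not linear: it moves the origin.
def translate(v):
    return (v[0] + 1, v[1])

translation_additive = translate(add(x, y)) == add(translate(x), translate(y))
print(additive, homogeneous, origin_fixed, translation_additive)
```

Passing for one choice of `x`, `y`, `k` does not prove linearity; it only fails to find a counterexample.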

span

Defined in terms of $v$ and $w$ .

$\operatorname{span} (v, w) = \{a v + b w \mid a, b \in R\}$.

What surface do the possible ends draw?

matrices

Say I want to define a transformation $L$ on $\hat{i}$ and $\hat{j}$ :

$A = \begin{bmatrix} L (\hat{i}) & L (\hat{j}) \end{bmatrix}$

The columns represent the new positions of $\hat{i}$ and $\hat{j}$ after the transform.

The effect can be a shear, scale, rotation, etc.

Where does my original vector $v$ end up? $A v$ is the final position of its tip.

You can think like this: if my $\hat{i}$ is now twice as long, I move double the distance along it. If my $\hat{j}$ is rotated, my steps along $\hat{j}$ rotate with it.

What is interesting is:

$v = \begin{bmatrix} a \\ b \end{bmatrix} = a \hat{i} + b \hat{j}$.

After the transform, the same scalars still hold:

$A v = a L (\hat{i}) + b L (\hat{j})$ .

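For example, if $A = \begin{bmatrix} 2 & 4 \\ 3 & 5 \end{bmatrix}$, then $A v = a \begin{bmatrix} 2 \\ 3 \end{bmatrix} + b \begin{bmatrix} 4 \\ 5 \end{bmatrix}$.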

And if you forget “vectors” and treat it like matrix multiplication, the effect is constructed to be the same.
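A small sketch comparing the two readings of $A v$, using the example matrix whose columns are $(2, 3)$ and $(4, 5)$. The values of `a` and `b` are arbitrary picks for illustration.

```python
A = [[2, 4],
     [3, 5]]              # columns are L(i-hat) = (2, 3) and L(j-hat) = (4, 5)
a, b = 1, 2               # arbitrary coordinates of v

# Reading 1: a copies of the first column plus b copies of the second.
col_i = (A[0][0], A[1][0])
col_j = (A[0][1], A[1][1])
by_columns = (a * col_i[0] + b * col_j[0],
              a * col_i[1] + b * col_j[1])

# Reading 2: the usual row-times-vector rule of matrix multiplication.
by_rows = (A[0][0] * a + A[0][1] * b,
           A[1][0] * a + A[1][1] * b)

print(by_columns, by_rows)   # both are (10, 13)
```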

linear independence

Defined for a set of vectors.

You can ask: does it add anything to the span? If the span remains the same even if I remove it, then it can be constructed by some linear combination of the remaining vectors.

That is, $v \in \operatorname{span} (S \setminus \{v\})$, so it is linearly dependent.

basis

Defined for a space, say $R^{2}$ or $R^{n}$ .

A basis is a linearly independent set of vectors that spans the space. It is not unique; many different sets qualify.

dimension collapse

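Say I use a transformation $A : R^{2} \to R^{2}$ where $A \hat{j}$ is linearly dependent on $A \hat{i}$.

A trivial one sends $\hat{j}$ to the same place as $\hat{i}$:

$A = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}$, $A \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x + y \\ 0 \end{bmatrix}$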

So the second dimension is gone: for every input, $A (x, y) \in span (\hat{i})$ .

The span, which for a basis was a plane, is just a line after transformation. Note that this is not a basis change; we still describe scalars in terms of $\hat{i}$ and $\hat{j}$ .

Think of it as one-way loss of information. I cannot uniquely go back from something on the line to what $(x, y)$ was before the transformation. Just like $x \cdot 0 = 0$ , I have no way to know what $x$ could have been, except “anything”.
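To see the information loss concretely, here is a sketch that pushes several different inputs through the collapsing matrix above. The input list is an arbitrary choice; every input has $x + y = 3$.

```python
def apply(A, v):
    # 2x2 matrix (list of rows) times a 2D vector
    return (A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1])

A = [[1, 1],
     [0, 0]]              # both basis vectors land on (1, 0)

inputs = [(3, 0), (0, 3), (1, 2), (5, -2)]
outputs = [apply(A, v) for v in inputs]
print(outputs)            # every input lands on (3, 0)
```

Given only `(3, 0)`, there is no way to tell which of the inputs produced it.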

compositions

A transformation can be a scale, shear, rotation, reflection, projection, or a composition.

In general, $L_{1} L_{2} \neq L_{2} L_{1}$.

To read it, go right to left: $L_{1} L_{2} (v) = L_{1} (L_{2} (v))$ .
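A quick sketch with a 90 degree rotation and a shear, both picked only as examples, showing that order matters and that the product is read right to left.

```python
def matmul(X, Y):
    # product of two 2x2 matrices given as lists of rows
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1])

R = [[0, -1],
     [1, 0]]              # rotate 90 degrees counterclockwise
S = [[1, 1],
     [0, 1]]              # shear

RS = matmul(R, S)         # shear first, then rotate
SR = matmul(S, R)         # rotate first, then shear

v = (1, 1)
same_as_nested = apply(RS, v) == apply(R, apply(S, v))
print(RS, SR, same_as_nested)
```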

determinant

Under a transformation $L$, area in 2D, or volume in higher dimensions, scales by $| \det (L) |$.

Negative determinant implies orientation flip.

Think of a dimension collapse from some area in 2D to a line with 0 area. This is what it means to have a collapse when $det (L) = 0$ .

For higher dimensions, collapse can be in multiple dimensions. Say $\hat{j}$ and $\hat{k}$ collapse to 0, so you have a line as span, or all collapse to 0, so just a point remains as span.

Now if you think of it as area or volume scaling, then clearly:

$det (A) det (B) = det (A B)$ .

Visually, applying $B$ and then $A$ is the same as applying the composition $A B$, so the two area scalings multiply.
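A minimal numeric check of the product rule, plus one flip and one collapse. All four matrices are arbitrary examples.

```python
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 1],
     [0, 3]]
B = [[1, 2],
     [3, 4]]
flip = [[0, 1],
        [1, 0]]           # swaps i-hat and j-hat, so orientation flips
collapse = [[1, 1],
            [0, 0]]       # squashes the plane onto a line

print(det2(A), det2(B), det2(matmul(A, B)))   # 6, -2, -12
print(det2(flip), det2(collapse))             # -1, 0
```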

linear system of equations

The good case has $n$ equations and $n$ variables. Say we have 2:

$A x = b$ .

Think of $x$ as the vector that I want to find. Then $A$ is a transformation defined by the constants, and $b$ is the resultant vector.

So a linear system asks: which vector $x$ lands on $b$ after the transformation $A$ is applied?

The inverse reverses the $A$ transform:

$x = A^{- 1} b$ .

Except, if $A$ collapses everything to a lower dimension, $det (A) = 0$ , then we can never say for sure what the original was.

In that case, if $b \in Col (A)$, there are infinitely many solutions. If $b \notin Col (A)$, there is no solution.
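A sketch of the good case using the explicit $2 \times 2$ inverse. `solve_2x2` is a hypothetical helper, the numbers are made up, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

def solve_2x2(A, b):
    (p, q), (r, s) = A
    det = p * s - q * r
    if det == 0:
        return None       # A collapses the plane, so there is no unique answer
    # Inverse of [[p, q], [r, s]] is (1 / det) * [[s, -q], [-r, p]].
    inv = [[Fraction(s, det), Fraction(-q, det)],
           [Fraction(-r, det), Fraction(p, det)]]
    return (inv[0][0] * b[0] + inv[0][1] * b[1],
            inv[1][0] * b[0] + inv[1][1] * b[1])

x = solve_2x2([[2, 1], [1, 3]], (3, 5))
collapsed = solve_2x2([[1, 1], [0, 0]], (3, 0))
print(x, collapsed)       # x is (4/5, 7/5); the collapsing matrix gives None
```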

rank

$rank (A) = dim (Col (A))$ .

If the transform collapses dimensions, rank decreases.

Say rank decreases by $Δ$. If $Δ = 2$, then a whole plane of inputs has collapsed to the origin.

column space

Of a transform $A$ , the column space is:

$Col (A) = \operatorname{span} \{\text{columns of } A\}$.

There must be some redundancy / linear dependence in the columns to collapse dimensions.

null space/kernel

Say dimensions do collapse. Then the set of vectors that got collapsed to 0 is the null space:

$Null (A) = \{v \mid A v = 0\}$.
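A small sketch for the $2 \times 2$ case, reusing the collapsing matrix from the dimension collapse example. `rank2` is a hypothetical helper that only works for $2 \times 2$ matrices, and the direction $(1, -1)$ was found by hand from $x + y = 0$.

```python
def apply(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1])

def rank2(A):
    # rank of a 2x2 matrix: 2 if invertible, 0 if all zeros, otherwise 1
    (a, b), (c, d) = A
    if a * d - b * c != 0:
        return 2
    return 1 if any((a, b, c, d)) else 0

A = [[1, 1],
     [0, 0]]

null_direction = (1, -1)
images = [apply(A, (k * null_direction[0], k * null_direction[1]))
          for k in range(-2, 3)]
collapsed_dims = 2 - rank2(A)
print(images, rank2(A), collapsed_dims)
```

Every multiple of $(1, -1)$ lands on the origin, so the null space here is a line, matching the one dimension that was lost.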

non-square matrices

It is very much possible to apply a transformation that changes dimensions:

$A : R^{n} \to R^{m}$ .

I can define a linear transform $P$ as projection onto the x-axis:

$P : R^{2} \to R$, $P \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = x$

Any higher-to-lower dimension transform cannot be reversed uniquely: $n > m \Rightarrow A : R^{n} \to R^{m}$ is not one-to-one.

What $A$ as a transform shows:

$A \in R^{m \times n} \Rightarrow \begin{cases} n = \text{dimension converting from} \\ m = \text{dimension converting to} \end{cases}$

dual

A dual vector is a linear transform to a scalar:

$f : R^{n} \to R$ .

For a column vector in $R^{n}$ , the dual is a row of $n$ elements. The operation is just matrix multiplication to a $1 \times 1$ scalar:

$\begin{bmatrix} a_{1} & a_{2} & \dots & a_{n} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} = \sum_{i = 1}^{n} a_{i} x_{i}$

The idea of a dual is that it lets you define, for say 2D, $f (x, y) = a x + b y$ , which can extract info from the vector as a linear map.

Take dot product with another vector $w = (a, b)$ . Then $f_{w} (x, y) = a x + b y$ . The transform is tied to $w$ here; it gives a signed scalar measurement along $w$ .
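One way to picture $f_{w}$ in code is as a function built from $w$. This is only a sketch; `dual` is a made-up name and the vectors are arbitrary.

```python
def dual(w):
    # Returns the linear map f_w that measures a vector against w.
    return lambda v: sum(wi * vi for wi, vi in zip(w, v))

f_w = dual((3, 4))
v, u, k = (1, 2), (5, -1), 7

value = f_w(v)                                        # 3*1 + 4*2
v_plus_u = (v[0] + u[0], v[1] + u[1])
additive = f_w(v_plus_u) == f_w(v) + f_w(u)
homogeneous = f_w((k * v[0], k * v[1])) == k * f_w(v)
print(value, additive, homogeneous)
```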

I feel the idea is pretty redundant

dot product

For $v, w$ , the dot product is:

$v \cdot w = \| v \| \| w \| \cos θ$.

Here $θ$ is the angle between $v$ and $w$, and the result is a scalar.

Now $v \cdot w = w \cdot v$ .

Why? Assume both are the same length. You can make a line bisecting the angle between them. In such a scenario, both $v$ and $w$ are interchangeable.

Now if $v$ and $w$ have different lengths, consider $v = k v^{'}$ , such that $v^{'}$ and $w$ are the same length as above.

You get the same symmetry, with the extra factor $k$ put back. The idea visually explains dot product commutativity.
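A numeric sanity check, as a sketch: the coordinate formula (sum of products) and the geometric formula agree, and swapping the arguments changes nothing. The two vectors are arbitrary, and the angle is measured with `atan2`.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def norm(v):
    return math.sqrt(dot(v, v))

v, w = (3.0, 0.0), (2.0, 2.0)

# Angle between the two vectors, from their individual angles to the x-axis.
theta = math.atan2(w[1], w[0]) - math.atan2(v[1], v[0])

algebraic = dot(v, w)
geometric = norm(v) * norm(w) * math.cos(theta)
print(algebraic, geometric, dot(w, v))
```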

Now you can use the above dual idea to define dot as well.

Instead of dot of 2 vectors, define dot as a dual tied to one vector and applied to the other:

$f_{w} (v) = w^{\top} v$.

again a bit redundant idea unless we build on top

cross product

Geometrically, $\| v \times w \|$ is the area of the parallelogram enclosed by $v$ and $w$.

The cross product itself, though, is defined as a vector.

So technically the cross product is not defined in 2 dimensions as a vector in the same space: the result needs to be perpendicular to both inputs, and 2 dimensions has no room for that.

In 3D, it gives a perpendicular vector with magnitude equal to the area. Which of the two perpendicular directions it points in is a convention (the right-hand rule).

some intuition on the calculations

You can think of $x \times y$ as giving an area-oriented vector.

If you dot another vector $z$ with it, it gives the volume of the 3 vectors $x, y, z$ :

$z \cdot (x \times y) = det (z, x, y)$ .

Now say I want to find that perpendicular vector and I have $x$ and $y$ .

I can define a 3D to 1D transform as:

$f (z) = det (z, x, y)$ .

This determinant is exactly the dot product of $z$ with the cross product $p$, sign included:

$f (z) = z \cdot p$ , where $p = x \times y$ .

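Now expand the determinant along the entries of $z$ and compare coefficients with $z \cdot p = z_{1} p_{1} + z_{2} p_{2} + z_{3} p_{3}$: the coefficient of each $z_{i}$ is the component $p_{i}$, which gives the actual cross product.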
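Comparing coefficients amounts to evaluating $f$ on $\hat{i}$, $\hat{j}$, $\hat{k}$ in turn. A sketch of that idea, with `det3` and `cross` as hypothetical helpers and made-up vectors:

```python
def det3(a, b, c):
    # determinant of the 3x3 matrix whose columns are a, b, c
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def cross(x, y):
    # the usual component formula for the cross product
    return (x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0])

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

x, y = (1, 2, 3), (4, 5, 6)

def f(z):
    return det3(z, x, y)

# Feeding in the basis vectors picks out one coefficient of f at a time.
p = (f((1, 0, 0)), f((0, 1, 0)), f((0, 0, 1)))

z = (7, 8, 10)
print(p, cross(x, y), f(z), dot(z, p))
```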

cramer’s rule

the good case

Say we have 2 equations and 2 unknowns:

$A x = b$ .

$A$ is a $2 \times 2$ transform.

Geometrically it is: which $x$ does the transform $A$ send to $b$?

Now you can take an inverse and so on.

But there is a different argument too.

Initially I have $x = \begin{bmatrix} x \\ y \end{bmatrix}$.

Say I want to find the x-coordinate.

Signed area of the parallelogram spanned by $x$ and $\hat{j}$:

$det (x, \hat{j}) = x$ .

Area after transform: $\hat{j} \mapsto A \hat{j}$ , and $x \mapsto b$ , so area is now $det (b, A \hat{j})$ .

But this area must have been scaled by $det (A)$ :

$x det (A) = det (b, A \hat{j})$ .

So:

$x = \frac{\det (b, A \hat{j})}{\det (A)}$.
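A sketch of the formula on a made-up system. The $y$ coordinate is not derived above; it is assumed here to follow from the same argument with the area of $\hat{i}$ and $x$, which gives $y = \det (A \hat{i}, b) / \det (A)$.

```python
from fractions import Fraction

def det2(c1, c2):
    # determinant of the 2x2 matrix whose columns are c1 and c2
    return c1[0] * c2[1] - c2[0] * c1[1]

A_i, A_j = (2, 1), (1, 3)     # columns of A, i.e. where i-hat and j-hat land
b = (3, 5)

det_A = det2(A_i, A_j)
x = Fraction(det2(b, A_j), det_A)     # replace the first column with b
y = Fraction(det2(A_i, b), det_A)     # assumed symmetric formula for y

# Check: x copies of A i-hat plus y copies of A j-hat should rebuild b.
rebuilt = (x * A_i[0] + y * A_j[0], x * A_i[1] + y * A_j[1])
print(x, y, rebuilt)
```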

change of basis

When we apply a transform $A$ on $x$ that doubles $\hat{i}$, the resulting vector has twice the x-coordinate, because our basis is still the same.

The basis is what we describe a vector's coordinates relative to.

Had I scaled my basis along with it, $\hat{i}^{'} = 2 \hat{i}$, the coordinates of the vector would have stayed the same.

A vector is just scalars saying move $a$ units in basis 1 direction, then $b$ units in basis 2 direction, and so on:

$a b_{1} + b b_{2} + \dots$ .

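Denote $B$ as the change-of-basis matrix. For a transform $T$, the order reads a bit in reverse: convert my vector into their basis, apply $T$, then convert back.

If $x_{\text{theirs}} = B x_{\text{mine}}$ and $T$ is defined in their basis, then:

$x_{\text{mine}}^{'} = B^{-1} T B x_{\text{mine}}$.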

So the effective transform in mine is:

$B^{-1} T B$.
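A worked sketch under one assumed setup: their basis vectors, written in my coordinates, are $(2, 0)$ and $(0, 1)$, and $T$ is a 90 degree rotation in their coordinates. The columns of $B^{-1}$ are then their basis vectors, since $B^{-1}$ converts their coordinates into mine.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(M):
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def apply(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1])

B_inv = [[2, 0],
         [0, 1]]          # columns: their basis vectors in my coordinates
B = inverse(B_inv)        # converts my coordinates into theirs
T = [[0, -1],
     [1, 0]]              # rotation, expressed in their coordinates

effective = matmul(matmul(B_inv, T), B)   # B^{-1} T B

x_mine = (2, 0)           # this is (1, 0) in their coordinates
result = apply(effective, x_mine)
print(effective, result)
```

In their coordinates $(1, 0)$ rotates to $(0, 1)$, which is $(0, 1)$ in mine as well, and that is what `result` holds.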

eigenvectors and eigenvalues

Under any transform $A$ , there may exist certain vectors that only change in scale:

$A v = λ v$ .

Such vectors are called eigenvectors, and the scale change is the eigenvalue. So eigenvectors and eigenvalues are closely paired.

Move everything to one side:

$A v - λ I v = 0$ .

So:

$(A - λ I) v = 0$ .

For this to have a nonzero solution, the determinant must be 0:

$det (A - λ I) = 0$ .

Once you get $λ$ , substitute and solve $(A - λ I) v = 0$ .

Exclude $v = 0$ .

The span of the eigenvectors is the eigenspan. If that eigenspan also spans the current space, you can choose an eigenbasis from it.

Under an eigenbasis, each basis vector is simply scaled by its own eigenvalue, independently of the others.

So if an eigenbasis exists, the $A$ transform written in that basis is just a diagonal matrix with the eigenvalues on the diagonal.

This is called diagonalisation.

Say I want to calculate $A^{100}$ now. Under eigenbasis, that is cheap:

$D^{100} = \begin{bmatrix} λ_{1}^{100} & 0 \\ 0 & λ_{2}^{100} \end{bmatrix}$.

You can combine that with change of basis:

$A^{100} = B^{-1} D^{100} B$, where $B$ converts my coordinates into eigenbasis coordinates.
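A sketch of the whole pipeline for one hand-picked symmetric $2 \times 2$ matrix. It assumes the discriminant of the characteristic polynomial is a perfect square and that the top-right entry is nonzero, which holds for this example but not in general.

```python
from fractions import Fraction
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(M):
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 1],
     [1, 2]]
(a, b), (c, d) = A

# det(A - lambda I) = lambda^2 - trace * lambda + det = 0
trace, det = a + d, a * d - b * c
root = math.isqrt(trace * trace - 4 * det)    # assumes a perfect square
lam1, lam2 = Fraction(trace + root, 2), Fraction(trace - root, 2)

# For b != 0, v = (b, lambda - a) solves (A - lambda I) v = 0.
P = [[b, b],
     [lam1 - a, lam2 - a]]    # columns are eigenvectors; P plays the role of B^{-1}
B = inverse(P)

n = 100
D_n = [[lam1 ** n, 0],
       [0, lam2 ** n]]
A_n = matmul(matmul(P, D_n), B)               # B^{-1} D^n B

# Slow reference: multiply A by itself n times.
naive = [[1, 0], [0, 1]]
for _ in range(n):
    naive = matmul(naive, A)

print(lam1, lam2, A_n == naive)
```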

abstracting out the geometry

You can think of “axioms” as an interface. I define some constraints, here linearity:

$L (x + y) = L (x) + L (y)$ and $L (k x) = k L (x)$ .

If whatever you have follows that, you can do all sorts of things on that.

For example, polynomials and derivatives can fit into the same linearity framework.

It will not be possible to visualise polynomials just as easily as lines and planes, but everything still holds. This is abstracting out the geometry.
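As a sketch of that idea, a polynomial of degree at most 3 can be stored as its coefficient list, and the derivative becomes a matrix acting on that list. The choice of degree and the names `D` and `apply` are just for illustration.

```python
# coeffs[i] is the coefficient of x**i, so [1, 2, 3, 4] is 1 + 2x + 3x^2 + 4x^3.
N = 4

# Derivative as a matrix: the x**j term contributes j times its coefficient
# to the x**(j - 1) slot.
D = [[j if j == i + 1 else 0 for j in range(N)] for i in range(N)]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(N)) for i in range(N)]

p = [1, 2, 3, 4]
q = [0, 5, 0, -1]
k = 3

dp = apply(D, p)          # 2 + 6x + 12x^2

p_plus_q = [pi + qi for pi, qi in zip(p, q)]
additive = apply(D, p_plus_q) == [s + t for s, t in zip(apply(D, p), apply(D, q))]
homogeneous = apply(D, [k * pi for pi in p]) == [k * s for s in apply(D, p)]
print(dp, additive, homogeneous)
```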
