The Grand Tour
Neural networks are now the default tool for supervised learning problems, sometimes matching human performance on a number of tasks.
Unfortunately, their decision process is notoriously hard to interpret.
To understand a neural network, we often try to observe its action on input examples.
These kinds of visualizations are useful to elucidate the activation patterns of a neural network for a single example, but they offer little insight about the relationship between different examples, different states of the network as it's being trained, or the same example as it is transformed by the different layers of a single network.
Therefore, we should instead aim to visualize the context around our examples of interest. This context is precisely the structure that linear methods preserve well.
To fully understand the neural network process, it is not enough to "open the black box".
We need a visualization that opens the box, but crucially, does not destroy the patterns we are hoping to find.
As a concrete example of how visualizations can fail, consider the following scenario.
We trained deep CNN models on three common classification datasets: MNIST, fashion-MNIST, and CIFAR-10. Our architecture is simpler than the state of the art, but complex enough to demonstrate the power of the Grand Tour. When training neural networks, we often monitor the training process by plotting the loss over training epochs. Taking the training of an MNIST classifier as an example, if we plot the testing loss:
We see that twice, before the loss converges, the curve goes flat and then drops (around epochs 14 and 21). What happened? We might simply guess that the optimization moved through two local basins, but what caused that?
If we take one step further by plotting the per-class loss:
we learn that the two drops are caused by two particular classes (digits 1 and 7), whose loss curves drop later than those of the other classes. The model learns different classes at very different times in the training process. Although the network learns to recognize digits 0, 2, 3, 4, 5, 6, 8, and 9 early on, it is not until epoch 14 that it starts successfully recognizing digit 1, and not until epoch 21 that it recognizes digit 7.
Perhaps the next question is: was this late learning caused by a subset of input examples, or by the whole class of digit 1 or 7? To explore this, we need to break the per-class losses down to individual input examples. But instead of looking at losses, we can directly look at the softmax activations. The softmax layer has 10 neurons that can be read as predicted probabilities, one for each class. We expect, for example, the 7th (0-indexed) neuron to fire mostly for handwritten sevens. In fact, since we have access not only to the softmax but to all hidden layers, we can probe the network with input examples and see how it reacts at each layer. In this way we try to open the "black box" and understand the internals of a neural network. Here is a typical view of the opened box:
Even though neural networks are capable of incredible feats of classification, deep down they really are just pipelines of relatively simple functions. In our classifier, the pipeline has convolutional and fully-connected linear layers, and rectified linear unit (ReLU) non-linear layers. Passing an input image through this pipeline transforms it into different activation patterns in the hidden layers, ending with the softmax activations. Most of these simple functions fall into two categories: they are either linear transformations of their inputs (like fully-connected layers or convolutional layers), or relatively simple non-linear functions that work component-wise (like sigmoid or ReLU activations). (Some operations, notably max-pooling and softmax, fall into neither category; we will come back to this later.) We will later see that this categorization is essential when we try to untangle the complex internals of the neural network. Decomposing the whole pipeline into simpler functions gives details about what happens inside, in particular how neurons activate in different layers. However, in the above view we look at examples one at a time, and we lose the full picture: what does the network do to all examples?
The need is simple: we want to look at the neuron activations (e.g. in the softmax layer) of all examples at once. If there were only two neurons/dimensions, a scatter plot would suffice. However, the data points in the softmax layer are 10-dimensional, so we need to either show two dimensions at a time, which does not scale well (there would be 45 scatter plots to look at, a number quadratic in the dimensionality), or use dimensionality reduction to map the data into a two-dimensional space and show it in a single plot.
Consider the aforementioned intriguing feature, the different times at which the MNIST classifier learns digits 1 and 7: the network did not learn to recognize digit 1 until epoch 14, or digit 7 until epoch 21.
With that in mind, look for changes in digit 1 around epoch 14 and changes in digit 7 around epoch 21 in the visualizations below.
The network behavior is not subtle: every digit in those classes is misclassified until the critical epoch.
Yet, note how hard it is to spot this in any of the plots below at first glance.
In fact, if inspected carefully, we can see that in the UMAP
The reason that non-linear embeddings fail to reveal this fundamental change in the data is that they do not fulfill the principle of data-visual correspondence
Non-linear embeddings that have non-convex objectives also tend to be sensitive to initial conditions.
For example, in MNIST, although the neural network starts to stabilize on epoch 30, t-SNE and UMAP still generate quite different projections between epochs 30 and 99.
But even with spatial regularization
As another example, consider the network's three-way confusion among sandals, sneakers, and ankle boots in fashion-MNIST. You can see it most directly in the linear projection (last row, which we found using the Grand Tour and direct manipulation), as the points from those three classes form a triangular frame. t-SNE, in contrast, incorrectly separates the class clusters (possibly because of an inappropriately-chosen hyperparameter). Although UMAP isolates the three classes successfully, the fundamental complexity of non-linear methods makes it hard to know whether this is a feature of the data or an artifact of the embedding algorithm. If, as in these cases, a salient change in a visualization is not due only to the data (e.g. it is also partially due to the embedding algorithm), this bad design can obscure interpretation: we may miss important patterns when reading a bad visualization.
When given the option, then, we should prefer methods for which changes in the data produce predictable, visually salient changes in the result. PCA is a natural way of doing linear dimensionality reduction, since it tries to preserve the most variance in the data. The distribution of data in the softmax layer, however, is essentially spherical, since activations of examples from different classes are orthogonal to each other. As a result, even though PCA projections are interpretable and consistent across training epochs, the first two principal components of the softmax activations tend not to be substantially better than the third. So which of them should we show to the user? To make matters worse, different principal components are orthogonal to each other, making it hard for the user to keep them all in their head at once. The Grand Tour, on the other hand, is a linear method that offers us the chance to see many projections, smoothly animated.
Combined with some user interactions, the Grand Tour lets us quickly find projections of interest that highlight important features in the data. In the above view you can drag the circular handle on each axis to move that axis.
To illustrate the late learning of digits 1 and 7 in the MNIST classifier, we found the following linear projection, which places the axes corresponding to digits 1 and 7 prominently on the canvas. We can clearly see that digits 1 and 7 are classified correctly starting from epochs 14 and 21 respectively, thanks to the data-visual correspondence of linear projections.
Now, let us take a closer look at the Grand Tour technique itself.
The Grand Tour
Mathematically, the Grand Tour provides us with a sequence of pD-to-2D projections that lets us project the p-dimensional data onto different 2D subspaces.
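One simple way to realize such a path of projections is to exponentiate a fixed skew-symmetric matrix. The numpy/scipy sketch below illustrates the idea under that assumption; it is a simplified stand-in for Asimov's torus-path construction, not our actual JavaScript implementation.

```python
import numpy as np
from scipy.linalg import expm

def grand_tour_projection(p, t, seed=0):
    """2D projection of a smooth rotation path at time t.

    A fixed random skew-symmetric generator A yields the rotation
    family R(t) = exp(t * A); the first two columns of R(t) give a
    pD-to-2D orthogonal projection.  (A simplified stand-in for the
    torus-path construction of the actual Grand Tour.)
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((p, p))
    A = (B - B.T) / 2.0          # skew-symmetric, so expm(t * A) is a rotation
    return expm(t * A)[:, :2]

# project 100 random 10-dimensional points at time t = 0.3
X = np.random.default_rng(1).standard_normal((100, 10))
P = grand_tour_projection(10, 0.3)
Y = X @ P                        # (100, 2) screen coordinates
```

As t varies continuously, the projection rotates smoothly, which is what produces the animated tour.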
This sequence/path of projection can be seen as a mapping from the time variable
We have to emphasize one favorable property of the Grand Tour for visualizing neural networks: it is invariant to the basis chosen for the data representation. Because the Grand Tour itself is a change of basis, every rigid rotation in data space has a corresponding Grand Tour matrix that results in the same view. At the same time, the linear transformations in neural networks are also invariant under a change of basis: every linear transformation can be uniquely decomposed into a three-step action, a rotation, a scaling, and another rotation. We will later show that this connection suggests a simplification of all linear transformations in the Grand Tour view.
The Grand Tour is a general technique for any high-dimensional data, so in principle we should be able to use it to view the neuron activations of any layer.
However, the final softmax layer is especially easy to understand because its axes have strong semantics - the
First, the Grand Tour of the softmax layer lets us qualitatively judge the performance of our model. In addition, since we used similar architectures on the three datasets, the performance also reflects the relative complexity of each dataset. The data points are most clustered for the MNIST dataset, where the digits sit close to one of the ten corners of the softmax space; for fashion-MNIST and CIFAR-10 the separation is not as extreme, and more points live inside the volume.
In fashion-MNIST, we can see how the model confuses classes. For example, the model confuses sandals, sneakers, and ankle boots, as the data points from those classes form a triangular frame in the softmax layer. In the figure below, drag any handle on the axes to change the projection:
Examples falling between the sandal and sneaker classes indicate that the model has low confidence in distinguishing the two; the same happens between the sneaker and ankle boot classes. But the model has less confusion between the sandal and ankle boot classes, as few examples fall between them. Moreover, only a few examples fall in the interior of the triangle; most live close to its boundary. This tells us that most confusions happen between two of the three classes: they are really two-way confusions. Within the same dataset, we can also see a triangular plane filled by pullovers, coats, and shirts. This is different from the sandal-sneaker-ankle-boot case, as examples fall not only on the boundary of the triangle but also in its interior: a true three-way confusion. Similarly, in the CIFAR-10 dataset we can see confusion between dogs and cats, and between airplanes and ships. The mixing pattern in CIFAR-10 is not as clear as in fashion-MNIST, because many examples are misclassified.
We will explain how our direct manipulation works in the technical details section.
We trained for 99 epochs and recorded the entire history of neuron activations on subsets of the training and testing examples. In the beginning, when the neural network is first randomly initialized, all examples are placed around the origin of the softmax space. Through training, examples are pushed to their correct "class corners" of the softmax space.
Comparing the behaviors on the training and testing datasets confirms our earlier hypothesis about dataset difficulty, made when we only looked at the final epoch. In the MNIST dataset, the trajectories of testing images through training are consistent with those of the training set: data points go directly toward the corner of their true class and stabilize after 50 epochs. In CIFAR-10, on the other hand, there is an inconsistency between the training and testing sets: images from the testing set keep oscillating, while most images from the training set converge to their class corners. This signals that the model overfits the training set and thus does not generalize well to the testing set.
Given the presented techniques, we can in principle inspect any intermediate layer of a neural network. However, two problems need to be solved first:
We address the first problem by allowing users to directly manipulate instances, with the argument that instances have better semantics than neurons (we describe this in the technical details section).
To tackle the second, instead of looking at layers side by side, we shall see the transformation as an action on a layer and look at how this action moves data points through animation.
One direct approach is to embed the lower-dimensional layer into the higher-dimensional one (by appending zeros) and then linearly interpolate the source
Note that here we make an explicit distinction between the data and its representation, to help us think about the true signal hidden in any specific representation. The order of neurons in a layer is an artifact of the specific model, or of a specific representation of the model, since models that merely permute the neurons should be thought of as the same model. When designing a visualization (vis), we should think about which parts of the data representation are artifacts we should factor out, and which contain the real knowledge we should distill from the representation. In this case, we want our vis to treat different orderings of the neurons/dimensions as equivalent, because the ordering is not meaningful. Conversely, when judging a vis, we should think about what it emphasizes about the data and what it discards. That is, we think about what the Grand Tour sees as equivalent, and whether that is aligned with our goals.
To see whether the Grand Tour is a desired vis in viewing a linear transformation, we take advantage of a central, amazing fact of linear algebra.
The Singular Value Decomposition (SVD) theorem shows that any linear transformation can be decomposed into a sequence of very simple operations: a rotation, a scaling, and another rotation.
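This is easy to verify numerically. The following numpy sketch (with an arbitrary random matrix standing in for a real layer's weights) checks the decomposition and the row-vector convention used later in this article:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))          # stand-in for a linear layer's weights

U, s, Vt = np.linalg.svd(M)              # M = U @ diag(s) @ Vt
# U and Vt are orthogonal (rotations, possibly with a reflection);
# diag(s) is a pure axis-aligned scaling
assert np.allclose(U @ U.T, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(4))
assert np.allclose(U @ np.diag(s) @ Vt, M)

# acting on a row vector x (the article's convention, x @ M):
# rotate by U, scale componentwise by s, rotate by Vt
x = rng.standard_normal(4)
assert np.allclose(x @ M, ((x @ U) * s) @ Vt)
```

The componentwise multiplication by `s` is exactly the "scaling" middle step of the decomposition.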
Applying a matrix
Since we have activations for all layers, to reduce the data size we cut the number of data points to 500 and the number of epochs to 50. Although fewer points and epochs are shown, we are now able to trace a behavior we observed in the softmax back to any layer. In fashion-MNIST, for example, we observe a separation of shoes (sandals, sneakers, and ankle boots as a group) from all other classes in the softmax layer. Tracing it back to earlier layers, we can see that this separation happens as early as layer 5:
Unlike in the softmax layer, there is a problem with our axis handles in high-dimensional hidden layers.
Although we have preprocessed the data (keeping up to 45 principal components via PCA) so that its dimensionality is manageable for the Grand Tour, the axis lines still form a hairball; more importantly, the axes of hidden layers do not carry as much meaning as those of the softmax.
Unlike the softmax or pre-softmax layers, where each axis represents a class label, in hidden layers one may need indirect ways (e.g. feature visualization
The Grand Tour can also elucidate adversarial examples
Let us examine how adversarial examples evolved to fool the network:
We start from 89 examples of digit 8 (black dots in the plot above) and gradually add adversarial noise to these images, using the fast gradient sign method
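The fast gradient sign method itself is simple: perturb the input by a small step in the direction of the sign of the loss gradient with respect to the input. As a hedged illustration, here is a toy logistic-regression version in numpy; the article's experiments use a CNN, and the weights `w`, input `x`, step size, and loop count below are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_step(x, y, w, eps):
    """One fast-gradient-sign step against a toy logistic model.

    With binary cross-entropy loss on p = sigmoid(x @ w), the input
    gradient is (p - y) * w; FGSM perturbs x by eps times its sign.
    """
    p = sigmoid(x @ w)
    grad = (p - y) * w                 # dLoss/dx
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])         # hypothetical model weights
x = np.array([0.2, 0.1, -0.3])         # hypothetical input, true label y = 1
x_adv = x
for _ in range(10):                    # gradually add adversarial noise
    x_adv = fgsm_step(x_adv, 1.0, w, eps=0.05)

# the model's confidence in the true class drops as noise accumulates
assert sigmoid(x_adv @ w) < sigmoid(x @ w)
```

Iterating small steps, as in the loop above, mirrors how we gradually add adversarial noise to the digit-8 images.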
What is more interesting is what happens in the intermediate layers. In the pre-softmax layer, for example, we see that these fake 0s behave differently from the genuine 0s: they live closer to the decision boundary between the two classes and form a plane of their own.
Previously, we compared the Grand Tour with other dimensionality reduction techniques, using small multiples of random projections to serve as key frames of the Grand Tour. The trade-off between small multiples and animations is an ongoing discussion in the visualization community. Here, we value the animation more than multiple static plots, for it eliminates a certain kind of misleading artifact and thus gives better data-visual correspondence. In a linear projection, points that form a straight line in the original space are guaranteed to form a straight line in the low-dimensional projection. However, since the mapping is from a high dimension to a lower one, there can be false positives: a straight line on screen may not correspond to a real straight line in the original space, as the line being shown may be an illusion caused by the lossy mapping. With static plots, this potential illusion can never be eliminated, since we only see one projection at a time. With animations, a viewer can visually infer and confirm the existence of a line in high-dimensional space by looking at multiple frames with different projections. The longer a line persists in the animated projection, the more likely it is indeed a straight line in the original space. The smoothness of the animation also gives the viewer an easier way to track points from view to view, compared to presenting the same information as a sequence of snapshots. Animation in this context therefore enables a form of visual hypothesis testing, with the null hypothesis being "the line I see is created merely by the projection, not by the data". As the figure in the comparison section may have convinced you, it is harder to recognize straight lines with high confidence in static plots than in the Grand Tour animation, not to mention more complicated structures such as curves and clusters.
In our work we have walked through models that are purely sequential. In modern architectures it is common to have non-sequential parts such as small ephemeral highway
Modern architectures are also wide. Especially where convolutional layers are concerned, one can run into scalability issues if such layers are treated as large sparse matrices acting on flattened multi-channel images.
For the sake of simplicity, in this article we brute-forced the computation of the alignment of such convolutional layers by writing out their explicit matrix representations.
However, the singular value decomposition of multi-channel 2D convolutions can be computed efficiently
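For intuition on why this is possible, here is a numpy sketch of the single-channel circular case: the doubly block circulant matrix of a 2D circular convolution is diagonalized by the 2D DFT, so its singular values are just the magnitudes of the kernel's 2D DFT (the multi-channel result reduces the problem to many small per-frequency SVDs). The image size and kernel here are arbitrary toy values.

```python
import numpy as np

n = 4                                    # image side length; kernel zero-padded to n x n
k = np.zeros((n, n))
k[:3, :3] = np.random.default_rng(0).standard_normal((3, 3))   # a 3x3 kernel

# Explicit matrix of single-channel circular 2D convolution acting on
# row-major-flattened n x n images (a doubly block circulant matrix).
A = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for a in range(n):
            for b in range(n):
                A[i * n + j, ((i - a) % n) * n + ((j - b) % n)] = k[a, b]

# The 2D DFT diagonalizes A, so its singular values are |fft2(k)|.
sv_fft = np.sort(np.abs(np.fft.fft2(k)).ravel())
sv_svd = np.sort(np.linalg.svd(A, compute_uv=False))
assert np.allclose(sv_svd, sv_fft)
```

Computing an FFT of the kernel is vastly cheaper than forming and decomposing the explicit n²-by-n² matrix.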
Although the Grand Tour is not a perfect solution for high-dimensional data visualization, the examples above highlight the importance of ensuring that the properties of the phenomenon we want to visualize are in fact respected by the visualization method. As powerful as t-SNE and UMAP are, they often fail to offer the correspondences we need, and such correspondences can come, surprisingly, from relatively simple methods like the ones we presented.
In the interest of completeness, the remainder of the article provides a number of technical details about the implementation of the techniques we described above.
Our direct manipulation has two modes: on the axes and on data points.
The math behind the two is simple.
As a convention, in this note we write vectors as rows, and (matrices of) linear transformations are applied on the right.
This makes compositions of (linear) functions read easily from left to right.
Note that this is also common practice when programming with data, since tabular data is often stored in rows.
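A small numpy example of this convention (the shapes and matrices here are arbitrary):

```python
import numpy as np

# Row-vector convention: data points are rows of X, and a pipeline of
# linear maps M1, M2 is applied on the right, reading left to right.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))        # 3 points in 4-D, one per row
M1 = rng.standard_normal((4, 5))       # first linear map
M2 = rng.standard_normal((5, 2))       # second linear map

Y = X @ M1 @ M2                        # "apply M1, then M2"
assert Y.shape == (3, 2)
assert np.allclose(Y, X @ (M1 @ M2))   # same as composing the maps first
```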
Our goal is to change the Grand Tour view
Note the projection of standard basis
So intuitively, the new matrix can be found by copying the old one and incrementing two of its entries by
Two problems need clarification:
First, there are infinitely many solutions for such
As we have said, when the dimensionality is high, manipulating the view through axis handles is complicated. A better, more intuitive way is to drag (centroids of) data points instead of axis handles. Seeing unit vectors along the axis directions as special data points, we will see that this scheme is a generalization of the previous one, and it has a nice geometric interpretation: a drag causes a rotation in a 2-dimensional subspace.
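Assuming centroid vectors `a` (before the drag) and `b` (after), such a rotation can be sketched in numpy as follows; this illustrates the geometry of the 2-subspace rotation, not our exact implementation.

```python
import numpy as np

def plane_rotation(a, b):
    """p x p rotation acting only in span{a, b}: it carries the
    direction of a onto the direction of b, and is the identity on
    the orthogonal complement.  Row-vector convention: apply as x @ R.
    Assumes a and b are not collinear.
    """
    u = a / np.linalg.norm(a)             # first basis vector of the plane
    w = b - (b @ u) * u
    v = w / np.linalg.norm(w)             # second basis vector, orthogonal to u
    c, s = b @ u, b @ v                   # cos/sin of the rotation angle
    c, s = c / np.hypot(c, s), s / np.hypot(c, s)
    R = (np.eye(len(a))
         - np.outer(u, u) - np.outer(v, v)       # remove the plane...
         + np.outer(u, c * u + s * v)            # ...and put back rotated copies
         + np.outer(v, -s * u + c * v))
    return R

a = np.array([1.0, 0.0, 0.0, 0.0])        # hypothetical centroid before drag
b = np.array([0.6, 0.8, 0.0, 0.0])        # hypothetical centroid after drag
R = plane_rotation(a, b)
assert np.allclose(a @ R, b)                        # a is carried onto b
assert np.allclose(R @ R.T, np.eye(4), atol=1e-8)   # R is orthogonal
```

Multiplying the current Grand Tour matrix by such an `R` realizes the drag while keeping the view a rotation.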
We want to rotate the plane which is spanned by the two centroids before and after user's movement. Let
Another way to formulate the direct manipulation is through Procrustes problems. When dragging, the user wants to translate each selected point in the direction of dragging while keeping the non-selected points fixed. We can formulate this as an optimization over orthogonal transforms: we look for a rotation that aligns the old view of the data points with a new view in which the selected points are translated. This is the typical objective of the (orthogonal) Procrustes problem, which has a closed-form solution. If we discard the part on the hidden dimensions that is not shown on the canvas, the problem is called projection Procrustes, which has an iterative algorithmic solution. We will explain the orthogonal version first.
Before any formulas, we encourage readers to go back to our interactive demo about layer dynamics and experience the difference between these methods.
To formalize the orthogonal Procrustes problem, let
Instead of looking for full
Each iteration of step 2 generates a new rotation matrix; the sequence of these matrices is multiplied on the right of the current Grand Tour matrix. Interestingly, the first orthogonal Procrustes iteration behaves similarly to PCA, since it tries to minimize the norm in the last p-d dimensions. We will demonstrate that this operation is also useful for data exploration in the Grand Tour.
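For reference, the closed-form solution of the orthogonal Procrustes step can be sketched in a few lines of numpy. In this toy check the target view `Y` is generated by an exact rotation, so the recovery can be verified; in practice `Y` would be the old view with the selected points translated, and `R` its best orthogonal approximation.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Closed-form solution of  min_R ||X @ R - Y||_F  over orthogonal R.

    The optimum is R = U @ Vt, where U, _, Vt = svd(X.T @ Y).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                     # old view of the points
R_true = np.linalg.qr(rng.standard_normal((5, 5)))[0]  # a hidden rotation
Y = X @ R_true                 # a target view reachable by an exact rotation
R = orthogonal_procrustes(X, Y)

assert np.allclose(R @ R.T, np.eye(5), atol=1e-8)     # R is orthogonal
assert np.allclose(X @ R, Y, atol=1e-6)               # and aligns X with Y
```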
Although the above algorithm is slow in our implementation, interestingly, if we compute only one iteration, its behavior is similar to running PCA on the subset of points the user brushed. We found this operation alone very useful when points are clustered and cannot be separated by other methods.
As noted before, we preprocessed the data so that the neuron activations in one layer are aligned with those in the next. This preprocessing significantly helps us understand the essence of the neural network; without it, the transitions between layers would show many unnecessary crossings of the origin and axis switches.
We will first consider the case where the two layers connected by a linear map have the same number of neurons, and later extend this to cases where the two layers differ in dimensionality.
Suppose we want to look at a linear transformation in Grand Tour.
Data that are stored in rows vectors of
One may need matrix exponentials to compute intermediate steps for the transitions
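The reason matrix exponentials appear is that linearly blending two rotation matrices does not in general give a rotation; interpolating through the exponential map does. A numpy/scipy sketch with an arbitrary random rotation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B - B.T) / 2.0          # skew-symmetric generator of a rotation
R1 = expm(A)                 # the transition's end rotation

# Every intermediate frame on the exponential path is itself a rotation.
for t in np.linspace(0.0, 1.0, 5):
    Rt = expm(t * A)         # intermediate rotation at time t
    assert np.allclose(Rt @ Rt.T, np.eye(4), atol=1e-8)

# By contrast, the naive linear blend fails the same check at t = 0.5.
blend = 0.5 * np.eye(4) + 0.5 * R1
assert not np.allclose(blend @ blend.T, np.eye(4), atol=1e-3)
```

The exponential path therefore keeps every animation frame a valid Grand Tour view.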
Consider a p-dimensional case where the linear transformation
Given these considerations, we want to reduce the rotation-scaling-rotation operation and show only its scaling component.
Note that by definition,
In summary, for any linear map
In this transition, linear interpolation between
Some extra care, regarding both understanding and implementation, must be taken when the dimensionalities of the two layers differ.
In a linear map
When
When
Our presentation of layer-to-layer visualization implicitly assumed a setting in which only two layers were involved.
When animating more than one transition at once, we need to deal with the discrepancy between the pairwise alignments.
This problem only arises when composing linear operations. Non-linear layers, as we discussed above, have natural alignments.
For example, when dealing with three operators
When convolutional layers are concerned, one can represent multi-channel 2D convolutions as a large and sparse, doubly block circulant
The utility code for WebGL under js/lib/webgl_utils/ is adapted from the supplementary materials of Angel's computer graphics book: https://www.cs.unm.edu/~angel/BOOK/INTERACTIVE_COMPUTER_GRAPHICS/SEVENTH_EDITION/