The Grand Tour
Neural networks are now the default tool for supervised learning problems, sometimes matching human performance on a number of tasks.
Unfortunately, their decision process is notoriously hard to interpret.
To understand a neural network, we often try to observe its action on input examples.
These kinds of visualizations are useful to elucidate the activation patterns of a neural network for a single example, but they offer little insight about the relationship between different examples, different states of the network as it's being trained, or the same example as it is transformed by the different layers of a single network.
Therefore, we should instead aim to visualize the context around our examples of interest. This context is precisely the structure that linear methods preserve well.
To fully understand the neural network process, it is not enough to "open the black box".
We need a visualization that opens the box, but crucially, does not destroy the patterns we are hoping to find.
As a concrete example of how visualizations can fail, consider the following scenario.
We trained deep CNN models on three common classification datasets: MNIST, fashion-MNIST, and CIFAR-10. Our architecture is simpler than the state of the art, but complex enough to demonstrate the power of the Grand Tour. When training neural networks, we often monitor the training process by plotting the loss over training epochs. Taking the training of an MNIST classifier as an example, if we plot the testing loss:
We see that twice, before the loss converges, the curve goes flat and then drops (around epochs 14 and 21). What happened? We might simply guess that the optimization moved through two local basins, but what caused that?
If we take one step further by plotting the per-class loss:
we learn that the two drops are caused by two particular classes (digits 1 and 7), whose loss curves drop later than those of the other classes. The model learns different classes at very different times in the training process. Although the network learns to recognize digits 0, 2, 3, 4, 5, 6, 8, and 9 early on, it is not until epoch 14 that it starts successfully recognizing digit 1, and not until epoch 21 that it recognizes digit 7.
Perhaps the next question is: was this late learning caused by a subset of input examples, or by the whole class of digit 1 or 7? To explore this, we need to break the per-class losses down to individual input examples. But instead of looking at losses, we can directly look at the softmax activations. The softmax layer has 10 neurons that can be read as predicted probabilities, one for each class. We expect, for example, the 7th (0-indexed) neuron to fire mostly for handwritten sevens. In fact, since we have access not only to the softmax but to all hidden layers, we can probe the network with input examples and see how it reacts at each layer. In this way we try to open the "black box" and understand the internals of a neural network. Here is a typical view of the opened box:
Even though neural networks are capable of incredible feats of classification, deep down they really are just pipelines of relatively simple functions. In our classifier, the pipeline has convolutional and fully-connected linear layers, and rectified linear unit (ReLU) non-linear layers. Passing an input image through this pipeline transforms it into different activation patterns in the hidden layers, ending with the softmax activations. Most of these simple functions fall into two categories: they are either linear transformations of their inputs (like fully-connected layers or convolutional layers), or relatively simple non-linear functions that work component-wise (like sigmoid or ReLU activations). (Some operations, notably max-pooling and softmax, fall into neither category; we will come back to this later.) We will later see that this categorization is essential when we try to untangle the complex internals of the neural network. Decomposing the whole pipeline into simpler functions gives details about what happens inside, in particular how neurons activate in different layers. However, in the above view we look at examples one at a time, and we lose the full picture: what does the network do to all examples?
The need is simple: we want to look at the neuron activations (e.g. in the softmax layer) of all examples at once. If there were only two neurons/dimensions, a scatter plot would suffice. However, the data points in the softmax layer are 10-dimensional, so we need to either show two dimensions at a time, which does not scale well (there would be 45 scatter plots to look at, a number quadratic in the dimensionality), or use dimensionality reduction to map the data into a two-dimensional space and show it in a single plot.
Consider the aforementioned intriguing feature, the different times at which the MNIST classifier learns digits 1 and 7: the network did not learn to recognize digit 1 until epoch 14, or digit 7 until epoch 21.
With that in mind, look for changes in digit 1 around epoch 14 and changes in digit 7 around epoch 21 in the visualizations below.
The network behavior is not subtle: every digit in those classes is misclassified until the critical epoch.
Yet, note how hard it is to spot this in any of the plots below at first glance.
In fact, if inspected carefully, we can see that in the UMAP
The reason that non-linear embeddings fail to reveal this fundamental change in the data is that they do not fulfill the principle of data-visual correspondence
Non-linear embeddings that have non-convex objectives also tend to be sensitive to initial conditions.
For example, in MNIST, although the neural network starts to stabilize on epoch 30, t-SNE and UMAP still generate quite different projections between epochs 30 and 99.
But even with spatial regularization
As another example, consider the network's three-way confusion among sandals, sneakers, and ankle boots in fashion-MNIST. You can see it most directly in the linear projection (last row, which we found using the Grand Tour and direct manipulation), as the points from those three classes form a triangular frame. t-SNE, in contrast, incorrectly separates the class clusters (possibly because of an inappropriately-chosen hyperparameter). Although UMAP isolates the three classes successfully, the fundamental complexity of non-linear methods makes it hard to know whether this is a feature of the data or an artifact of the embedding algorithm. If, as in these cases, a salient change in a visualization is not due only to the data (e.g. it is also partially due to the embedding algorithm), this bad design can obscure interpretation: we may miss important patterns when reading a bad visualization.
When given the option, then, we should prefer methods for which changes in the data produce predictable, visually salient changes in the result. PCA is a natural way of doing linear dimensionality reduction, since it tries to preserve the most variance in the data. The distribution of data in the softmax layer, however, is essentially spherical, since activations of examples from different classes are orthogonal to each other. As a result, even though PCA projections are interpretable and consistent across training epochs, the first two principal components of the softmax activations tend not to be substantially better than the third. So which of them should we show to the user? To make matters worse, different principal components are orthogonal to each other, making it hard for the user to keep them all in their head at once. The Grand Tour, on the other hand, is a linear method that offers us the chance to see many projections, smoothly animated.
Combined with some user interactions, the Grand Tour lets us quickly find projections of interest that highlight important features in the data. In the above view you can drag the circular handle on each axis to move that axis.
To illustrate the late learning of digits 1 and 7 in the MNIST classifier, we found the following linear projection, which places the axes corresponding to digits 1 and 7 prominently on the canvas. We can clearly see that digits 1 and 7 are classified correctly starting from epochs 14 and 21 respectively, thanks to the data-visual correspondence of linear projections.
Now, let us take a closer look at the Grand Tour technique itself.
The Grand Tour
Mathematically, the Grand Tour provides us with a sequence of pD-to-2D projections that lets us project the p-dimensional data onto different 2D subspaces.
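One simple way to realize such a path of projections is to exponentiate a fixed skew-symmetric matrix. The numpy/scipy sketch below illustrates the idea under that assumption; it is a simplified stand-in for Asimov's torus-path construction, not our actual JavaScript implementation.

```python
import numpy as np
from scipy.linalg import expm

def grand_tour_projection(p, t, seed=0):
    """2D projection of a smooth rotation path at time t.

    A fixed random skew-symmetric generator A yields the rotation
    family R(t) = exp(t * A); the first two columns of R(t) give a
    pD-to-2D orthogonal projection.  (A simplified stand-in for the
    torus-path construction of the actual Grand Tour.)
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((p, p))
    A = (B - B.T) / 2.0          # skew-symmetric, so expm(t * A) is a rotation
    return expm(t * A)[:, :2]

# project 100 random 10-dimensional points at time t = 0.3
X = np.random.default_rng(1).standard_normal((100, 10))
P = grand_tour_projection(10, 0.3)
Y = X @ P                        # (100, 2) screen coordinates
```

As t varies continuously, the projection rotates smoothly, which is what produces the animated tour.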
This sequence/path of projection can be seen as a mapping from the time variable
We have to emphasize one favorable property of the Grand Tour for visualizing neural networks: it is invariant to the basis chosen for the data representation. Because the Grand Tour itself is a change of basis, every rigid rotation in data space has a corresponding Grand Tour matrix that results in the same view. At the same time, the linear transformations in neural networks are also invariant under a change of basis: every linear transformation can be uniquely decomposed into a three-step action, a rotation, a scaling, and another rotation. We will later show that this connection suggests a simplification of all linear transformations in the Grand Tour view.
The Grand Tour is a general technique for any high-dimensional data, so in principle we should be able to use it to view the neuron activations of any layer.
However, the final softmax layer is especially easy to understand because its axes have strong semantics - the
First, the Grand Tour of the softmax layer lets us qualitatively judge the performance of our model. In addition, since we used similar architectures on the three datasets, the performance also reflects the relative complexity of each dataset. The data points are most clustered for the MNIST dataset, where the digits sit close to one of the ten corners of the softmax space; for fashion-MNIST and CIFAR-10 the separation is not as extreme, and more points live inside the volume.
In fashion-MNIST, we can see how the model confuses classes. For example, the model confuses sandals, sneakers, and ankle boots, as the data points from those classes form a triangular frame in the softmax layer. In the figure below, drag any handle on the axes to change the projection:
Examples falling between the sandal and sneaker classes indicate that the model has low confidence in distinguishing the two; the same happens between the sneaker and ankle boot classes. But the model has less confusion between the sandal and ankle boot classes, as few examples fall between them. Moreover, only a few examples fall in the interior of the triangle; most live close to its boundary. This tells us that most confusions happen between two of the three classes: they are really two-way confusions. Within the same dataset, we can also see a triangular plane filled by pullovers, coats, and shirts. This is different from the sandal-sneaker-ankle-boot case, as examples fall not only on the boundary of the triangle but also in its interior: a true three-way confusion. Similarly, in the CIFAR-10 dataset we can see confusion between dogs and cats, and between airplanes and ships. The mixing pattern in CIFAR-10 is not as clear as in fashion-MNIST, because many examples are misclassified.
We will explain how our direct manipulation works in the technical details section.
We trained for 99 epochs and recorded the entire history of neuron activations on subsets of the training and testing examples. In the beginning, when the neural network is first randomly initialized, all examples are placed around the origin of the softmax space. Through training, examples are pushed to their correct "class corners" of the softmax space.
Comparing the behaviors on the training and testing datasets confirms our earlier hypothesis about dataset difficulty, made when we only looked at the final epoch. In the MNIST dataset, the trajectories of testing images through training are consistent with those of the training set: data points go directly toward the corner of their true class and stabilize after 50 epochs. In CIFAR-10, on the other hand, there is an inconsistency between the training and testing sets: images from the testing set keep oscillating, while most images from the training set converge to their class corners. This signals that the model overfits the training set and thus does not generalize well to the testing set.
Given the presented techniques, we can in principle inspect any intermediate layer of a neural network. However, two problems need to be solved first:
We address the first problem by allowing users to directly manipulate instances, with the argument that instances have better semantics than neurons (we describe this in the technical details section).
To tackle the second, instead of looking at layers side by side, we shall see the transformation as an action on a layer and look at how this action moves data points through animation.
One direct approach is to embed the lower-dimensional layer into the higher-dimensional one (by appending zeros) and then linearly interpolate the source
Note that here we make an explicit distinction between the data and its representation, to help us think about the true signal hidden in any specific representation. The order of neurons in a layer is an artifact of the specific model, or of a specific representation of the model, since models that merely permute the neurons should be thought of as the same model. When designing a visualization (vis), we should think about which parts of the data representation are artifacts we should factor out, and which contain the real knowledge we should distill from the representation. In this case, we want our vis to treat different orderings of the neurons/dimensions as equivalent, because the ordering is not meaningful. Conversely, when judging a vis, we should think about what it emphasizes about the data and what it discards. That is, we think about what the Grand Tour sees as equivalent, and whether that is aligned with our goals.
To see whether the Grand Tour is a desired vis in viewing a linear transformation, we take advantage of a central, amazing fact of linear algebra.
The Singular Value Decomposition (SVD) theorem shows that any linear transformation can be decomposed into a sequence of very simple operations: a rotation, a scaling, and another rotation.
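This is easy to verify numerically. The following numpy sketch (with an arbitrary random matrix standing in for a real layer's weights) checks the decomposition and the row-vector convention used later in this article:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))          # stand-in for a linear layer's weights

U, s, Vt = np.linalg.svd(M)              # M = U @ diag(s) @ Vt
# U and Vt are orthogonal (rotations, possibly with a reflection);
# diag(s) is a pure axis-aligned scaling
assert np.allclose(U @ U.T, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(4))
assert np.allclose(U @ np.diag(s) @ Vt, M)

# acting on a row vector x (the article's convention, x @ M):
# rotate by U, scale componentwise by s, rotate by Vt
x = rng.standard_normal(4)
assert np.allclose(x @ M, ((x @ U) * s) @ Vt)
```

The componentwise multiplication by `s` is exactly the "scaling" middle step of the decomposition.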
Applying a matrix
Since we have activations for all layers, to reduce the data size we cut the number of data points to 500 and the number of epochs to 50. Although fewer points and epochs are shown, we are now able to trace a behavior we observed in the softmax back to any layer. In fashion-MNIST, for example, we observe a separation of shoes (sandals, sneakers, and ankle boots as a group) from all other classes in the softmax layer. Tracing it back to earlier layers, we can see that this separation happens as early as layer 5:
Unlike in the softmax layer, there is a problem with our axis handles in high-dimensional hidden layers.
Although we have preprocessed the data (keeping up to 45 principal components via PCA) so that its dimensionality is manageable for the Grand Tour, the axis lines still form a hairball; more importantly, the axes of hidden layers do not carry as much meaning as those of the softmax.
Unlike the softmax or pre-softmax layers, where each axis represents a class label, in hidden layers one may need indirect ways (e.g. feature visualization
The Grand Tour can also elucidate adversarial examples
Let us examine how adversarial examples evolved to fool the network:
We start from 89 examples of digit 8 (black dots in the plot above) and gradually add adversarial noise to these images, using the fast gradient sign method
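The fast gradient sign method itself is simple: perturb the input by a small step in the direction of the sign of the loss gradient with respect to the input. As a hedged illustration, here is a toy logistic-regression version in numpy; the article's experiments use a CNN, and the weights `w`, input `x`, step size, and loop count below are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_step(x, y, w, eps):
    """One fast-gradient-sign step against a toy logistic model.

    With binary cross-entropy loss on p = sigmoid(x @ w), the input
    gradient is (p - y) * w; FGSM perturbs x by eps times its sign.
    """
    p = sigmoid(x @ w)
    grad = (p - y) * w                 # dLoss/dx
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])         # hypothetical model weights
x = np.array([0.2, 0.1, -0.3])         # hypothetical input, true label y = 1
x_adv = x
for _ in range(10):                    # gradually add adversarial noise
    x_adv = fgsm_step(x_adv, 1.0, w, eps=0.05)

# the model's confidence in the true class drops as noise accumulates
assert sigmoid(x_adv @ w) < sigmoid(x @ w)
```

Iterating small steps, as in the loop above, mirrors how we gradually add adversarial noise to the digit-8 images.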
What is more interesting is what happens in the intermediate layers. In the pre-softmax layer, for example, we see that these fake 0s behave differently from the genuine 0s: they live closer to the decision boundary between the two classes and form a plane of their own.
Previously, we compared the Grand Tour with other dimensionality reduction techniques, using small multiples of random projections to serve as key frames of the Grand Tour. The trade-off between small multiples and animations is an ongoing discussion in the visualization community. Here, we value the animation more than multiple static plots, for it eliminates a certain kind of misleading artifact and thus gives better data-visual correspondence. In a linear projection, points that form a straight line in the original space are guaranteed to form a straight line in the low-dimensional projection. However, since the mapping is from a high dimension to a lower one, there can be false positives: a straight line on screen may not correspond to a real straight line in the original space, as the line being shown may be an illusion caused by the lossy mapping. With static plots, this potential illusion can never be eliminated, since we only see one projection at a time. With animations, a viewer can visually infer and confirm the existence of a line in high-dimensional space by looking at multiple frames with different projections. The longer a line persists in the animated projection, the more likely it is indeed a straight line in the original space. The smoothness of the animation also gives the viewer an easier way to track points from view to view, compared to presenting the same information as a sequence of snapshots. Animation in this context therefore enables a form of visual hypothesis testing, with the null hypothesis being "the line I see is created merely by the projection, not by the data". As the figure in the comparison section may have convinced you, it is harder to recognize straight lines with high confidence in static plots than in the Grand Tour animation, not to mention more complicated structures such as curves and clusters.
In our work we have walked through models that are purely sequential. In modern architectures it is common to have non-sequential parts such as small ephemeral highway
Modern architectures are also wide. Especially where convolutional layers are concerned, one can run into scalability issues if such layers are treated as large sparse matrices acting on flattened multi-channel images.
For the sake of simplicity, in this article we brute-forced the computation of the alignment of such convolutional layers by writing out their explicit matrix representations.
However, the singular value decomposition of multi-channel 2D convolutions can be computed efficiently
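For intuition on why this is possible, here is a numpy sketch of the single-channel circular case: the doubly block circulant matrix of a 2D circular convolution is diagonalized by the 2D DFT, so its singular values are just the magnitudes of the kernel's 2D DFT (the multi-channel result reduces the problem to many small per-frequency SVDs). The image size and kernel here are arbitrary toy values.

```python
import numpy as np

n = 4                                    # image side length; kernel zero-padded to n x n
k = np.zeros((n, n))
k[:3, :3] = np.random.default_rng(0).standard_normal((3, 3))   # a 3x3 kernel

# Explicit matrix of single-channel circular 2D convolution acting on
# row-major-flattened n x n images (a doubly block circulant matrix).
A = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for a in range(n):
            for b in range(n):
                A[i * n + j, ((i - a) % n) * n + ((j - b) % n)] = k[a, b]

# The 2D DFT diagonalizes A, so its singular values are |fft2(k)|.
sv_fft = np.sort(np.abs(np.fft.fft2(k)).ravel())
sv_svd = np.sort(np.linalg.svd(A, compute_uv=False))
assert np.allclose(sv_svd, sv_fft)
```

Computing an FFT of the kernel is vastly cheaper than forming and decomposing the explicit n²-by-n² matrix.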
Although the Grand Tour is not a perfect solution for high-dimensional data visualization, the examples above highlight the importance of ensuring that the properties of the phenomenon we want to visualize are in fact respected by the visualization method. As powerful as t-SNE and UMAP are, they often fail to offer the correspondences we need, and such correspondences can come, surprisingly, from relatively simple methods like the ones we presented.
In the interest of completeness, the remainder of the article provides a number of technical details about the implementation of the techniques we described above.
Our direct manipulation has two modes: on the axes and on data points.
The math behind the two is simple.
As a convention, in this note we write vectors as rows, and (matrices of) linear transformations are applied on the right.
This makes compositions of (linear) functions read easily from left to right.
Note that this is also common practice when programming with data, since tabular data is often stored in rows.
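A small numpy example of this convention (the shapes and matrices here are arbitrary):

```python
import numpy as np

# Row-vector convention: data points are rows of X, and a pipeline of
# linear maps M1, M2 is applied on the right, reading left to right.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))        # 3 points in 4-D, one per row
M1 = rng.standard_normal((4, 5))       # first linear map
M2 = rng.standard_normal((5, 2))       # second linear map

Y = X @ M1 @ M2                        # "apply M1, then M2"
assert Y.shape == (3, 2)
assert np.allclose(Y, X @ (M1 @ M2))   # same as composing the maps first
```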
Our goal is to change the Grand Tour view
Note the projection of standard basis
So intuitively, the new matrix can be found by copying the old one and incrementing two of its entries by
Two problems need clarification:
First, there are infinitely many solutions for such
As we have said, when the dimensionality is high, manipulating the view through axis handles is complicated. A better, more intuitive way is to drag (centroids of) data points instead of axis handles. Seeing unit vectors along the axis directions as special data points, we will see that this scheme is a generalization of the previous one, and it has a nice geometric interpretation: a drag causes a rotation in a 2-dimensional subspace.
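Assuming centroid vectors `a` (before the drag) and `b` (after), such a rotation can be sketched in numpy as follows; this illustrates the geometry of the 2-subspace rotation, not our exact implementation.

```python
import numpy as np

def plane_rotation(a, b):
    """p x p rotation acting only in span{a, b}: it carries the
    direction of a onto the direction of b, and is the identity on
    the orthogonal complement.  Row-vector convention: apply as x @ R.
    Assumes a and b are not collinear.
    """
    u = a / np.linalg.norm(a)             # first basis vector of the plane
    w = b - (b @ u) * u
    v = w / np.linalg.norm(w)             # second basis vector, orthogonal to u
    c, s = b @ u, b @ v                   # cos/sin of the rotation angle
    c, s = c / np.hypot(c, s), s / np.hypot(c, s)
    R = (np.eye(len(a))
         - np.outer(u, u) - np.outer(v, v)       # remove the plane...
         + np.outer(u, c * u + s * v)            # ...and put back rotated copies
         + np.outer(v, -s * u + c * v))
    return R

a = np.array([1.0, 0.0, 0.0, 0.0])        # hypothetical centroid before drag
b = np.array([0.6, 0.8, 0.0, 0.0])        # hypothetical centroid after drag
R = plane_rotation(a, b)
assert np.allclose(a @ R, b)                        # a is carried onto b
assert np.allclose(R @ R.T, np.eye(4), atol=1e-8)   # R is orthogonal
```

Multiplying the current Grand Tour matrix by such an `R` realizes the drag while keeping the view a rotation.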
We want to rotate the plane which is spanned by the two centroids before and after user's movement. Let
Another way to formulate the direct manipulation is through Procrustes problems. When dragging, the user wants to translate each selected point in the direction of dragging while keeping the non-selected points fixed. We can formulate this as an optimization over orthogonal transforms: we look for a rotation that aligns the old view of the data points with a new view in which the selected points are translated. This is the typical objective of the (orthogonal) Procrustes problem, which has a closed-form solution. If we discard the part on the hidden dimensions that is not shown on the canvas, the problem is called projection Procrustes, which has an iterative algorithmic solution. We will explain the orthogonal version first.
Before any formulas, we encourage readers to go back to our interactive demo about layer dynamics and experience the difference between these methods.
To formalize the orthogonal Procrustes problem, let
Instead of looking for full
Each iteration of step 2 generates a new rotation matrix; the sequence of these matrices is multiplied on the right of the current Grand Tour matrix. Interestingly, the first orthogonal Procrustes iteration behaves similarly to PCA, since it tries to minimize the norm in the last p-d dimensions. We will demonstrate that this operation is also useful for data exploration in the Grand Tour.
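For reference, the closed-form solution of the orthogonal Procrustes step can be sketched in a few lines of numpy. In this toy check the target view `Y` is generated by an exact rotation, so the recovery can be verified; in practice `Y` would be the old view with the selected points translated, and `R` its best orthogonal approximation.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Closed-form solution of  min_R ||X @ R - Y||_F  over orthogonal R.

    The optimum is R = U @ Vt, where U, _, Vt = svd(X.T @ Y).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                     # old view of the points
R_true = np.linalg.qr(rng.standard_normal((5, 5)))[0]  # a hidden rotation
Y = X @ R_true                 # a target view reachable by an exact rotation
R = orthogonal_procrustes(X, Y)

assert np.allclose(R @ R.T, np.eye(5), atol=1e-8)     # R is orthogonal
assert np.allclose(X @ R, Y, atol=1e-6)               # and aligns X with Y
```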
Although the above algorithm is slow in our implementation, interestingly, if we compute only one iteration, its behavior is similar to running PCA on the subset of points the user brushed. We found this operation alone very useful when points are clustered and cannot be separated by other methods.
As noted before, we preprocessed the data so that the neuron activations in one layer are aligned with those in the next. This preprocessing significantly helps us understand the essence of the neural network; without it, the transitions between layers would show many unnecessary crossings of the origin and axis switches.
We will first consider the case where the two layers connected by a linear map have the same number of neurons, and later extend this to cases where the two layers differ in dimensionality.
Suppose we want to look at a linear transformation in Grand Tour.
Data that are stored in rows vectors of
One may need matrix exponentials to compute intermediate steps for the transitions
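The reason matrix exponentials appear is that linearly blending two rotation matrices does not in general give a rotation; interpolating through the exponential map does. A numpy/scipy sketch with an arbitrary random rotation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B - B.T) / 2.0          # skew-symmetric generator of a rotation
R1 = expm(A)                 # the transition's end rotation

# Every intermediate frame on the exponential path is itself a rotation.
for t in np.linspace(0.0, 1.0, 5):
    Rt = expm(t * A)         # intermediate rotation at time t
    assert np.allclose(Rt @ Rt.T, np.eye(4), atol=1e-8)

# By contrast, the naive linear blend fails the same check at t = 0.5.
blend = 0.5 * np.eye(4) + 0.5 * R1
assert not np.allclose(blend @ blend.T, np.eye(4), atol=1e-3)
```

The exponential path therefore keeps every animation frame a valid Grand Tour view.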
Consider a p-dimensional case where the linear transformation
Given these considerations, we want to reduce the rotation-scaling-rotation operation and show only its scaling component.
Note that by definition,
In summary, for any linear map
In this transition, linear interpolation between
Some extra care, regarding both understanding and implementation, must be taken when the dimensionalities of the two layers differ.
In a linear map
When
When
Our presentation of layer-to-layer visualization implicitly assumed a setting in which only two layers were involved.
When animating more than one transition at once, we need to deal with the discrepancy between the pairwise alignments.
This problem only arises when composing linear operations. Non-linear layers, as we discussed above, have natural alignments.
For example, when dealing with three operators
When convolutional layers are concerned, one can represent multi-channel 2D convolutions as a large and sparse, doubly block circulant
The utility code for WebGL under js/lib/webgl_utils/ is adapted from the supplementary materials of Angel's computer graphics book: https://www.cs.unm.edu/~angel/BOOK/INTERACTIVE_COMPUTER_GRAPHICS/SEVENTH_EDITION/