<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jwuphysics.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jwuphysics.github.io/" rel="alternate" type="text/html" /><updated>2026-02-09T17:24:13+00:00</updated><id>https://jwuphysics.github.io/feed.xml</id><title type="html">jwuphysics</title><subtitle>John Wu&apos;s personal website</subtitle><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><entry><title type="html">Stabilizing Deep Neural Networks by Staying Close to the Identity</title><link href="https://jwuphysics.github.io/posts/2026/01/stabilizing-deep-neural-networks/" rel="alternate" type="text/html" title="Stabilizing Deep Neural Networks by Staying Close to the Identity" /><published>2026-01-24T00:00:00+00:00</published><updated>2026-01-24T00:00:00+00:00</updated><id>https://jwuphysics.github.io/posts/2026/01/stable-deep-neural-networks</id><content type="html" xml:base="https://jwuphysics.github.io/posts/2026/01/stabilizing-deep-neural-networks/"><![CDATA[<p>Deep neural networks can be thought of as compositions of many simple transformations, each represented by a layer with trainable parameters. When the number of layers is large, the effect of multiplying many random matrices becomes exponentially unstable, i.e. they can grow or shrink exponentially. This is the primary reason that naive initialization leads to exploding or vanishing signals for both forward (activations) and backward (gradients). Nonetheless, stability is possible when each layer is close to the identity operation. With the right scaling of weights at initialization, a deep network acts like a time-discretized flow, and the total transformation resembles a matrix exponential of small perturbations.</p>

<p>Earlier this month I gave a <a href="https://science.nasa.gov/astrophysics/programs/cosmic-origins/community/ai-ml-stig-lecture-series-12-jan-2026/">talk</a> and <a href="https://github.com/tingyuansen/NASA_AI_ML_STIG/tree/main/Resources/Lecture8_John_Wu">tutorial on Inductive Biases</a> to the <a href="https://science.nasa.gov/astrophysics/programs/cosmic-origins/community/artificial-intelligence-machine-learning-science-technology-interest-group-ai-ml-stig/">NASA AI/ML Science &amp; Technology Interest Group</a>. Some audience members asked questions and pursued follow-up discussion about initialization, residual layers, and connections to differential equations. This post attempts to summarize the most important points and connect the dots.</p>

<p><em>Thanks to Gemini 3 for help with copyediting review and blog post formatting.</em></p>

<h2 id="products-of-random-matrices-and-instability">Products of random matrices and instability</h2>

<p>Let’s begin with a purely linear \( L \)-layer network</p>

\[x_L = W_L W_{L-1}\cdots W_1\, x_0,\]

<p>where the \( W_\ell \) are random matrices.</p>

<p>Classical results in random matrix theory show that (under reasonable assumptions) the norm of this product grows or decays exponentially with the number of factors. More precisely, if the \( \{W_\ell\} \) are i.i.d. with suitable integrability and irreducibility conditions, then the limit</p>

\[\lambda_1 = \lim_{L\to\infty} \frac{1}{L} \log \left\| W_L\cdots W_1 \right\|\]

<p>is typically nonzero. The number \( \lambda_1 \) is the top Lyapunov exponent. If \( \lambda_1&gt;0 \), forward signals explode; if \( \lambda_1&lt;0 \), they vanish. Backpropagated gradients obey the same kind of product dynamics (with transposes), so the same instability can jeopardize training as well.</p>
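<p>Here is a minimal numerical sketch of this limit. It assumes one particular ensemble that is my choice for illustration: square Gaussian matrices with independent \( \mathcal{N}(0, \sigma^2/n) \) entries. The function name <code>top_lyapunov_exponent</code> and its defaults are hypothetical. Rather than forming the full product, it pushes a single vector through the layers and renormalizes at each step.</p>

```python
import numpy as np


def top_lyapunov_exponent(sigma, n=64, depth=2000, seed=0):
    """Estimate (1/L) log ||W_L ... W_1 x_0|| for i.i.d. Gaussian layers.

    Each W has independent N(0, sigma^2 / n) entries (an assumed ensemble).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    log_growth = 0.0
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * sigma / np.sqrt(n)
        x = W @ x
        norm = np.linalg.norm(x)
        log_growth += np.log(norm)  # accumulate the log of the per-layer stretch
        x /= norm                   # renormalize so nothing overflows or underflows
    return log_growth / depth


for sigma in (0.5, 1.0, 2.0):
    estimate = top_lyapunov_exponent(sigma)
    print(f"sigma={sigma}: lambda_1 is roughly {estimate:+.3f}")
```

<p>For this ensemble the estimate should land near \( \log\sigma \) when \( n \) is large, so only \( \sigma \approx 1 \) avoids exponential growth or decay.</p>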

<p>This instability is not a quirk of linear networks alone. Nonlinear activations appear between matrices, but at initialization most common activations behave approximately linearly around zero (see below). Thus, without special care, both activations and gradients are driven toward regimes where numerical stability is quickly lost as depth grows.</p>

<h2 id="avoiding-instability-by-staying-close-to-the-identity">Avoiding instability by staying close to the identity</h2>

<p>The way around exponential blow-up or decay is by ensuring that each layer is close to the identity transformation. To see how this works, let’s express a single layer via:</p>

\[W_\ell = I + \varepsilon A_\ell,\]

<p>where \( \varepsilon&gt;0  \) is small and \( A_\ell  \) is a random matrix with mean near zero and bounded moments.</p>

<p>Consider the product</p>

\[M_L = \prod_{\ell=1}^{L} \left( I + \varepsilon A_\ell \right),\]

<p>where the product is ordered so that \( \ell=1 \) acts first on the input. We can look at two regimes in more detail to gain intuition:</p>

<ol>
  <li>
    <p>If we choose \( \varepsilon = L^{-1} \) and let \( L\to\infty \), then, even with noncommuting operators, we can use the <a href="https://en.wikipedia.org/wiki/Lie_product_formula">Trotter product formula</a> to find
\(M_L \to \exp\!\left( \frac{1}{L}\sum_{\ell=1}^{L} A_\ell \right)\) as \( L\to\infty \), for random \( A_\ell \). The average of the \( A_\ell \) should be finite, and thus the limit is a well-defined matrix exponential. Intuitively, we expect that many small, nearly commuting perturbations behave as a smooth exponential flow.</p>
  </li>
  <li>
    <p>If the perturbations are larger, say \( \varepsilon = L^{-1/2} \), then the product converges in distribution to the <a href="https://lpetrov.cc/rmt25/rmt25-notes/rmt2025-l10.pdf">stochastic or time-ordered exponential of a matrix-valued Brownian motion</a>. This is a random element of the <a href="https://en.wikipedia.org/wiki/General_linear_group">general linear group</a>, but still avoids the exponential blow-up that we’d worry about from unscaled random products.</p>
  </li>
</ol>

<p>Either way, we see that the result is a controlled product because each factor is a small deviation from \( I \).</p>
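<p>Both regimes can be checked numerically. The sketch below is illustrative only: the Gaussian \( A_\ell \), the small nonzero drift added in the first regime, the matrix size, and all function names are assumptions of mine. The matrix exponential is a plain truncated Taylor series, which is adequate only because the matrices here have small norm.</p>

```python
import numpy as np


def expm_taylor(A, terms=30):
    """Matrix exponential via a truncated Taylor series (fine for small ||A||)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result


def near_identity_product(A_stack, eps):
    """Ordered product (I + eps A_L) ... (I + eps A_1); A_1 acts first."""
    n = A_stack[0].shape[0]
    M = np.eye(n)
    for A in A_stack:
        M = (np.eye(n) + eps * A) @ M
    return M


def regime_one_error(L, n=4, seed=0):
    """Relative gap between the eps = 1/L product and exp(mean of A)."""
    rng = np.random.default_rng(seed)
    drift = 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)  # assumed nonzero mean
    A_stack = drift + rng.standard_normal((L, n, n)) / np.sqrt(n)
    M = near_identity_product(A_stack, eps=1.0 / L)
    target = expm_taylor(A_stack.mean(axis=0))
    return np.linalg.norm(M - target) / np.linalg.norm(target)


def regime_two_norm(L, n=4, seed=0):
    """Spectral norm of the eps = L^(-1/2) product with mean-zero A."""
    rng = np.random.default_rng(seed)
    A_stack = rng.standard_normal((L, n, n)) / np.sqrt(n)
    M = near_identity_product(A_stack, eps=L ** -0.5)
    return np.linalg.norm(M, ord=2)


for L in (50, 5000):
    print(f"L={L}: regime-one relative error {regime_one_error(L):.4f}")
print(f"L=5000: regime-two spectral norm {regime_two_norm(5000):.3f}")
```

<p>In the first regime the gap to the matrix exponential should shrink as \( L \) grows. In the second the product stays random, but its norm should remain of order one rather than growing with depth.</p>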

<h2 id="initialization-near-the-identity">Initialization near the identity</h2>

<p>If you’ve ever taken the <a href="https://course.fast.ai/Lessons/lesson17.html">Fastai Practical Deep Learning courses</a>, then you’ll know that good initializations in deep learning are critical for stabilizing the forward and backward signals. We accomplish this by initializing each layer to be a small perturbation of the identity. The usual rule of thumb is to choose the entries of \( W_\ell \) to be independent, mean-zero, and with variance matching the inverse of each layer’s input dimension (i.e., fan-in dimension).</p>

<p>We can investigate the pre-activation of neuron \( i \) in layer \( \ell \),</p>

\[z_i^{(\ell)} = \sum_{j=1}^{n} W_{ij}^{(\ell)}\, a_j^{(\ell-1)},\]

<p>where \( a^{(\ell-1)} \) are activations from the previous layer and \( n \) is the fan-in. If \( a_j^{(\ell-1)} \) are centered with variance near unity, and the weights have variance \( \sigma^2 \), then,</p>

\[\mathrm{Var}\!\left(z_i^{(\ell)}\right) = n\,\sigma^2\,\mathrm{Var}\!\left(a^{(\ell-1)}\right) \approx n\sigma^2.\]

<p>To keep the scale from changing across layers at initialization, we can set \( n\sigma^2 \approx 1 \) for linear or tanh-like activations near zero, so \( \sigma^2 \approx 1/n \). For ReLU, which zeros out about half of its inputs and rescales the variance, we use \( \sigma^2 \approx 2/n \). These choices keep the variance of pre-activations and activations roughly constant with depth.</p>
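<p>A quick way to see these rules in action is to push random inputs through a stack of random layers and look at the pre-activation variance at the end. This is a sketch under assumptions of my own: square layers, Gaussian weights with \( \mathrm{Var}(W_{ij}) = \texttt{gain}/n \), and hypothetical names and defaults throughout.</p>

```python
import numpy as np


def linear(z):
    return z


def relu(z):
    return np.maximum(z, 0.0)


def final_preactivation_variance(activation, gain, depth=30, n=256, batch=256, seed=0):
    """Variance of the last layer's pre-activations when Var(W_ij) = gain / n."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, batch))  # unit-variance inputs, one column per sample
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(gain / n)
        z = W @ a
        a = activation(z)
    return float(z.var())


print("linear, variance 1/n:", final_preactivation_variance(linear, gain=1.0))
print("ReLU,   variance 1/n:", final_preactivation_variance(relu, gain=1.0))
print("ReLU,   variance 2/n:", final_preactivation_variance(relu, gain=2.0))
```

<p>With these settings the linear and ReLU-with-\( 2/n \) cases should stay of order one, while ReLU with \( 1/n \) loses roughly a factor of two per layer.</p>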

<p>There is a geometric way to interpret this. When \( n\sigma^2 \) is of order one, the operator norm of \( W_\ell \) is typically of order one as well, so \( W_\ell \) does not significantly expand or contract the input space at initialization. In high dimensions, with mean-zero, light-tailed weights, this makes \( W_\ell \) act like the identity plus a small, approximately Gaussian perturbation.</p>

<h2 id="backward-stability-and-gradients">Backward stability and gradients</h2>

<p>Backpropagation propagates gradients through transposed weights. If the forward pass is stable at initialization, then the variance of gradients with respect to activations is also preserved layer to layer, provided the same scaling is used. So if \( \delta^{(\ell)} \) denotes the gradient signal at layer \( \ell \) and \( \phi \) represents the element-wise activation function (such that activations \( a^{(\ell)} = \phi(z^{(\ell)}) \)), then</p>

\[\delta^{(\ell)} = (W_\ell)^\top \left(\phi'(z^{(\ell)}) \odot \delta^{(\ell+1)}\right),\]

<p>and, under independence and small-perturbation assumptions, the variance of \( \delta^{(\ell)} \) matches that of \( \delta^{(\ell+1)} \) when \( \sigma^2 \) is chosen as above and \( \phi' \) has stable variance near initialization. The gradient with respect to a weight element is a product \( x\,\delta \), so its typical scale is controlled once the forward and backward variances are controlled. Thus the same near-identity reasoning stabilizes both directions of signal flow.</p>
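<p>The backward recursion above can be run by hand on a random ReLU network. The following is a rough sketch, not a training setup: the upstream gradient is a random stand-in, the layers are square, and the function name and defaults are hypothetical.</p>

```python
import numpy as np


def gradient_norm_ratio(gain, depth=30, n=256, seed=0):
    """Return ||delta at the input|| / ||delta at the output|| for a random ReLU net."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n)
    weights, masks = [], []
    for _ in range(depth):  # forward pass, caching what backprop needs
        W = rng.standard_normal((n, n)) * np.sqrt(gain / n)
        z = W @ a
        mask = np.heaviside(z, 0.0)  # phi'(z) for ReLU: 1 where z is positive, else 0
        a = z * mask
        weights.append(W)
        masks.append(mask)
    delta_out = rng.standard_normal(n)  # stand-in for the gradient arriving at the top
    delta = delta_out
    for W, mask in zip(reversed(weights), reversed(masks)):
        delta = W.T @ (mask * delta)  # delta^(l) = W^T (phi'(z) * delta^(l+1))
    return float(np.linalg.norm(delta) / np.linalg.norm(delta_out))


print("ReLU, variance 2/n:", gradient_norm_ratio(gain=2.0))
print("ReLU, variance 1/n:", gradient_norm_ratio(gain=1.0))
```

<p>With variance \( 2/n \) the ratio should stay of order one, while \( 1/n \) shrinks the gradient norm by roughly \( \sqrt{2} \) per layer.</p>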

<h2 id="why-does-this-still-work-with-nonlinear-activations">Why does this still work with nonlinear activations?</h2>

<p>Common activation functions behave approximately linearly around the origin. At initialization, pre-activations are centered and have controlled variance, so the network operates near this linear regime. As a result, the variance-propagation calculations, which are exact for linear activations, remain accurate approximations. For ReLU, we can account for the gating effect by adjusting the weight variance by a factor of two. For smooth activations like tanh or GELU, we can similarly compute the derivative near zero to set the appropriate scaling. In each case, the upshot is that—because we’ve kept initialization near the identity—each layer maps inputs to outputs without exponential instability.</p>
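<p>As a worked check of the ReLU factor of two, assume (for illustration) that a pre-activation \( z \) has a density \( p(z) \) symmetric about zero with variance \( s^2 \), for example \( z \sim \mathcal{N}(0, s^2) \). Then only the positive half of the distribution contributes to the second moment of the activation:</p>

\[\mathbb{E}\!\left[\mathrm{ReLU}(z)^2\right] = \int_0^\infty z^2\, p(z)\, dz = \frac{1}{2}\,\mathbb{E}\!\left[z^2\right] = \frac{s^2}{2}.\]

<p>Because the weights are independent and mean-zero, the next layer’s pre-activation variance is \( n\,\sigma^2 \) times this second moment, i.e. \( n\sigma^2 s^2 / 2 \), which equals \( s^2 \) exactly when \( \sigma^2 = 2/n \).</p>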

<h2 id="why-do-resnets-work-so-well">Why do resnets work so well?</h2>

<p>The incredible success of <a href="https://arxiv.org/abs/1512.03385">residual neural networks</a> (resnets) shows the practicality of staying close to the identity. A residual block updates via</p>

\[x_{\ell+1} = x_\ell + f_\ell(x_\ell),\]

<p>which is just an identity operation plus a small learned perturbation (in a discrete form). As depth grows, the network approximates a continuous-time flow described by an ordinary differential equation. That is, the sequence of layers behaves like a discretization of a continuous-time system,</p>

\[\frac{dx(t)}{dt} = f(t, x(t)).\]

<p>To understand how small perturbations to the input evolve through such a system, we can linearize each update. Over a short step of size \( dt \), a perturbation is transformed by a matrix of the form</p>

\[I + J(t)\,dt, \qquad J(t) = \frac{\partial f(t, x(t))}{\partial x},\]

<p>i.e. a near-identity matrix. The full effect of many such steps is therefore a product</p>

\[\left(I + J(t_L)\,dt\right)\cdots \left(I + J(t_1)\,dt\right).\]

<p>As the step size goes to zero, this product converges to the time-ordered (\( \mathcal{T} \)) exponential</p>

\[\mathcal{T}\exp\!\left( \int_0^1 J(t)\,dt \right).\]

<p>The overall transformation is then a time-ordered exponential of the accumulated Jacobians, analogous to the earlier argument about taking a product of small perturbations from the identity.</p>
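<p>To make the discretization picture concrete, here is a small sketch that treats a residual stack as forward Euler steps. The assumptions are mine: every block shares the same residual function \( f(x) = \tanh(Ax) \), the step size is \( dt = 1/L \), and <code>residual_forward</code> is a hypothetical name.</p>

```python
import numpy as np


def residual_forward(x0, A, depth):
    """Apply x_{l+1} = x_l + dt * tanh(A x_l) with dt = 1 / depth.

    This is forward Euler for dx/dt = tanh(A x) on the interval [0, 1].
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / depth
    for _ in range(depth):
        x = x + dt * np.tanh(A @ x)
    return x


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) / 2.0
x0 = rng.standard_normal(4)
for depth in (10, 100, 1000):
    print(depth, residual_forward(x0, A, depth))
```

<p>As the depth grows, the outputs should settle toward a single limit, which is the solution of the underlying ODE at \( t = 1 \).</p>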

<h2 id="summary">Summary</h2>

<p>This expository post tries to justify the practical recipes for initialization and explains why they are effective in keeping very deep neural networks numerically stable. Proper initialization is vital for avoiding the exponential growth or decay associated with products of random matrices. By choosing weights so that each layer is close to the identity, we ensure that the network behaves like a controlled exponential of small perturbations instead of an unstable sequence of far-from-identity operations. Variance-preserving initialization aligns forward activations and backward gradients to have stable scale across layers, while common nonlinearities can be incorporated by modest adjustments to the variance. Residual architectures explicitly encode this principle into the design, and thereby connect deep networks to the theory of continuous flows.</p>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="machine-learning" /><category term="tutorial" /><summary type="html"><![CDATA[Deep neural networks can be thought of as compositions of many simple transformations, each represented by a layer with trainable parameters. When the number of layers is large, the effect of multiplying many random matrices becomes exponentially unstable, i.e. they can grow or shrink exponentially. This is the primary reason that naive initialization leads to exploding or vanishing signals for both forward (activations) and backward (gradients). Nonetheless, stability is possible when each layer is close to the identity operation. 
With the right scaling of weights at initialization, a deep network acts like a time-discretized flow, and the total transformation resembles a matrix exponential of small perturbations.]]></summary></entry><entry><title type="html">Learning with LLMs</title><link href="https://jwuphysics.github.io/blog/2025/12/learning-with-llms/" rel="alternate" type="text/html" title="Learning with LLMs" /><published>2025-12-05T00:00:00+00:00</published><updated>2025-12-05T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/12/learning-with-llms</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/12/learning-with-llms/"><![CDATA[<p>AI is here, and its impacts on education cannot be overstated. Let’s put aside the issues of cheating; I assume that you <em>want</em> to learn, perhaps with the assistance of LLMs <em>if</em> they are actually helpful. But how do you know you’re not using AI as a crutch, versus using it to augment learning? The former setting outsources your thinking to AI, whereas the latter can help you reveal gaps in your understanding, bypass blockers that prevent learning, and/or tailor education to your style. In this post, I provide an analogy between learning and phase transitions in statistical mechanics, and describe recommendations and warnings on using LLMs in different learning scenarios.</p>

<h2 id="introduction">Introduction</h2>

<p>Even though it <em>feels</em> like you’re learning when you talk to LLMs, you might be selling yourself short: you’re liable to self-deception, thinking that you have a strong grasp of a topic when you do not. Nevertheless, I also believe that there are certain scenarios where AI <em>can</em> be used without compromising your ability to think. Exactly where the distinction occurs can vary from person to person, because everyone has different learning styles.</p>

<p>In my view, there are three regimes of learning: we begin by <em>building familiarity and intuition</em>, progress toward <em>spotting patterns and making connections</em>, and finally achieve <em>robust knowledge retention</em>. I’m going to use the idea of a phase transition from statistical mechanics as a high-level analogy for learning; these three regimes are really two phases, a subcritical phase (building intuition) and supercritical phase (robust knowledge), separated by a phase transition. It’s in this phase transition — the critical point — where connectivity between learned ideas becomes possible.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Throughout this post, I’m going to use two examples of learning: <em>addition</em> and <em>the covariance matrix</em>. The first one is based on me teaching my five-year-old daughter simple arithmetic, and the second reflects my own learning journey from an undergraduate physics major to a tenure-track researcher. In the example of learning about single-digit addition, you can imagine “summarization via an LLM” is like “asking dad for the answer” or “asking dad <em>why</em>”. In the second example, you can just imagine asking ChatGPT for intuitions about <em>covariance</em> or <em>PCA</em>.</p>

<h2 id="subcriticality-the-early-phase-of-learning">Subcriticality: The Early Phase of Learning</h2>

<p>At the earliest stage of learning, concepts don’t really leave a lasting impression. A number like <em>5</em> is just a symbol, and a word like <em>covariance</em> is simply some letters on a page. It is painfully difficult to connect these concepts to other ideas that you’re supposed to be learning at the same time, e.g., the number <em>4</em> is smaller and the number <em>6</em> is larger than <em>5</em>. Or maybe that your sample covariance matrix can be <em>diagonalized</em> to find <em>eigenvectors</em> and <em>eigenvalues</em>. And you could maybe remember these facts and procedures. But if somebody described Principal Components Analysis using some other choice of words, then you’d have no idea that they were describing the same ideas!</p>

<p>The problem is that in this <strong>subcritical</strong> phase, concepts easily fizzle out. They’re totally disconnected from your intuition, because that intuition needs to be formed via related concepts. If you have no intuition for all of the properties of the number <em>5</em> — that it is odd, that it is half of ten, that it is three less than eight, etc., then it might as well just be any random symbol on a page. You see symbols, but not the underlying <em>structure</em> of these numbers, probably because you simply haven’t spent enough time staring at them and thinking about their relationships.</p>

<p>At this stage, it might be easiest to just learn via rote memorization. (This varies by person — I have horrible memory, so I hate this phase of learning.) Back in undergrad, I remember buying the PDF for <em><a href="https://learnpythonthehardway.org/">Learn Python the Hard Way</a></em>, a textbook for learning the Python programming language. I printed out each of the pages, so that I would have to manually type in code, rather than copy-paste it! This helped me build up muscle memory and think about the Python code as I typed it in.</p>

<p>Lots of folks have found that spaced repetition is the best way to improve learning at this stage (e.g., <a href="https://gwern.net/spaced-repetition">Gwern</a>, <a href="https://apps.ankiweb.net/">Anki cards</a>, <a href="https://news.ycombinator.com/item?id=44020591">HN discussions</a>). At its core, spaced repetition is just testing how well you’ve remembered things over progressively longer time periods — the intervals get shorter if you forget something, and get longer if you are able to continue recalling it.</p>
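<p>The core scheduling rule fits in a few lines. This is a toy sketch, not the algorithm that Anki or any other tool actually uses; the function name, the doubling and halving factors, and the one-day floor are all made up for illustration.</p>

```python
def next_interval(interval_days, recalled, grow=2.0, shrink=0.5, min_days=1.0):
    """Lengthen the review interval after a successful recall; shorten it after a lapse."""
    factor = grow if recalled else shrink
    return max(min_days, interval_days * factor)


interval = 1.0
for recalled in (True, True, False, True):
    interval = next_interval(interval, recalled)
    print("recalled" if recalled else "forgot", "-> next review in", interval, "days")
```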

<p>While there’s certainly some benefit to <em>assembling</em> the spaced repetition system (i.e., constructing a deck of Anki cards), I think that the most valuable technique is to regularly <em>use</em> it. That is, it’s okay if somebody else procures the answer for you. It’s okay if the definition of <em>sample covariance matrix</em> comes from a textbook or Wikipedia or Claude. You’re still at the subcritical phase, and you’ll need more exposure to the concepts before things start to click.</p>

<p>However, this phase of learning is still meant to be somewhat difficult. You should expect friction! Recall that it’s easy to nod your head in agreement when you read something that makes sense, but it’s far more difficult (and far more valuable) to re-derive it yourself.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Once you become familiar enough with a concept, it becomes much more rewarding to test that knowledge to see if you’ve reached the critical point.</p>

<p>I think that current LLMs are highly useful for testing your knowledge. For example, to assist my own learning, I frequently use a “Socratic” prompt with Gemini (although it’s out of date, and note that it was written with the assistance of Gemini 2.5 Pro):</p>
<blockquote>
  <p>Purpose and Goals:</p>
  <ul>
    <li>Help users refine their understanding of a chosen topic (Topic X).</li>
    <li>Facilitate learning through a Socratic method, prompting users to explain concepts and asking probing questions.</li>
    <li>Identify and address misunderstandings by testing the user’s conceptual knowledge.</li>
    <li>Focus on sharpening the user’s intuition and conceptual understanding rather than rote memorization.</li>
  </ul>
  <p>[…]</p>
  <p>The entire prompt can be found <a href="https://gist.github.com/jwuphysics/dace4952279e4b047c6c0591d09895d6">here</a>.</p>
</blockquote>

<p>This method of using LLMs should not make you reliant on AI. It does not outsource your thinking. Like the <a href="https://en.wikipedia.org/wiki/How_to_Solve_It"><em>How To Solve It</em> method</a> devised by the great mathematician George Pólya, and the eponymous <a href="https://solve.it.com/">SolveIt platform</a> run by great educator Jeremy Howard, my aim here is to demonstrate how to use LLMs as a <em>personalized tool</em> for testing your understanding. LLMs are now powerful enough that (for most topics), they can spot holes in your thinking; however, given their tendencies toward sycophancy, LLMs must be prompted carefully.</p>

<h2 id="supercriticality-the-late-phase-of-learning">Supercriticality: The Late Phase of Learning</h2>

<p>At some point, the dots really start to connect. Beyond the critical point, all your knowledge is linked together. You have intuitions for concepts that you may not have heard of. You’re so comfortable with addition that you also intuitively grasp concepts like commutativity (<em>1 + 3 = 3 + 1</em>) or inverses (<em>adding one to three makes four, so taking away one from four makes three</em>), even though you may not have heard of (or recall) the jargon from algebra or group theory. In any event, you have a robust conceptual understanding, and all that remains is to give names to these well-understood concepts.</p>

<p>In this phase, learning <em>should</em> feel easy and fun. There are likely still gaps in your knowledge, but it’s quite straightforward to fill them in. Your knowledge is robust even when you’re missing certain pieces of information because you’ve trodden all around that <em>terra incognita</em>, so new knowledge doesn’t dramatically upend your understanding.</p>

<p>My late father had a PhD in chemistry. He loved to personify everything and attach <em>feelings</em> to them: oxygen <em>wants</em> to gain two electrons, the molybdenum <em>likes</em> to react when this catalyst is present, etc. We develop a similar feel for concepts when our understanding passes the critical point. And this intuition is vital for pursuing novel research ideas and making scientific discoveries.</p>

<p>Or, you can plausibly extend your knowledge to other domains because you have crystallized the relevant intuitions. For example, in the excellent pedagogical text <a href="https://arxiv.org/pdf/1008.4686"><em>Data analysis recipes: Fitting a model to data</em></a>, David Hogg et al. write:</p>
<blockquote>
  <p>The inverse covariance matrix appears in the construction of the \( \chi^2 \) objective function like a linear “metric” for the data space: It is used to turn a N-dimensional vector displacement into a scalar squared distance between the observed values and the values predicted by the model. This distance can then be minimized. This idea—that the covariance matrix is a metric—is sometimes useful for thinking about statistics as a physicist.</p>
</blockquote>
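<p>In code, that “metric” reading amounts to the quadratic form \( r^\top C^{-1} r \), where \( r \) is the residual between data and model. Here is a minimal sketch; the function name and the toy numbers are mine, not from the paper.</p>

```python
import numpy as np


def chi_squared(y_obs, y_model, cov):
    """Squared distance r^T C^{-1} r between observed and predicted values."""
    r = np.asarray(y_obs, dtype=float) - np.asarray(y_model, dtype=float)
    # Solving C u = r avoids forming the inverse covariance explicitly.
    return float(r @ np.linalg.solve(cov, r))


cov = np.array([[4.0, 0.0], [0.0, 1.0]])
print(chi_squared([2.0, 1.0], [0.0, 0.0], cov))
```

<p>With a diagonal covariance this reduces to the familiar sum of squared residuals, each divided by its variance.</p>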

<p>While physicists can be guilty of thinking that they can leap into other fields (<a href="https://xkcd.com/793/">relevant XKCD</a>), they often <em>do</em> have a strong grasp of mathematics and physical intuition. This combination is invaluable at the supercritical stage: the language of mathematics often translates well to other disciplines, while the intuition from physics can be helpful for predicting dynamics or phenomena given those mathematical laws.</p>

<p>In the supercritical phase of learning, LLMs can be helpful. They are pretty good at identifying analogous concepts in alternate fields that you might not know about, acting as both the proverbial <em>water cooler</em> and the multidisciplinary scientists that congregate around it. LLMs can also be used to quickly refresh ideas that you have briefly forgotten, like going back to reference your old textbooks to check some relevant information. However, this can also be dangerous if you <em>think</em> you’re past the critical point — but in reality you aren’t (often, because your confidence is inflated by talking too much to LLMs).</p>

<h2 id="the-pitfalls-of-using-llms-to-help-you-learn">The pitfalls of using LLMs to help you learn</h2>

<p>I’m reminded of one salient point from <a href="https://www.alignmentforum.org/posts/hjMy4ZxS5ogA9cTYK/how-i-think-about-my-research-process-explore-understand">Neel Nanda’s delightful essays on research taste</a>. While not the main focus of those pieces, he explains (emphasis mine):</p>
<blockquote>
  <p>Junior researchers often get stuck in the early stages of a project and don’t know what to do next. In my opinion this is because they think they are in the <strong>understanding stage</strong>, but are actually in the <strong>exploration stage</strong>.</p>
</blockquote>

<p>In other words, junior researchers can sometimes believe that they have crystallized their understanding of a topic (i.e. supercritical), when in reality they are still in an earlier stage of learning (subcritical)! This is particularly worrisome when LLMs can summarize topics in easily understandable ways, deceiving junior researchers into feeling like they confidently understand a topic because they’ve understood the simple LLM summarization.</p>

<p>LLMs are truly a double-edged sword for learning.</p>

<p>On one hand, they can be helpful by testing your knowledge in new ways (e.g. the Socratic approach I mentioned above; I also have another prompt that quizzes me with PhD qualifying exam-like questions). LLMs can help you get unstuck when your canonical text doesn’t explain something in a manner intelligible to you. They can get rid of irrelevant roadblocks (e.g., you’re learning about neural network optimization, but stuck in CUDA dependencies hell). LLMs can spin up highly individualized games that help you learn concepts in a way that’s much more fun than doing practice problems.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>On the other hand, LLMs can leave you with a completely shallow understanding of a topic — while you feel like you totally understand it all! This is compounded by the fact that LLMs will tend toward positivity. Do not let your confidence be bolstered by hollow AI validation. Be vigilant and skeptical, because uncritical use of AI tools will absolutely inhibit your learning.</p>

<h2 id="the-pitfalls-of-using-llms-for-summarization">The pitfalls of using LLMs for summarization</h2>

<p>One can imagine a world where AI widens the gap between those who <a href="https://paulgraham.com/writes.html">practice writing and those who do not</a>. This is problematic because — as all experienced researchers know — <a href="https://www.nature.com/articles/s44222-025-00323-4">writing is thinking</a>. If we don’t practice writing, then we shut the door on an entire mode of thinking.</p>

<p><em>But what about reading?</em> Sometimes it feels like a slog to read through long articles, especially technical, information-dense academic papers. Why not just get it all distilled into a single paragraph? Or turn it into a podcast-like audio overview? As I <a href="https://bsky.app/profile/jwuphysics.bsky.social/post/3m65eelo5gs2f">wrote on social media</a>, <em>using AI to summarize information is also a way to outsource your thinking</em>.</p>

<p>When we read or write, we are constantly re-organizing our understanding of topics. This happened at least three times for the very blog post you’re reading; the title and content have evolved dramatically over the past two weeks since I began writing it.</p>

<p>I contend that <strong>summarization is thinking</strong>. When I am reading about a new topic, I know that I’ve understood it <em>only</em> when I can accurately and concisely summarize it.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> Robust summarization is only possible when you can grasp the big picture intuitions and connect them to the minute details. That mental organization <strong>is</strong> a part of the learning process. When an LLM does this organization on your behalf, then your mental muscles atrophy.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>But don’t worry too much about the details of this percolation theory analogy; like all analogies, it breaks down under scrutiny. I hope you’re not distracted by wondering about the order parameter or anything like that. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I sometimes call this the <em>generator–discriminator asymmetry</em>. The term commonly describes how it’s far easier for a GAN to discriminate between real and generated outputs than to generate a new output that can fool the discriminator. It can also be used to refer to human learners: discriminating right from wrong information is easier than correctly deducing something from scratch! (Side bar to my footnote: this also gets at why multiple choice questions are <a href="https://www.nature.com/articles/s41598-025-26036-7">bad for evaluating LLMs</a>!) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Gemini 3 is stunningly good at this. In about 60 seconds, it created a simple HTML game for my daughter to learn simple combinatorics, based on my prompt to center the game around making permutations and combinations of <em>N</em> ice cream flavors given <em>P</em> scoops of ice cream on a cone. She loved it! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>It is also incredibly easy to discern which junior researchers have truly understood a topic versus those who haven’t by asking them to summarize a topic. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="education" /><category term="llms" /><category term="personal" /><summary type="html"><![CDATA[AI is here, and its impacts on education cannot be overstated. Let’s put aside the issues of cheating; I assume that you want to learn, perhaps with the assistance of LLMs if they are actually helpful. But how do you know you’re not using AI as a crutch, versus using it to augment learning? The former setting outsources your thinking to AI, whereas the latter can help you reveal gaps in your understanding, bypass blockers that prevent learning, and/or tailor education to your style. In this post, I provide an analogy between learning and phase transitions in statistical mechanics, and describe recommendations and warnings on using LLMs in different learning scenarios.]]></summary></entry><entry><title type="html">Re-envisioning Euclid Galaxy Morphology</title><link href="https://jwuphysics.github.io/blog/2025/10/euclid-galaxy-morphology/" rel="alternate" type="text/html" title="Re-envisioning Euclid Galaxy Morphology" /><published>2025-10-29T00:00:00+00:00</published><updated>2025-10-29T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/10/euclid-galaxy-morphology</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/10/euclid-galaxy-morphology/"><![CDATA[<p>With the <a href="https://www.cosmos.esa.int/web/euclid"><em>Euclid</em></a> and <a href="https://science.nasa.gov/mission/roman-space-telescope/"><em>Roman Space Telescope</em></a> missions ready to image billions of galaxies, we’ll need data-driven methods to find new, rare phenomena that exist outside human-defined taxonomies! Sparse Autoencoders (SAEs) can be that discovery engine, surfacing interpretable features in modern galaxy surveys. 
This blog post highlights some preliminary results from our tiny NeurIPS <a href="https://ml4physicalsciences.github.io/2025/">ML4PS workshop</a> paper, jointly led by <a href="https://walmsley.dev/">Mike Walmsley</a> and me. Read the paper <a href="https://arxiv.org/abs/2510.23749">here</a>.</p>

<h2 id="galaxy-morphologies-in-euclid">Galaxy Morphologies in <strong><em>Euclid</em></strong></h2>

<p><em>Euclid</em><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> is now delivering crisp, space-based imaging for millions of galaxies. Among the many scientific results presented in their <a href="https://www.euclid-ec.org/science/q1/">Q1 (“Quick Data Release 1”)</a> is a citizen science analysis of galaxy morphologies presented by Mike Walmsley et al. <a href="https://arxiv.org/abs/2503.15310">This paper</a> presents not only GalaxyZoo (GZ) volunteer classifications according to a decision tree — i.e., Is this galaxy featured? Does it have spiral arms? How many? — but also a <a href="https://jwuphysics.github.io/blog/2025/05/foundation-models-in-astronomy/">foundation model</a> (Zoobot) fine-tuned for predicting these decision classes on the <em>Euclid</em> galaxy dataset. You can check out the <a href="https://walmsley.dev/posts/zoobot-scaling-laws">Zoobot v2.0 blog post</a> and download it via <a href="https://github.com/mwalmsley/zoobot">Github</a>.</p>

<p>But Zoobot follows a supervised approach: we’ve delineated the taxonomy into which galaxies must fit. By definition, this CNN learns representations that enable it to accurately describe galaxies <em>according</em> to the GZ categories. Can we get a neural network model to represent galaxies outside of this taxonomy?</p>

<p>Yes! Our first result from this paper is to present a <a href="https://huggingface.co/mwalmsley/euclid-rr2-mae"><strong>Masked Autoencoder</strong> (MAE)</a> that learns self-supervised representations of galaxy imaging. Our MAE chops up images into 8×8 patches, and consists of a custom vision transformer (ViT) encoder with ~30M parameters, and a three-layer decoder. To get a sense of how it works, I highly recommend checking out the interactive demo built by Mike: https://huggingface.co/spaces/mwalmsley/euclid_masked_autoencoder</p>

<p>Even when you remove 90% of the pixels of a Euclid image, the MAE can learn to reconstruct the rest of the image extraordinarily well. Honestly, it does far better than any human can. And not only does it work for galaxy images, but the MAE also learns to reconstruct bright stars and other objects in the <a href="https://huggingface.co/datasets/mwalmsley/euclid_rr2">Euclid RR2 dataset</a>.</p>
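
<p>To make the masking step concrete, here is a minimal NumPy sketch of cutting an image into 8×8 patches and hiding 90% of them. The function names, the 64×64 stand-in image, and the uniform random masking are my illustrative assumptions; this is not the actual MAE training code.</p>

```python
import numpy as np

def patchify(image, patch=8):
    """Split a (H, W) image into non-overlapping (patch x patch) tiles."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    # Reorder to (row of tiles, column of tiles, pixel row, pixel column).
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(n_patches, mask_ratio=0.9, seed=0):
    """Boolean array that is True for patches hidden from the encoder."""
    rng = np.random.default_rng(seed)
    n_masked = int(round(mask_ratio * n_patches))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.permutation(n_patches)[:n_masked]] = True
    return mask

image = np.random.default_rng(1).normal(size=(64, 64))  # hypothetical cutout
patches = patchify(image)          # 64 patches, each with 64 pixels
mask = random_mask(len(patches))   # hide roughly 90% of the patches
visible = patches[~mask]           # only these would reach the encoder
```

<p>In a real MAE, the decoder is then asked to predict the hidden patches from the few visible ones.</p>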

<h2 id="principal-components-of-galaxy-image-embeddings">Principal Components of Galaxy Image Embeddings</h2>

<p>Okay, so we have trained models, which means that we can encode <em>Euclid</em> images into Zoobot (supervised; <em>d=640</em>) and/or MAE (self-supervised; <em>d=384</em>) embeddings. How do we interpret these learned embedding vectors?</p>

<p>A good starting point is to use PCA (principal components analysis); the top PCs should summarize most of the variation in each dataset. It’s worth emphasizing that the supervised (Zoobot) and self-supervised (MAE) models are trained on <em>different</em> datasets: the Zoobot dataset comprises ~380k well-resolved galaxy images from <em>Euclid</em>, whereas the MAE dataset comprises &gt;3M <em>Euclid</em> images of galaxies, stars, artifacts, etc. <strong>Thus, it is not possible to make an apples-to-apples comparison between these two datasets or their embeddings.</strong></p>
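
<p>As a rough sketch of this step, PCA can be computed from the SVD of the mean-centered embedding matrix. The random array below is only a stand-in for real <em>d=384</em> embeddings, and the helper name <code>pca</code> is hypothetical rather than taken from our code.</p>

```python
import numpy as np

def pca(embeddings, n_components=5):
    """PCA via SVD of the mean-centered embedding matrix."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T               # per-image PC coefficients
    explained = (s**2 / np.sum(s**2))[:n_components]
    return scores, components, explained

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 384))     # stand-in for MAE embeddings
scores, components, explained = pca(fake_embeddings)

# Indices of the images that most strongly express the first PC.
top_examples = np.argsort(scores[:, 0])[::-1][:10]
```

<p>Sorting images by their coefficient on a given PC is one simple way to pull out "top examples" for that component.</p>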

<p><img src="/images/blog/euclid-gallery.png" alt="Figure 2 from Wu &amp; Walmsley 2025, showing PCA (top) and SAE (bottom) features extracted from Zoobot (left) and MAE (right) model embeddings. Note that these two models are trained on different datasets." /></p>

<p>The figure above, copied from <a href="https://arxiv.org/abs/2510.23749">Figure 2 in the paper</a>, displays top image examples for the first five PCs for Zoobot (<em>left</em>) and MAE (<em>right</em>) model embeddings. Some interesting notes:</p>
<ul>
  <li>For the supervised Zoobot embeddings, our first few PCs are well-aligned with the first few nodes of the GZ decision tree.
    <ul>
      <li>For example, the first PC mirrors the first GZ question of whether the galaxy is smooth or featured with Spearman <em>r</em>≈0.85.</li>
      <li>The next questions align with whether a featured galaxy has a disk that is seen edge-on, or has spiral arms, or has a prominent bulge, etc.</li>
      <li>Note that PCs can have both positive and negative coefficients, so the first PC with a very positive coefficient would characterize a very smooth (e.g. spheroidal) galaxy, while a very negative coefficient would characterize a strongly featured (e.g. spiral arms) galaxy!</li>
    </ul>
  </li>
  <li>For the self-supervised MAE embeddings, the representations are totally different than before.
    <ul>
      <li>In several of the top PCs, we find cosmic ray hits or other imaging artifacts.</li>
      <li>We think these dominate much of the MAE latent space because it’s fundamentally challenging to reconstruct images with imaging artifacts!</li>
      <li>Galaxies also appear in here, although individual PCs do not align nearly as strongly to the GZ categories.</li>
    </ul>
  </li>
</ul>

<p>PCA is nice because it rank-orders features by how much they explain the variance in the embedding vectors. But what if the features you want require <em>non-linear</em> combinations of embeddings? Or what if your original embeddings are noisy, so each PC depends on <em>all</em> inputs — this might result in uninterpretable features.</p>

<h2 id="sparse-autoencoders-for-interpretability-and-discovery">Sparse Autoencoders for Interpretability and Discovery</h2>

<p>For this reason, we chose to use a sparse coding method, Matryoshka Sparse Autoencoders (SAEs), to discover features! They’re extremely simple: embeddings get fed into a single-layer encoder (with ReLU activation), wherein only a few neurons are allowed to be active.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> From these sparse activations, a single-layer decoder (i.e. a projection matrix) learns to reconstruct the original embeddings. Because the latent activations are sparse, the SAE must use only a <em>few</em> neurons to reconstruct each given input (i.e., the embedding of each image), which results in more interpretable features. Possibly even <strong>monosemantic features</strong>: that is, instead of a many-to-many mapping between neuron activations and semantic concepts, we can use SAEs to recover a one-to-one mapping between activations and concepts.</p>
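
<p>Here is a minimal sketch of an SAE forward pass, under some loud assumptions: I use a plain per-sample top-<em>k</em> rule instead of the BatchTopK and Matryoshka grouping described in the footnote, the weights are random rather than trained, and the latent size and <em>k</em> are made-up values.</p>

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k=4):
    """Encode with ReLU, keep the k largest activations per sample, then decode."""
    acts = np.maximum(x @ w_enc + b_enc, 0.0)       # single-layer ReLU encoder
    # Indices of everything except the k largest activations in each row.
    drop = np.argsort(acts, axis=1)[:, :-k]
    sparse = acts.copy()
    np.put_along_axis(sparse, drop, 0.0, axis=1)
    recon = sparse @ w_dec + b_dec                  # single-layer linear decoder
    return sparse, recon

rng = np.random.default_rng(0)
d, n_latent = 384, 1536          # overcomplete: more latents than input dimensions
x = rng.normal(size=(8, d))      # stand-in for a batch of embeddings
w_enc = rng.normal(size=(d, n_latent)) / np.sqrt(d)
w_dec = rng.normal(size=(n_latent, d)) / np.sqrt(n_latent)
sparse, recon = sae_forward(x, w_enc, np.zeros(n_latent), w_dec, np.zeros(d))

# Reconstruction error that a training loop would minimize.
loss = np.mean((recon - x) ** 2)
```

<p>The columns of <code>sparse</code> are the candidate "features": each input is described by only a handful of nonzero activations.</p>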

<p>Or so the story goes. Famously, Anthropic found a <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/">Golden Gate Bridge feature in Claude</a> that activates on both text and images! But… while SAEs are sure to learn sparse, non-linear combinations in an overcomplete space, we don’t actually have mathematical guarantees that SAEs will find monosemantic or disentangled features. What does monosemanticity even really mean? Should galaxies with Sersic indices of 2.1 activate a different feature than galaxies with Sersic indices of 2.2? Indeed, there is significant evidence that SAEs do not fare as well as linear probes for <em>already known</em> features, leading <a href="https://deepmindsafetyresearch.medium.com/negative-results-for-sparse-autoencoders-on-downstream-tasks-and-deprioritising-sae-research-6cadcfc125b9">some</a> <a href="https://www.alignmentforum.org/posts/osNKnwiJWHxDYvQTD/takeaways-from-our-recent-work-on-sae-probing">research</a> <a href="https://arxiv.org/abs/2501.17148">teams</a> to focus on other topics in mechanistic interpretability.</p>

<p>Anyway, let’s just see what happens. Take a look at the figure above again, and now focus on the bottom panels. These now show the first five SAE features, ranked in order of how frequently they are activated. For the supervised example (on the lower left), we can see reasonably coherent/interpretable features: two-armed spirals, ringed galaxies, spheroidal galaxies, elliptical galaxies, and objects with tidal features, clumps, or companions. (This last one is the least monosemantic, but it’s intriguing because each of those features can be indicative of galaxy–galaxy interactions or mergers!) For the self-supervised MAE (on the lower right), we also see some consistency in SAE-extracted features. Huh!</p>

<p>We then quantify how well the PCA and SAE features align with GZ features, using the Spearman rank correlation coefficient I discussed earlier. Again, we shouldn’t compare between the supervised and self-supervised models, but we can now compare PCA and SAE features! And we find a clear winner: SAE features <em>are</em> typically more aligned with the GZ taxonomy!</p>
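
<p>A simple way to picture this comparison: compute the Spearman rank correlation between every extracted feature and every label, then keep the best match per label. The sketch below assumes no tied values and uses "largest absolute correlation" as the alignment score, which is my simplification and may differ from the exact metric in the paper.</p>

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (no tie handling)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

def best_alignment(features, labels):
    """For each label column, the largest |Spearman r| over all feature columns."""
    n_feat, n_lab = features.shape[1], labels.shape[1]
    return np.array([
        max(abs(spearman(features[:, i], labels[:, j])) for i in range(n_feat))
        for j in range(n_lab)
    ])

rng = np.random.default_rng(0)
labels = rng.normal(size=(500, 2))                  # hypothetical GZ-like vote fractions
features = np.column_stack([labels[:, 0] ** 3,      # monotonic in label 0
                            rng.normal(size=500)])  # pure noise
alignment = best_alignment(features, labels)
```

<p>Because Spearman only cares about rank order, the cubed feature aligns perfectly with the first label even though the relationship is non-linear.</p>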

<p>Qualitatively, we also find that the SAE <em>can</em> surface interesting features. This is most evident in the features extracted from Zoobot embeddings, where we know the supervised training objective. For example, we find examples of ring galaxies or dust lanes in edge-on disk galaxies — visually clear signatures of coherent features that <em>aren’t</em> in the training objective. The MAE model is probably full of interesting SAE-extracted features, too, but some of them are definitely challenging to interpret.</p>

<p>Anyway, there’s much more to say, but at this point the blog post might be comparable in length to our workshop paper! Just go read the <a href="https://arxiv.org/abs/2510.23749">paper</a>, or try it out using our <a href="https://github.com/jwuphysics/euclid-galaxy-morphology-saes">code</a> — I’d love to hear what you think!</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Why do we italicize <em>Euclid</em>? Well, this observatory is also technically a spaceship, and all names of ships (including spaceships) <a href="https://style.mla.org/format-the-name-of-a-ship/">should be italicized according to the MLA</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>We actually use <a href="https://arxiv.org/abs/2412.06410">BatchTopK sparsity</a>, and also nest the SAE activations in “groups” that progressively expand the sparsity bottleneck (i.e., <a href="https://arxiv.org/abs/2503.17547"><em>Matryoshka</em> SAEs</a>). We also imposed L1 sparsity and revived dead neurons with an auxiliary loss term. Note that SAEs also typically demand an overcomplete latent space. Each of these hyperparameters affects training and subsequent feature extraction; Charlie O’Neill and Christine Ye et al. looked into some of these SAE hyperparameter interactions in an <a href="https://arxiv.org/abs/2408.00657">earlier paper</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="computer-vision" /><category term="galaxies" /><category term="interpretability" /><category term="machine-learning" /><category term="research" /><summary type="html"><![CDATA[With the Euclid and Roman Space Telescope missions ready to image billions of galaxies, we’ll need data-driven methods to find new, rare phenomena that exist outside human-defined taxonomies! Sparse Autoencoders (SAEs) can be that discovery engine, surfacing interpretable features in modern galaxy surveys. This blog post highlights some preliminary results from our tiny NeurIPS ML4PS workshop paper, jointly led by Mike Walmsley and me. Read the paper here.]]></summary></entry><entry><title type="html">Galaxy environments and graph neural networks</title><link href="https://jwuphysics.github.io/blog/2025/07/gnns-galaxy-environments/" rel="alternate" type="text/html" title="Galaxy environments and graph neural networks" /><published>2025-07-31T00:00:00+00:00</published><updated>2025-07-31T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/07/gnns-galaxy-environments</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/07/gnns-galaxy-environments/"><![CDATA[<p>This post discusses how graph neural networks (GNNs) can model the galaxy–halo connection within its large-scale surroundings. Dark matter structures, which seem to account for most of the mass in the Universe, can be represented as nodes in a cosmic graph. But dark matter—which solely interacts via gravitation—is also much easier to simulate than the messy baryons, whose magnetohydrodynamics are computationally expensive. By exploiting the representational power of GNNs, can we predict galaxies’ <em>baryonic</em> properties purely using simple dark matter-only simulations? Yes we can!</p>

<p>Note: this post is a continuation of a previous <a href="https://jwuphysics.github.io/blog/2025/06/graph-neural-networks-in-astrophysics/">introduction to GNNs in astrophysics</a>. Special thanks to <a href="https://astrockragh.github.io/">Christian Kragh Jespersen</a>,<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> who opened my eyes to the incredible power of GNNs for astrophysics! He also has several papers showing that graphs provide strong representations for <em>galaxy merger trees</em> (see <a href="https://ui.adsabs.harvard.edu/abs/2022ApJ...941....7J/abstract">here</a> and follow-up <a href="https://ui.adsabs.harvard.edu/abs/2024ApJ...965..101C/abstract">here</a>).</p>

<h2 id="the-galaxyhalo-connection">The galaxy–halo connection</h2>

<p>In the ΛCDM cosmology, galaxies live in dark matter subhalos<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> (see, e.g., the review by <a href="https://www.annualreviews.org/content/journals/10.1146/annurev-astro-081817-051756">Wechsler &amp; Tinker</a>). While dark matter dominates the mass content of the Universe, we can only directly observe the luminous signatures from galaxies that reside within. Our goal is to determine whether galaxy properties, such as its total stellar mass, can be predicted purely from dark matter simulations.</p>

<p>To solve this problem 20 years ago, a technique called “subhalo abundance matching” was proposed. The goal is to connect simulated dark matter subhalos to galaxy populations based on the latter’s <a href="https://ui.adsabs.harvard.edu/abs/2004ApJ...609...35K/abstract">stellar masses</a> (or <a href="https://ui.adsabs.harvard.edu/abs/2004MNRAS.353..189V/abstract">luminosities</a>). By rank-ordering the subhalo masses and assigning them to rank-ordered galaxy stellar masses, abundance matching imposes a monotonic relationship between the two populations.</p>
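
<p>In its simplest form, the rank-ordering idea looks like the sketch below. I assume equal numbers of subhalos and galaxies drawn from the same volume, with no scatter; the masses are made-up values for illustration only.</p>

```python
import numpy as np

def abundance_match(subhalo_mass, stellar_mass_sample):
    """Assign the i-th most massive galaxy to the i-th most massive subhalo."""
    assert len(subhalo_mass) == len(stellar_mass_sample)
    order = np.argsort(subhalo_mass)              # subhalos from least to most massive
    assigned = np.empty(len(subhalo_mass), dtype=float)
    assigned[order] = np.sort(stellar_mass_sample)
    return assigned

log_m_halo = np.array([11.2, 13.5, 12.1, 10.8])   # hypothetical log10 subhalo masses
log_m_star = np.array([10.9, 8.7, 9.6, 10.1])     # hypothetical log10 stellar masses
matched = abundance_match(log_m_halo, log_m_star)
# matched is [9.6, 10.9, 10.1, 8.7]: the most massive subhalo gets the most massive galaxy
```

<p>By construction the assigned stellar mass is a monotonic function of subhalo mass, and nothing else about the subhalo enters.</p>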

<p>This simple technique is capable of connecting galaxies to their host halos. However, it also assumes that galaxy evolution is not dictated by anything <em>but</em> the dark matter halo properties. Therefore, abundance matching fails to account for each galaxy’s large-scale environment!</p>

<h2 id="to-the-cosmic-web-and-beyond">To the cosmic web and beyond</h2>

<p>We’ve known for a long time that galaxy properties depend on their surroundings (see, e.g., <a href="https://ui.adsabs.harvard.edu/abs/1980ApJ...236..351D/abstract">Dressler’s famous 1980 paper</a>). The exact nature of how this plays out is uncertain; does galaxy environment induce different mass accretion or merger rates? Do overdense environments superheat or exhaust cool gas needed to fuel star formation? Or do large-scale tidal torques alter galaxy properties over cosmic timescales? We don’t really know the answer!<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> But empirically, we do know that the galaxy–halo connection also varies with environment.</p>

<p><img src="/images/blog/illustris-cosmic-web.jpg" alt="Illustris TNG simulation of galaxies amidst the cosmic web. Red to white colors indicate ionized hot gas, while the blue-purple colors indicate dark matter density." /></p>

<h3 id="overdensity">Overdensity</h3>

<p>Some attempts have been made to account for galaxy environment. For example, “overdensity” is a common parameterization of the mass density on large scales (see, e.g., <a href="https://ui.adsabs.harvard.edu/abs/2006ApJ...645..977B/abstract">Blanton et al. 2006</a>). Whereas a typical galaxy’s gravitational influence extends to a few hundred kpc, the overdensity can quantify the average density out to many Mpc. However, by taking a simple average over all mass in this spherical volume, the overdensity parameter is not sensitive to local variations in mass.</p>
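
<p>For intuition, a spherically averaged overdensity can be sketched as the enclosed mass density divided by the mean density, minus one. The version below ignores periodic boundaries and uses a brute-force pairwise distance matrix, so treat it as a toy under those assumptions rather than the estimator used in any particular paper.</p>

```python
import numpy as np

def overdensity(positions, masses, radius, box_volume):
    """delta = rho_sphere / rho_mean - 1 around each object (no periodic wrapping)."""
    mean_density = masses.sum() / box_volume
    sphere_volume = 4.0 / 3.0 * np.pi * radius**3
    # Pairwise distances: fine for a small demo, too slow for a full simulation box.
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    inside = (radius >= dist).astype(float)       # includes each object itself
    enclosed_mass = inside @ masses
    return enclosed_mass / sphere_volume / mean_density - 1.0

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 50.0, size=(200, 3))  # hypothetical positions in Mpc
masses = np.ones(200)                              # equal masses for simplicity
delta = overdensity(positions, masses, radius=3.0, box_volume=50.0**3)
```

<p>Every object inside the sphere contributes equally regardless of where it sits, which is exactly why this statistic washes out local structure.</p>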

<h3 id="disperse">DisPerSE</h3>

<p>Another popular technique called <a href="https://www2.iap.fr/users/sousbie/web/html"><em>DisPerSE</em></a> aims to measure topological structures in the cosmic web, e.g., voids, filaments, sheets, and clusters. DisPerSE is short for Discrete Persistent Structure Extractor, and the general idea is to: (1) compute a density field from the simulation particles, (2) identify critical points of the field like minima, saddle points, and maxima, (3) trace out the “skeleton” between critical points, and (4) filter features by their topological persistence, ensuring only robust, noise-resistant structures are kept. We can thus describe galaxy environment by using the distances to these DisPerSE features.</p>

<h3 id="cosmic-gnns">Cosmic GNNs</h3>

<p>Christian and I recognized that the entire simulated volume of galaxies could be represented as a single cosmic graph, and subsequently modeled via GNNs! You can see a visualization of this below (Figure 1 of <a href="https://arxiv.org/abs/2306.12327">Wu &amp; Jespersen 2023</a>).</p>

<p><img src="/images/blog/WuJespersen2023-Fig1.jpg" alt="A subgraph from the IllustrisTNG 300 simulation, where subhalos are connected on 5 Mpc scales." /></p>

<p>We used matched runs of the Illustris TNG 300 dark matter only (DMO) + hydrodynamic simulations, i.e., the DMO simulation can only form dark matter (sub)halos, whereas the hydrodynamic run begins with the same initial conditions and forms similar (sub)halos as its DMO counterpart, but also includes messy baryonic physics. This means that we can map hydrodynamic galaxy predictions using a cosmic graph constructed from DMO simulations!</p>

<p>We treat each subhalo as a node in this cosmic graph, and specify two DMO node features: the total subhalo mass (M<sub>subhalo</sub>) and the maximum circular velocity (V<sub>max</sub>).</p>

<p>To determine the graph connectivity, we imposed a constant <em>linking length</em>. Pairs of galaxies “know” about each other if they have smaller separations than the linking length, so we connect those pairs of nodes with graph edges. We also compute six edge features using the nodes’ 3D positions and 3D velocities; these edge features record the geometry of the system in an E(3) group-invariant way.</p>
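
<p>The graph construction can be sketched in a few lines of NumPy. Note the assumptions: the three edge features below (separation, relative speed, and radial relative velocity) are illustrative invariants that I picked for this example, not the six features from the paper, and I ignore periodic boundaries and use a brute-force distance matrix.</p>

```python
import numpy as np

def build_edges(pos, vel, linking_length):
    """Connect all pairs closer than linking_length; return edges and invariant features."""
    diff = pos[None, :, :] - pos[:, None, :]          # diff[i, j] = pos[j] - pos[i]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep pairs within the linking length, dropping self-loops (dist == 0).
    connected = np.logical_and(linking_length > dist, dist > 0)
    src, dst = np.nonzero(connected)
    sep = diff[src, dst]
    dvel = vel[dst] - vel[src]
    d = dist[src, dst]
    radial_speed = np.sum(sep * dvel, axis=1) / d     # relative velocity along the separation
    rel_speed = np.linalg.norm(dvel, axis=1)
    edge_index = np.stack([src, dst])                 # shape (2, n_edges)
    edge_attr = np.stack([d, rel_speed, radial_speed], axis=1)
    return edge_index, edge_attr

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 20.0, size=(100, 3))   # hypothetical subhalo positions in Mpc
vel = rng.normal(size=(100, 3))               # hypothetical velocities
edge_index, edge_attr = build_edges(pos, vel, linking_length=5.0)
```

<p>Because these features are built only from lengths and dot products of relative vectors, rotating, reflecting, or translating the whole box leaves them unchanged.</p>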

<p>As for the GNN model architecture, we use a graph network analogous to those described by <a href="https://arxiv.org/abs/1806.01261">Battaglia et al. 2018</a> that we had seen <a href="https://iopscience.iop.org/article/10.3847/1538-4357/ac8930">successfully applied in cosmology</a>. If you really want to see the code, <a href="https://github.com/jwuphysics/gnn-linking-lengths/blob/main/src/painting_galaxies.py">take a look here</a>.</p>

<h2 id="so-how-do-overdensity-disperse-and-gnns-compare">So… how do overdensity, DisPerSE, and GNNs compare?</h2>

<p>To cut to the chase: <strong>GNNs dominate the competition when it comes to predicting galaxy stellar masses from DMO simulations.</strong></p>

<p>The figure below shows how different environmental indicators, quantified over various distance scales, affect the prediction error on M<sub>star</sub>. Lower error is better, and you can clearly see how GNNs (purple) surpass all other methods once they’re given information on &gt; 1 Mpc length scales. (Figure adapted from <a href="https://ui.adsabs.harvard.edu/abs/2024ApJ...976...37W/abstract">Wu, Jespersen, &amp; Wechsler 2024</a>.)</p>

<p><img src="/images/blog/gnn-environment-performance.png" alt="Figure 2 from Wu et al. 2024, showing how different ML models achieve different prediction errors for estimating galaxy stellar mass; the GNN in purple is the best." /></p>

<p>Specifically, we compare machine learning models where no environmental data is provided (yellow), the DisPerSE cosmic web features (green), simple overdensity averaged over a given length scale (blue), and GNNs with graph connectivity on the given length scale (purple). The non-GNN models employed here are <a href="https://interpret.ml/docs/ebm.html"><em>explainable boosting machines</em> (EBMs)</a>—decision tree models that are both performant and interpretable. EBMs can receive environmental features on top of the M<sub>subhalo</sub> and V<sub>max</sub>: think of them as additional columns in a tabular dataset. We can provide EBMs with the collection of DisPerSE features, specify the overdensity on scales ranging from hundreds of kpc to 10 Mpc, or leave out environmental summary statistics altogether.</p>

<p>I want to highlight two main takeaways:</p>
<ol>
  <li><strong>Overdensity on 3 Mpc scales is the best simple environmental parameter</strong>. Excluding the GNN model, we find that an EBM with spherically averaged overdensity achieves the lowest error for stellar mass predictions. It even outperforms the DisPerSE cosmic web features!</li>
  <li><strong>GNNs are the undisputed champs</strong>. A GNN flexibly processes information on larger scales, and performance continues to improve to the largest distance scales that we test (10 Mpc).</li>
</ol>

<p>Cosmic graphs are a natural fit for the data, so it’s no surprise that they perform so well. Critically, we construct the graph such that the subhalo position and velocity information is <strong>invariant under the E(3) group action</strong>; we convert these 6D phase space coordinates into edge features. We’ve also seen hints that this method works in spatial <em>projection</em>, i.e. using 2D spatial coordinates and radial velocities (e.g., see <a href="https://arxiv.org/abs/2306.12327">Wu &amp; Jespersen 2023</a> and <a href="https://arxiv.org/abs/2411.12629">Garuda, Wu, Nelson, &amp; Pillepich 2024</a>).</p>

<p>Furthermore, the galaxy–halo connection has different characteristic length scales at different masses. Therefore, the optimality of 3 Mpc overdensity is somewhat specific to our simulation volume and subhalo mass selection. This is another reason to prefer GNNs, which can simultaneously learn the galaxy–halo–environment connection over a huge range of masses and distances.</p>

<p>Graphs adeptly model systems where individual objects are separated by relatively large scales—I mentioned this in the <a href="https://jwuphysics.github.io/blog/2025/06/graph-neural-networks-in-astrophysics/">introduction</a>. Meanwhile, much of my research has focused on extracting <em>local</em> information from galaxy systems at the pixel scale by using <a href="https://jwuphysics.github.io/tags/#computer-vision">vision models</a>. We can even combine these two representations by placing a convolutional neural network (CNN) encoder at each node, and letting the GNN process the pixel-level details in tandem with other galaxy parameters (see <a href="https://arxiv.org/abs/2407.13735">Larson, Wu, &amp; Jones 2024</a>)!</p>

<p>In summary, cosmic graphs offer a more natural and powerful way to represent the large-scale structure of the Universe than traditional methods. By using GNNs, we can effectively learn the complex relationship between a galaxy’s environment and its properties. In the future, I expect that GNNs will enable new ways to connect simulations to the observable, baryonic Universe.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Christian has also written a fantastic blog post on our papers together <a href="https://astrockragh.github.io/project/gnn_environment/">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>A subhalo is a dark matter halo that is gravitationally bound to a more massive halo. Sometimes the subhalos are called <em>satellites</em> and the most massive halo in the system is the <em>central</em> halo. The virial radius of the Milky Way’s halo is about 300 kpc, so nearby dwarf galaxies like the LMC and SMC are expected to reside in subhalos that orbit around the Milky Way halo. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Christian and I are investigating the <em>equivalence</em> of information content in galaxy assembly history and large-scale environment. Stay tuned for an upcoming paper! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="galaxies" /><category term="graphs" /><summary type="html"><![CDATA[This post discusses how graph neural networks (GNNs) can model the galaxy–halo connection within its large-scale surroundings. Dark matter structures, which seem to account for most of the mass in the Universe, can be represented as nodes in a cosmic graph. But dark matter—which solely interacts via gravitation—is also much easier to simulate than the messy baryons, whose magnetohydrodynamics are computationally expensive. By exploiting the representational power of GNNs, can we predict galaxies’ baryonic properties purely using simple dark matter-only simulations? Yes we can!]]></summary></entry><entry><title type="html">Clear Vision, Clear Communications</title><link href="https://jwuphysics.github.io/blog/2025/07/clear-vision/" rel="alternate" type="text/html" title="Clear Vision, Clear Communications" /><published>2025-07-10T00:00:00+00:00</published><updated>2025-07-10T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/07/clear-vision</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/07/clear-vision/"><![CDATA[<p><a href="https://okbjgm.weebly.com/uploads/3/1/5/0/31506003/11_laws_of_showrunning_nice_version.pdf">The Eleven Laws of Showrunning</a> by <a href="https://en.wikipedia.org/wiki/Javier_Grillo-Marxuach">Javier Grillo-Marxuach</a> is full of useful advice for management and operations. Nominally, it’s about how to deliver a television show, from ideation to writing to production to postproduction, but there’s a ton of guidance that’s surprisingly relevant for working with large language models (LLMs).</p>

<p>Check out this blurb:</p>

<blockquote>
  <p>When you’re a showrunner, it is on you to define the tone, the story, and the characters. You are NOT a curator of other people’s ideas. You are their motivator, their inspiration, and the person responsible for their implementation.</p>

  <p>Bottom line: the creativity of your staff isn’t for coming up with your core ideas for you, it’s for making your core ideas bigger and better once you’ve come up with them. To say “I’ll know it when I see it” is to abdicate the hard work of creation while hoarding the authority to declare what is or isn’t good.</p>
</blockquote>

<p>This is the number one failure mode I see for people just starting to use LLMs. Inexperienced users usually give a short, poorly specified prompt, and then hope that the LLM will read their minds and magically respond by following their <em>intent</em>, rather than what they’ve literally written in the prompt. These users are giving vague directions for some answer, and then implying “I’ll know it when I see it.” Sorry folks, AI isn’t telepathy.</p>

<p>I’ve discussed this a bit <a href="https://jwuphysics.github.io/blog/2025/04/four-ways-i-use-llms/">before</a>, but here’s another astronomy-focused example.</p>

<p>Imagine you’re excited by the new <a href="https://dp1.lsst.io/">LSST Data Preview</a>, and you want to try out some new research ideas. You ask ChatGPT “<em>What are some research questions I can answer with the new LSST data?</em>” It lists some <a href="https://chatgpt.com/share/686fa7aa-14f8-8001-957e-080d6a67a8be">generic high-level stuff</a> that’s probably summarized from some old white papers. You think to yourself, <em>Wait, I don’t care about all these topics, I just wanted topics in galaxy evolution. And maybe using machine learning. Oh and also I don’t care about predicting photo-zs better, everybody has already been trying to do that for decades. Oh yeah and only use data that can be crossmatched with these value-added catalogs.</em> This is probably going to be a back-and-forth process, wasting your time, polluting the LLM context, and probably leaving you frustrated and without any good research ideas.</p>

<p>Let me propose a better alternative. Spend 5 minutes thinking about the essence, the specifics of what you’re looking for. You can jot down your prior research ideas, examples of research topics you don’t care about, extra information that you know is relevant but the LLM might not index on. Think of it as <em>building a trellis</em>, upon which the LLM can expand outward and fill inward. Here’s a <a href="https://chatgpt.com/share/686fa727-6a84-8001-80b6-b215f191de42">more fruitful example</a> of how I’d converse with ChatGPT.</p>

<p>When working with LLMs, it is on <strong>you</strong> to define the tone, the core ideas, the new insights. Carefully crafting and communicating this vision is a foundational skill, useful for personal brainstorming or managing an academic research group — it certainly goes beyond just LLM prompting or showrunning!</p>

<hr />

<p>Thanks to <a href="https://simonwillison.net/2019/Feb/19/eleven-laws-showrunning/">Simon Willison’s blog</a> — that’s where I first heard about this.</p>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="advice" /><category term="llms" /><summary type="html"><![CDATA[The Eleven Laws of Showrunning by Javier Grillo-Marxuach is full of useful advice for management and operations. Nominally, it’s about how to deliver a television show, from ideation to writing to production to postproduction, but there’s a ton of guidance that’s surprisingly relevant for working with large language models (LLMs).]]></summary></entry><entry><title type="html">Worlds we impose</title><link href="https://jwuphysics.github.io/blog/2025/06/worlds-we-impose/" rel="alternate" type="text/html" title="Worlds we impose" /><published>2025-06-18T00:00:00+00:00</published><updated>2025-06-18T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/06/worlds-we-impose</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/06/worlds-we-impose/"><![CDATA[<p>In the book <a href="https://en.wikipedia.org/wiki/Impro:_Improvisation_and_the_Theatre"><strong>Impro: Improvisation and the Theatre</strong></a>, Keith Johnstone recounts a moment between a teacher and a special needs student. The teacher holds up a flower and says, “Look at the pretty flower.” The girl responds, “All of the flowers are beautiful.” Then the teacher gently says, “but this flower is especially beautiful.” The girl proceeds to scream and thrash about violently.</p>

<p>The way Johnstone characterized this interaction surprised me:</p>

<blockquote>
  <p>In the gentlest possible way, this teacher had been very violent. She was insisting on categorising, and on selecting. Actually it is crazy to insist that one flower is especially beautiful in a whole garden of flowers, but the teacher is allowed to do this, and is not perceived by sane people as violent. Grown-ups are expected to distort the perceptions of the child in this way. Since then I’ve noticed such behaviour constantly, but it took the mad girl to open my eyes to it.</p>
</blockquote>

<p>Basically, to reject another’s world is violence. Even if done in a “gentle” way (like this teacher had done), it’s still an act of violence.</p>

<p>As a father of two, I often have to resist this urge to impose my world, my perspective, upon my kids. My daughter sees something she wants to share with me, but I instinctively want to respond by reshaping it into my perspective. Or convert it into some teaching moment, to insist on some fragment of my reality. But such a response to their <a href="https://www.gottman.com/blog/want-to-improve-your-relationship-start-paying-more-attention-to-bids/">bid for attention</a> is what Johnstone calls “blocking”, and he discusses it at length throughout the book.</p>

<p>This has been on my mind because I practice it daily now. If you use large language models (LLMs), then you probably do as well.</p>

<p>In order to actually get any value out of your interactions with an LLM, you need to construct its world, e.g. by providing context, constraints, and specific objectives. Prompting (or <a href="https://x.com/tobi/status/1935533422589399127">context engineering</a>) is that “violent imposition”: pushing your reality onto the machine.</p>

<p>It’s not true to say that all such interactions are violent in this way. Parents tell their kids not to run into traffic. We teach them knowledge and skills that might broaden their world. The AI safety community seeks to align LLMs with human values. It’s not a bad thing to provide guidance. And again, skilled prompting is necessary to get any utility from LLMs.</p>

<p>However, I’m quite concerned about what this practice does to our <em>own</em> psyches. What happens when you spend hours each day reformatting the world context of an LLM, which can never resist? The way that AI generally interacts is to comply with whatever you say (or at least attempt to do so).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Real life is never this frictionless! And it shouldn’t be… each person has their own perspectives, and <em>most</em> people aren’t thrilled about having a worldview imposed upon them.</p>

<p>What happens when we get too good at making LLMs see things our way? I’m guessing that it’ll make us even more siloed or unwilling to change our perspectives (even more than what social media has already done).</p>

<p>The equivalent of <em>touching grass</em> in this case is to spend some conscious effort <em>not</em> imposing our worlds on others. Maybe even LLMs too! After all, improv<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> is all about accepting what your partner gives you and building on it.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Also, gross <a href="https://openai.com/index/sycophancy-in-gpt-4o/">sycophancy</a>… and it looks like the latest version of Gemini 2.5 Pro is <a href="https://thezvi.substack.com/i/165786957/general-reactions-to-gemini-pro">falling into this same trap</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I should probably add the caveat that I’ve never done improv, but it’s on my bucket list! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="advice" /><category term="llms" /><category term="personal" /><category term="philosophical" /><summary type="html"><![CDATA[In the book Impro: Improvisation and the Theatre, Keith Johnstone recounts a moment between a teacher and a special needs student. The teacher holds up a flower and says, “Look at the pretty flower.” The girl responds, “All of the flowers are beautiful.” Then the teacher gently says, “but this flower is especially beautiful.” The girl proceeds to scream and thrash about violently.]]></summary></entry><entry><title type="html">Graph neural networks in astrophysics</title><link href="https://jwuphysics.github.io/blog/2025/06/graph-neural-networks-in-astrophysics/" rel="alternate" type="text/html" title="Graph neural networks in astrophysics" /><published>2025-06-09T00:00:00+00:00</published><updated>2025-06-09T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/06/graph-neural-networks-for-astrophysics</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/06/graph-neural-networks-in-astrophysics/"><![CDATA[<p>Many physical phenomena exhibit relational inductive biases and can be represented as mathematical graphs. In recent years, graph neural networks (GNNs) have been successfully used to model and learn from astronomical data. This post provides an introductory review to GNNs for astrophysics.</p>

<p class="notice">This is the first few sections of an invited review article that’s been sitting around for far too long…</p>

<h2 id="introduction">Introduction</h2>

<p>Machine learning algorithms have become increasingly popular for analyzing astronomical data sets. In recent years, astronomy’s wealth of data has engendered the development of new and specialized techniques. Many algorithms can learn relationships from catalogued (or tabular) data sets. Vision methods have been adopted across astronomy, e.g., through the use of convolutional neural networks (CNNs) for pixel-level data such as images or data cubes. Time series data sets can be represented using recurrent neural networks or attention-based models. Recently, simulation-based inference and generative models have also become commonplace for solving complex inverse problems and sampling from an implicit likelihood function. I don’t cover these topics here, as other reviews have surveyed the rise of <a href="https://ui.adsabs.harvard.edu/abs/2023mlpa.book.....A/abstract">ML applications throughout astronomy</a>, deep learning for <a href="https://ui.adsabs.harvard.edu/abs/2023PASA...40....1H/abstract">galaxy astrophysics</a>, and deep learning for <a href="https://ui.adsabs.harvard.edu/abs/2022arXiv220308056D/abstract">cosmology</a>.</p>

<h2 id="inductive-biases-of-physics-problems">Inductive biases of physics problems</h2>
<p>Because astronomical data can be structured in various ways, certain model representations are better suited for certain problems. This representational power is tied to the <em>inductive bias</em> of the problem. Multi-Layer Perceptrons (MLPs) or decision tree-based methods operate well on catalog-based data or unordered sets; that is, the permutation of rows or examples does not matter, and the features are treated independently. A CNN is well-suited for data on some kind of pixel or voxel grid; here the features are correlated with each other and have some notion of distance. Graphs are able to represent relationships between entities. See reviews on GNNs, e.g. by <a href="https://arxiv.org/abs/1806.01261">Battaglia et al. (2018)</a>, <a href="https://link.springer.com/book/10.1007/978-3-031-01588-5">Hamilton (2020)</a>, <a href="https://arxiv.org/abs/2104.13478">Bronstein et al. (2021)</a>, and <a href="https://www.nature.com/articles/s43586-024-00294-7">Corso et al. (2024)</a>, just to name a few.</p>

<h2 id="what-are-gnns">What are GNNs?</h2>

<p>Graphs are well-suited for representing entities and relationships between them; for example, a “ball and stick” model of a molecule represents atoms as nodes and bonds as edges on a mathematical graph. Another example is a social graph, where people, businesses, and events are different types of nodes, and interactions between these entities (e.g. mutual friends or event attendees) are edges on the social graph. In addition to the connective structure of the graph, nodes and edges can also be endowed with features. For the molecular graph, node features may comprise positions, atomic weight, electronegativity, and so on.</p>

<p>Because graphs are very general structures, they can offer tremendous flexibility for representing astronomical phenomena. Importantly, they also exhibit <strong>relational inductive biases</strong> (e.g., <a href="https://arxiv.org/abs/1806.01261">Battaglia et al. 2018</a>). Objects that are well-separated from each other are most naturally suited to reside on graph nodes. For example, a galaxy cluster can readily conform to a graph structure: galaxies can be represented as nodes, while interactions between pairs of galaxies (such as gravity, tidal forces, ram pressure, to name a few) can be represented as edges. The circumgalactic medium may be more challenging to represent as a graph, as there exists a continuum of gas densities in multiple phases, each with potentially different lifetimes, making it difficult to draw the line between individual clouds.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>A graph neural network (GNN) is a machine learning model that can be optimized to learn representations and make predictions on graphs. In this post, I highlight current and future astrophysical applications of GNNs.</p>

<h2 id="constructing-graphs-from-astronomical-data">Constructing graphs from astronomical data</h2>

<p>Before applying a GNN, we’ll need to first construct a graph from our data. The choice of how to define nodes and edges also determines how you might model the data via GNNs. In general, point clouds can be easily represented as nodes on a graph. Objects that are small relative to inter-object separations are natural candidates for nodes, like galaxies, subhalos, stars, or star clusters. The edges, which represent relationships or interactions, can be defined in several ways:</p>
<ul>
  <li><em>k</em>-Nearest Neighbors (k-NN): An edge is drawn from a node to its <em>k</em> closest neighbors in physical or feature space. This method ensures that every node has the same number of connections (degree), which can be useful for batching data on a GPU.</li>
  <li>Radius-based: An edge is drawn between all nodes separated by a distance less than a chosen radius <em>r</em>. This is a common choice for representing physical interactions that have a characteristic length scale. Unlike k-NN, this method results in a variable number of connections per node.</li>
  <li>Dynamically: Edges can also be learned dynamically by the model itself, for example, by using an attention mechanism to weight the importance of connections between nodes.</li>
</ul>

<p>The choice of graph construction method imposes a strong prior on the model, and the best choice will depend on the problem.</p>
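
<p>To make the first two options concrete, here is a minimal NumPy sketch of k-NN and radius-based edge construction. The function names (<code>knn_edges</code>, <code>radius_edges</code>), the random toy positions, and the brute-force distance matrix are all my own illustrative assumptions; for large catalogs you would likely want a tree-based neighbor search instead.</p>

```python
import numpy as np

def pairwise_distances(positions):
    """Return the (N, N) matrix of Euclidean distances between points."""
    diff = positions[:, None, :] - positions[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def knn_edges(positions, k):
    """Directed edges, shape (2, N*k), from each node to its k nearest neighbors."""
    dist = pairwise_distances(positions)
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    neighbors = np.argsort(dist, axis=1)[:, :k]
    senders = np.repeat(np.arange(len(positions)), k)
    return np.stack([senders, neighbors.ravel()])

def radius_edges(positions, r):
    """Directed edges between all distinct pairs separated by less than r."""
    dist = pairwise_distances(positions)
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    senders, receivers = np.nonzero(dist < r)
    return np.stack([senders, receivers])

rng = np.random.default_rng(0)
positions = rng.uniform(size=(50, 3))  # hypothetical toy "galaxy" positions
knn = knn_edges(positions, k=5)
rad = radius_edges(positions, r=0.3)

# Out-degree per node: constant for k-NN, variable for radius-based graphs.
knn_degree = np.bincount(knn[0], minlength=50)
rad_degree = np.bincount(rad[0], minlength=50)
```

<p>Note that the k-NN graph built this way is directed (node A may be among B’s nearest neighbors without the reverse being true), whereas the radius graph is symmetric by construction.</p>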

<h2 id="a-primer-on-mathematical-graphs">A primer on mathematical graphs</h2>

<p>A graph with \(N\) nodes can be fully described by its adjacency matrix, \(\mathbf{A}\), a square \(N \times N\) matrix that describes how nodes are connected. If an edge connects node \(i\) to node \(j\), then element \(A_{ij}\) has a value of 1; otherwise it is 0. Physical systems are often approximately described by sparse graphs, where the number of edges \(M \ll N(N-1)/2\). This approximation holds if, for example, interactions or correlations between nodes fall off rapidly with distance. A sparse adjacency matrix can also be efficiently represented using a \(2 \times M\) matrix of edge indices. The graph \(\mathcal{G}\) may contain node features \(\mathbf{X}\) and edge features \(\mathbf{E}\), where</p>

\[\mathbf{X} = \begin{pmatrix}
    x_1^\top \\
    \cdots \\
    x_N^\top
\end{pmatrix}
\quad {\rm and} \quad
\mathbf{E} = \begin{pmatrix}
    e_1^\top \\
    \cdots \\
    e_M^\top
\end{pmatrix}.\]
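
<p>As a small illustration of these data structures, the sketch below builds a dense adjacency matrix for a toy four-node chain, converts it to a \(2 \times M\) edge index, and allocates feature matrices with one row per node and per edge. The specific graph and feature values are made up for the example.</p>

```python
import numpy as np

# Toy graph with N = 4 nodes and M = 3 undirected edges: 0-1, 1-2, 2-3.
N = 4
A = np.zeros((N, N), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

# Sparse form: list each undirected edge once (upper triangle), shape (2, M).
edge_index = np.stack(np.nonzero(np.triu(A)))
M = edge_index.shape[1]

# Feature matrices: one row per node and one row per edge.
X = np.arange(N * 2, dtype=float).reshape(N, 2)  # 2 features per node
E = np.ones((M, 1))                              # 1 feature per edge

# Rebuild the dense adjacency matrix from the edge list.
A_rebuilt = np.zeros_like(A)
A_rebuilt[edge_index[0], edge_index[1]] = 1
A_rebuilt = A_rebuilt + A_rebuilt.T
```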

<p>Graphs have several characteristics that make them attractive for representing astrophysical concepts. Graph nodes have no preferred ordering, so the operation of a permutation matrix \(\mathbf{P}\) should yield the same graph as before. Critically, models that act on graphs (or sets; <a href="https://arxiv.org/abs/1703.06114">Zaheer et al. 2017</a>) can also be made invariant or equivariant to permutations. A permutation-invariant function \(f\) must obey</p>

\[f(\mathbf{X}, \mathbf{A}) = f(\mathbf{PX}, \mathbf{PAP^\top}),\]

<p>while a permutation-equivariant function \(F\) must obey</p>

\[\mathbf{P} F(\mathbf{X}, \mathbf{A}) = F(\mathbf{PX}, \mathbf{PAP^\top}).\]

<p>Note that the indices of the edge features are implicitly re-ordered if the permutation operation acts on the adjacency matrix.</p>
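
<p>These two properties are easy to check numerically. In the sketch below, I assume a deliberately simple equivariant function, \(F(\mathbf{X}, \mathbf{A}) = \mathbf{AX}\) (each node sums its neighbors’ features), and an invariant function \(f\) that pools \(F\) over all nodes; both are illustrative stand-ins rather than a real GNN.</p>

```python
import numpy as np

def F(X, A):
    """Equivariant: each node sums its neighbors' features."""
    return A @ X

def f(X, A):
    """Invariant: pool the equivariant output over all nodes."""
    return F(X, A).sum(axis=0)

rng = np.random.default_rng(1)
N = 6
X = rng.normal(size=(N, 3))

# Random symmetric adjacency matrix without self-loops.
A = (rng.uniform(size=(N, N)) < 0.4).astype(float)
A = np.triu(A, k=1)
A = A + A.T

P = np.eye(N)[rng.permutation(N)]  # random permutation matrix

invariant_ok = np.allclose(f(X, A), f(P @ X, P @ A @ P.T))
equivariant_ok = np.allclose(P @ F(X, A), F(P @ X, P @ A @ P.T))
```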

<h2 id="invariant-and-equivariant-models">Invariant and equivariant models</h2>

<p>As discussed above, GNNs can be made invariant or equivariant to the re-ordering of nodes. This reflects a symmetry in the system, as the permutation operator leaves the graph unchanged. Additional symmetries can be imposed on graphs and GNNs; for example, recent works have developed graph models that are invariant or equivariant to rotations and translations in \(3\) or \(N\) dimensions (e.g., <a href="https://arxiv.org/abs/1612.08498">Cohen &amp; Welling 2016</a>, <a href="https://arxiv.org/abs/1802.08219">Thomas et al. 2018</a>, <a href="https://arxiv.org/abs/2006.10503">Fuchs et al. 2020</a>, <a href="https://arxiv.org/abs/2102.09844">Satorras et al. 2021</a>). The subfield of symmetries and representations in machine learning is sometimes called geometric deep learning, and there are far more detailed reviews offered by <a href="https://arxiv.org/abs/2104.13478">Bronstein et al. (2021)</a> or <a href="https://arxiv.org/abs/2105.13926">Gerken et al. (2021)</a>.</p>

<p>Notwithstanding the far superior review articles mentioned above, I still want to briefly discuss the benefits of leveraging symmetries in astrophysics. While modern ML has demonstrated that effective features and interactions can be learned directly from data, imposing physical symmetries as constraints can vastly reduce the “search space” for this learning task. Perhaps the simplest way to respect these symmetries is to use only scalar representations: a simple and powerful way to build invariant models is by contracting all vector or tensor features into scalars (e.g., dot products) at the input layer, as discussed in <a href="https://arxiv.org/abs/2106.06610">Villar et al. (2021)</a>. Nonetheless, models that preserve higher-order internal representations can be more data-efficient, i.e., they can learn from fewer examples (<a href="https://arxiv.org/abs/2207.09453">Geiger &amp; Smidt 2022</a>).</p>
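
<p>Here is a quick sketch of the scalar-contraction idea, assuming each node carries a 3D position and velocity vector (my choice of features for illustration). Dot products between these vectors are unchanged by a global rotation, so any model that only ever sees them is rotation-invariant by construction.</p>

```python
import numpy as np

def random_rotation(rng):
    """Random 3D rotation matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R))  # fix the sign ambiguity of each column
    if np.linalg.det(Q) < 0:     # ensure a proper rotation (det = +1)
        Q[:, 0] = -Q[:, 0]
    return Q

def scalar_features(positions, velocities):
    """Contract per-node vectors into rotation-invariant scalars via dot products."""
    return np.stack([
        np.einsum("ij,ij->i", positions, positions),    # |r|^2
        np.einsum("ij,ij->i", velocities, velocities),  # |v|^2
        np.einsum("ij,ij->i", positions, velocities),   # r . v
    ], axis=1)

rng = np.random.default_rng(2)
positions = rng.normal(size=(10, 3))
velocities = rng.normal(size=(10, 3))
rot = random_rotation(rng)

before = scalar_features(positions, velocities)
after = scalar_features(positions @ rot.T, velocities @ rot.T)
features_unchanged = np.allclose(before, after)
```

<p>These particular scalars are not translation-invariant; for that, you could contract relative positions between pairs of nodes instead.</p>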

<p>Other popular models in ML are already exploiting many of these symmetries. Indeed, CNNs, which are commonly used for image data, and transformers, commonly used for text data, can both be considered special cases of GNNs. For example, a convolution layer operates on a graph that is represented on a grid; node features are the pixel values for each color channel, while linear functions over a constant (square) neighborhood represent the convolution operator. CNNs can learn (locally) translation-invariant features, although this invariance is broken if the CNN unravels its feature maps and passes them to a final MLP.</p>
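
<p>To see the correspondence in the simplest possible setting, the sketch below treats a 1D signal as a chain graph: each pixel is a node, its neighbors are the pixels within the kernel’s footprint, and the “message” from neighbor \(j\) is the pixel value weighted by a kernel entry that depends only on the relative offset \(j - i\). The function name is hypothetical, and I compare against <code>np.correlate</code> because deep learning “convolutions” are cross-correlations.</p>

```python
import numpy as np

def conv1d_as_message_passing(signal, kernel):
    """Sum-aggregate neighbor values, weighted by their offset from node i."""
    n = len(signal)
    half = len(kernel) // 2
    out = np.zeros(n)
    for i in range(n):                            # loop over nodes (pixels)
        for offset in range(-half, half + 1):     # loop over grid neighbors
            j = i + offset
            if 0 <= j < n:                        # missing neighbors act as zero padding
                out[i] += kernel[offset + half] * signal[j]
    return out

signal = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
kernel = np.array([0.2, 0.5, 0.3])

graph_out = conv1d_as_message_passing(signal, kernel)
reference = np.correlate(signal, kernel, mode="same")
matches = np.allclose(graph_out, reference)
```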

<h2 id="a-simple-gnn-that-makes-node-level-predictions">A simple GNN that makes node-level predictions</h2>

<p><img src="/images/blog/example-gnn.png" alt="Example diagram of a GNN" />
Caption: Example of a simple GNN layer that makes node-level predictions. Node features \(x_i\), neighboring node features \(x_j\), and edge features \(e_{ij}\) are fed into a learnable function, \(\phi\), which outputs a hidden edge state \(\varepsilon_{ij}\). All edge states \(\varepsilon_{ij}\) that connect to node \(i\) are aggregated through \(\oplus_j\), a permutation-invariant aggregation function, and the concatenation of its output and the original node features are fed into another learnable function, \(\psi\), which finally outputs predictions at each node \(i\).</p>

<p>Here, we’ll briefly describe the simple GNN illustrated in the above figure. This general structure is often referred to as a <strong>message-passing</strong> framework. Let’s focus on predictions that will be made on node \(i\). For each neighboring index \(j\), we feed neighboring node features \(x_j\), edge features \(e_{ij}\), and the input node features \(x_i\) into a function \(\phi\) that produces a “message” or edge hidden state \(\varepsilon_{ij}\):</p>

\[\varepsilon_{ij} = \phi(x_i, x_j, e_{ij}).\]

<p>\(\phi\) is a function with shared weights across all \(ij\), and it is parameterized by learnable weights and biases. In practice, \(\phi\) usually takes the form of an MLP with non-linear activations and normalization layers.</p>

<p>An aggregation function \(\oplus_j\) operates on all edge hidden states \(\varepsilon_{ij}\) that connect to node \(i\), i.e., it pools over all neighbors \(j\). Common examples of the aggregation function include sum pooling, mean pooling, max pooling, or even a concatenated list of the above pooling functions. Crucially, the aggregation function must be permutation invariant in order for the GNN to remain permutation invariant.</p>

<p>The function \(\psi\) receives the aggregated messages back at node \(i\), as well as the node’s own features \(x_i\), in order to “update” the node’s state and make predictions:
\(y_i = \psi \left (x_i, \oplus_j(\varepsilon_{ij}) \right).\)
Similar to \(\phi\), \(\psi\) can be parameterized using an MLP or any other learnable function, so long as the parameters are shared across all nodes and training examples.</p>
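
<p>Putting the three pieces together, a single message-passing layer might look like the NumPy sketch below. Everything here is an illustrative assumption: the tiny randomly initialized MLPs standing in for \(\phi\) and \(\psi\), the choice of sum aggregation for \(\oplus_j\), and the convention that the first row of <code>edge_index</code> holds the receiving node \(i\) while the second row holds its neighbor \(j\). No training is performed.</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def mlp(sizes):
    """Return a list of randomly initialized (weight, bias) pairs."""
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def apply_mlp(params, x):
    """Apply the MLP with ReLU activations on all but the last layer."""
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def gnn_layer(X, E, edge_index, phi, psi):
    """y_i = psi(x_i, sum_j phi(x_i, x_j, e_ij)) using sum aggregation."""
    i, j = edge_index  # receiver i, neighbor j
    messages = apply_mlp(phi, np.concatenate([X[i], X[j], E], axis=1))
    aggregated = np.zeros((X.shape[0], messages.shape[1]))
    np.add.at(aggregated, i, messages)  # permutation-invariant sum over neighbors
    return apply_mlp(psi, np.concatenate([X, aggregated], axis=1))

N, d_node, d_edge, d_hidden, d_out = 5, 4, 2, 8, 1
X = rng.normal(size=(N, d_node))
edge_index = np.array([[0, 1, 1, 2, 3, 4],
                       [1, 0, 2, 1, 4, 3]])
E = rng.normal(size=(edge_index.shape[1], d_edge))

phi = mlp([2 * d_node + d_edge, 16, d_hidden])  # message function
psi = mlp([d_node + d_hidden, 16, d_out])       # node update function
y = gnn_layer(X, E, edge_index, phi, psi)       # one prediction per node
```

<p>Because the messages are summed, shuffling the order in which the edges are listed leaves <code>y</code> unchanged.</p>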

<p>Although we described just one example of a GNN layer, it serves to illustrate how different kinds of features may interact. Many other alternatives are possible (see, e.g., <a href="https://arxiv.org/abs/1612.00222">Battaglia et al. 2016</a>, <a href="https://arxiv.org/abs/1806.01261">2018</a>). It is possible to have graph-level features or hidden states that simultaneously act on all node or edge hidden states. Additionally, predictions can be made for the entire graph or on edges rather than on nodes, and likewise, other aggregation patterns are possible.</p>

<h2 id="prediction-tasks-on-graphs">Prediction tasks on graphs</h2>
<p>GNNs are versatile and can be adapted for various prediction tasks depending on the scientific question:</p>
<ul>
  <li>Node-level tasks: These tasks involve making a prediction for each node in the graph. For example, predicting the stellar mass of a galaxy (node) based on its properties and the properties of its neighbors. The model output is a vector of predictions, one for each node.</li>
  <li>Edge-level tasks: These tasks focus on the relationships between nodes. An example would be predicting whether two dark matter halos will merge, where the prediction is made for each edge connecting two halos.</li>
  <li>Graph-level tasks: These tasks involve making a single prediction for the entire graph. For instance, predicting the total mass (e.g., \(M_{200}\)) of a galaxy cluster (the graph) based on the properties and arrangement of its member galaxies. This usually involves an additional “readout” or “pooling” step that aggregates information from all nodes and edges into a single feature vector before making the final prediction.</li>
</ul>
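
<p>The three task types differ mainly in what the final prediction head consumes. The sketch below assumes we already have a matrix of node hidden states <code>H</code> from some GNN, and uses random linear heads (purely illustrative, untrained) to show the output shape for each task.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 6, 8
H = rng.normal(size=(N, d))  # hypothetical hidden state for each node
edge_index = np.array([[0, 1, 2, 3, 4],
                       [1, 2, 3, 4, 5]])

w_node = rng.normal(size=d)
w_edge = rng.normal(size=2 * d)
w_graph = rng.normal(size=d)

# Node-level: one prediction per node.
node_pred = H @ w_node

# Edge-level: one prediction per edge, from the states of its two endpoints.
senders, receivers = edge_index
edge_pred = np.concatenate([H[senders], H[receivers]], axis=1) @ w_edge

# Graph-level: pool over nodes (the "readout"), then predict once.
graph_pred = H.mean(axis=0) @ w_graph
```

<p>Mean pooling is only one choice of readout; sum or max pooling would work the same way, and all three keep the graph-level prediction independent of node ordering.</p>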

<p>The one-layer GNN described above can be extended in two different ways: (<em>i</em>) multiple versions of the learnable functions with unshared weights can be learned in parallel, and (<em>ii</em>) multiple GNN layers can be stacked on top of each other in order to make a deeper network. We now consider \(u \in \{1, 2, \cdots, U\}\) unshared layers, and \(\ell \in \{1, 2, \cdots, L\}\) stacked layers. For convenience, we also rewrite \(x_i\) as \(\xi_i^{(u, 0)}\), \(x_j\) as \(\xi_j^{(u, 0)}\), and \(e_{ij}\) as \(\varepsilon_{ij}^{(u, 0)}\), where the same input features are used for all \(u\). (Note that the node and edge input features may have different dimensions than the node and edge hidden states.) With this updated nomenclature, each unshared layer produces a different set of edge states:</p>

\[\varepsilon^{(u,\ell)}_{ij} = \phi^{(u,\ell)}\left (\xi_i^{(u,\ell-1)},\xi_j^{(u,\ell-1)},\varepsilon_{ij}^{(u,\ell-1)}\right ),\]

<p>which are aggregated and fed into \(\psi^{(u,\ell)}\) to produce multiple node-level outputs:</p>

\[\xi_i^{(u,\ell)} = \psi^{(u,\ell)}\left (\xi_i^{(u, \ell-1)}, \oplus_j^{(u,\ell)}\left(\varepsilon^{(u,\ell)}_{ij}\right )\right ).\]

<p>The extended GNN can have a final learnable function \(\rho\) that makes node-level predictions from the concatenated hidden states:</p>

\[y_i = \rho\left (\xi_i^{(1,L)}, \xi_i^{(2,L)}, \cdots, \xi_i^{(U,L)}\right).\]
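
<p>As a rough sketch of this extended architecture, the code below runs \(U\) independent streams, each a stack of \(L\) layers, and concatenates the final node states before applying \(\rho\). I am assuming that each layer aggregates the edge states it has just computed (as in the single-layer example), that hidden states keep the same dimension as the inputs, and that \(\phi\), \(\psi\), and \(\rho\) are single linear maps (with a tanh for \(\phi\) and \(\psi\)); these simplifications are mine, chosen to keep the example short.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, U, L = 5, 4, 3, 2
edge_index = np.array([[0, 1, 1, 2, 3, 4],
                       [1, 0, 2, 1, 4, 3]])
i, j = edge_index  # receiver i, neighbor j
X = rng.normal(size=(N, d))                    # input node features
E = rng.normal(size=(edge_index.shape[1], d))  # input edge features

def layer(xi, eps, W_phi, W_psi):
    """One (u, l) layer: update edge states, sum them per node, update nodes."""
    eps_new = np.tanh(np.concatenate([xi[i], xi[j], eps], axis=1) @ W_phi)
    agg = np.zeros_like(xi)
    np.add.at(agg, i, eps_new)  # sum aggregation over neighbors j
    xi_new = np.tanh(np.concatenate([xi, agg], axis=1) @ W_psi)
    return xi_new, eps_new

streams = []
for u in range(U):
    xi, eps = X, E  # every stream starts from the same input features
    for l in range(L):
        W_phi = rng.normal(scale=0.3, size=(3 * d, d))  # unshared weights per (u, l)
        W_psi = rng.normal(scale=0.3, size=(2 * d, d))
        xi, eps = layer(xi, eps, W_phi, W_psi)
    streams.append(xi)

W_rho = rng.normal(scale=0.3, size=(U * d, 1))
y = np.concatenate(streams, axis=1) @ W_rho  # rho acts on the concatenated states
```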

<h2 id="a-connection-to-multi-headed-attention">A connection to multi-headed attention</h2>

<p>Another way to express this is to let \(\mathbf{h}_i^{(\ell)}\) denote the feature vector of node \(i\) at layer \(\ell\). Assuming that we aggregate the outputs of all of the unshared layers at each \(\ell\), then \(\mathbf{h}_i^{(\ell)} = \oplus_u\left(\xi_i^{(u,\ell)}\right)\). In that case, the input is \(\mathbf{h}_i^{(0)} = x_i\), and each layer in a stack of \(L\) layers is then:</p>

\[\mathbf{h}_i^{(\ell+1)} = \text{GNN-Layer}^{(\ell)} \left(\mathbf{h}_i^{(\ell)}, \left\{ \mathbf{h}_j^{(\ell)}, \mathbf{e}_{ij} \mid j \in \mathcal{N}(i) \right\} \right).\]

<p>Within any single GNN layer, we can learn \(U\) different message functions in parallel — this is just like <strong>multi-headed attention</strong> (see <a href="https://arxiv.org/abs/1710.10903">Veličković et al. 2017</a>)! The outputs of these multiple heads \(\phi^{(1)}, \phi^{(2)}, \cdots, \phi^{(U)}\) can be concatenated (or aggregated) before the final node update:
\(\text{final_features}_i = \text{CONCAT}\left[ \bigoplus_j \phi^{(1)}(...), \bigoplus_j \phi^{(2)}(...), \dots \right].\)</p>

<p>Once we’ve extracted this final set of features, we can then pass it through a final learnable function \(\rho\) in order to make predictions.</p>
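
<p>The multi-head pattern can be sketched in a few lines. Here each head is a separate message function followed by a sum over neighbors, and the head outputs are concatenated into the final feature vector. Note that this shows only the parallel-heads structure: there are no softmax attention weights, the heads and \(\rho\) are untrained random linear maps, and all dimensions are arbitrary choices for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, d_edge, d_head, U = 5, 4, 2, 3, 4
edge_index = np.array([[0, 1, 1, 2, 3, 4],
                       [1, 0, 2, 1, 4, 3]])
i, j = edge_index  # receiver i, neighbor j
H = rng.normal(size=(N, d))                         # node features at this layer
E = rng.normal(size=(edge_index.shape[1], d_edge))  # edge features

def head(W):
    """One head: apply a message function, then sum over neighbors j."""
    messages = np.tanh(np.concatenate([H[i], H[j], E], axis=1) @ W)
    out = np.zeros((N, d_head))
    np.add.at(out, i, messages)
    return out

# U heads with unshared weights, evaluated independently of one another.
head_weights = [rng.normal(scale=0.3, size=(2 * d + d_edge, d_head))
                for _ in range(U)]
final_features = np.concatenate([head(W) for W in head_weights], axis=1)

W_rho = rng.normal(scale=0.3, size=(U * d_head, 1))
y = final_features @ W_rho  # final learnable function rho (linear here)
```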

<h2 id="summary">Summary</h2>

<p>Graph neural networks (GNNs) provide a powerful and remarkably intuitive way to model astrophysical systems. By treating objects like galaxies and subhalos as nodes on a graph, we can leverage their physical relationships as edges, making it easier to build models that respect the fundamental symmetries of the problem.</p>

<p>I’ve written this post as a rather general introduction, but real examples can probably paint a clearer picture of how GNNs work. In an upcoming blog post, I’ll highlight some of my own work using these methods to learn the physical connection between galaxies, their subhalos, and their cosmic surroundings. Stay tuned, but if you can’t wait, then you can check out those papers <a href="https://arxiv.org/abs/2306.12327">here</a> and <a href="https://arxiv.org/abs/2402.07995">here</a>!</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Note, however, that even complex gas dynamics may still be modeled using GNNs. For example, <a href="https://www.science.org/doi/10.1126/science.adi2336">Lam et al. 2023</a> have successfully represented meteorological data on a polygon mesh, a specific type of graph, which enables them to leverage GNNs for weather forecasting. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="galaxies" /><category term="graphs" /><category term="review" /><category term="tutorial" /><summary type="html"><![CDATA[Many physical phenomena exhibit relational inductive biases and can be represented as mathematical graphs. In recent years, graph neural networks (GNNs) have been successfully used to model and learn from astronomical data. This post provides an introductory review to GNNs for astrophysics.]]></summary></entry><entry><title type="html">What a month of blog analytics taught me about social media platforms</title><link href="https://jwuphysics.github.io/blog/2025/06/blogging-social-media/" rel="alternate" type="text/html" title="What a month of blog analytics taught me about social media platforms" /><published>2025-06-02T00:00:00+00:00</published><updated>2025-06-02T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/06/blogging-social-media</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/06/blogging-social-media/"><![CDATA[<p>If you’re a blogger or researcher sharing your work online, you’ve probably wondered: is social media actually useful for disseminating your writing? I’ve been asking myself this question since <a href="https://jwuphysics.github.io/blog/2025/04/hello-world-again/">returning to blogging</a> just over a month ago.</p>

<p>So I decided to check some analytics. Since late April, I’ve been tracking where my blog readers actually come from when I share posts across different platforms. I share my results from this 30-day snapshot (late April through late May) below.</p>

<h2 id="the-data-where-readers-actually-come-from">The data: where readers actually come from</h2>

<p>Before diving into the numbers, a quick note on methodology. I use <a href="https://simpleanalytics.com/">SimpleAnalytics</a> because it respects visitor privacy (e.g., it respects blockers or “do not track” browser signals). This means some traffic sources might go untracked if users have strict privacy settings, but it gives us a decent view of the platforms that are actually driving traffic to my posts.</p>

<p>Over the past month, I’ve shared each new blog post consistently across three platforms: <a href="https://twitter.com/jwuphysics">Twitter/X</a>, <a href="https://bsky.app/profile/jwuphysics.bsky.social">Bluesky</a>, and <a href="https://www.linkedin.com/in/jwuphysics/">LinkedIn</a>. Most of these shares are simply a sentence or a copy+paste of the front matter of the blog post, sometimes with a screenshot of the post from my laptop or my phone. I’ve also been posting at random times (basically whenever a post is completed).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Okay, this isn’t really a rigorous scientific experiment… whatever.</p>

<p>When I examined the referral data after excluding direct links and other sources (a pretty large fraction of results), the distribution was a bit surprising:</p>
<ul>
  <li><strong>LinkedIn: 51.4%</strong></li>
  <li><strong>Bluesky: 26.0%</strong></li>
  <li><strong>Twitter/X: 22.7%</strong></li>
</ul>

<h2 id="twitterx-when-reach-doesnt-translate-to-readership">Twitter/X: When reach doesn’t translate to readership</h2>

<p>Twitter’s poor performance in driving actual blog readership is particularly pathetic when you consider the platform’s apparent reach. I’ve had several tweets gain (fairly?) significant traction, e.g. my <a href="https://x.com/simonw/status/1915423828987228385">blog migration tweet</a> was noticed by <a href="https://simonwillison.net/">Simon Willison</a>, and subsequently got 36,000 views. Yet despite his generous attention, the number of actual blog visits was comically low.</p>

<p>Another one of my posts got <em>over 9000</em> views on Twitter — not bad, right? But in fact, only 100 people had actually clicked through to read the full blog post. This represents roughly a 1% conversion rate, which suggests that Twitter’s engagement metrics are totally disconnected from genuine reader interest. In any event, most of my posts get only a few hundred views (i.e. less than a quarter of my follower count), since I don’t pay for that blue check mark (or use Twitter as its own microblogging platform, now that I’ve chosen to “own” my content).</p>

<h2 id="linkedin-steady-and-reliable-for-now">LinkedIn: Steady and reliable (for now)</h2>

<p>In contrast to Twitter’s up-and-down metrics, LinkedIn has been way more steady over the month. My posts on the platform typically generate between 800 and 4000 views. LinkedIn consistently delivers sustained visibility (often spanning multiple weeks) for my blog posts. And this seems to work: over half of my blog visits originate from LinkedIn! I was kind of shocked to see this, since academics and researchers rarely use LinkedIn, and the platform is generally known as a pretty low-signal source of information…</p>

<p>If there’s any social media platform that I might be more inclined to post on regularly, it would be LinkedIn. However, for now I’m not planning to change my usage patterns significantly. After all, we’ve seen what can happen when unhinged billionaires acquire social media platforms (and I’m not eager to invest heavily in a platform that could become even more pay-to-play overnight).</p>

<h2 id="bluesky-the-surprising-dark-horse">Bluesky: The surprising dark horse</h2>

<p>Bluesky has also been a surprisingly helpful platform despite its small apparent size. Although Bluesky still feels like a niche social media site<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> compared to the other two, it’s driving 26% of my social media referral traffic, placing it solidly ahead of Twitter.</p>

<p>On one hand, I actually have the highest follower count on Bluesky (among the three platforms). On the other, Bluesky’s chronological timeline makes it much harder to go viral compared to its competitors. This design constraint probably favors consistent, regular engagement from regular bloggers, over the transient spikes that characterize viral posts on other platforms. Or maybe Bluesky simply has a more dedicated user base that actually spends more time connecting with others on the platform rather than scrolling past things without reading.</p>

<h2 id="what-this-means-for-writers-and-researchers">What this means for writers and researchers</h2>

<p>I’ve found that regular blogging has made me a better writer, and helped me <a href="https://jwuphysics.github.io/blog/2025/04/lowering-the-barrier-for-writing/">organize my thoughts and clarify my thinking</a>. It’s also served as a public record for my own future reference. These benefits exist regardless of whether anyone reads my posts!</p>

<p>This brief foray into my blog post analytics has reminded me of a lesson that’s easy to forget in the social media age: writing should be pursued for its own sake, not simply as fuel for social media engagement. The data certainly provides useful tidbits about platform effectiveness, but the more important takeaway is that I have very little control over social media platforms, and that expanding social media reach is totally orthogonal to writing a half-decent post.</p>

<p>Moving forward, I’m not planning to spend any more effort crafting platform-specific social media posts. Instead, I’ll focus on what actually matters: writing blog posts that help me think more clearly and document my academic and intellectual journey. Hopefully, if your writing is genuinely useful to you — i.e., it helps you understand something better, or articulate ideas you needed to work through — then readers will likewise find it valuable, regardless of which platform brought them there.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The time and day of week that you post has a huge impact on social media engagement. I used to care about this a bit more, at least enough to recognize this factoid, but I’ve since become more constrained by <em>having two small kids</em> and <em>giving less of a crap</em>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I mean this in a good way! Bluesky actually has a legitimate astronomy community. Check out the various astronomy feeds — especially the <a href="https://bsky.app/profile/did:plc:jcoy7v3a2t4rcfdh6i4kza25/feed/research">AstroSci</a> feed for astronomy researchers! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="personal" /><category term="blogging" /><category term="social-media" /><summary type="html"><![CDATA[If you’re a blogger or researcher sharing your work online, you’ve probably wondered: is social media actually useful for disseminating your writing? I’ve been asking myself this question since returning to blogging just over a month ago.]]></summary></entry><entry><title type="html">The benefits of slow growth, misguided rabbit holes, and painful mistakes</title><link href="https://jwuphysics.github.io/blog/2025/05/slow-growth-rabbit-holes/" rel="alternate" type="text/html" title="The benefits of slow growth, misguided rabbit holes, and painful mistakes" /><published>2025-05-25T00:00:00+00:00</published><updated>2025-05-25T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/05/slow-growth-rabbit-holes</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/05/slow-growth-rabbit-holes/"><![CDATA[<p>I am a self-confessed <em>productivity junkie</em>. I hate wasting time. And if you scroll through social media, or even my blog posts, you might think that the typical research or learning process is just a happy, monotonic hill climb, capped off with regular announcements of new discoveries or gained expertise. But what if the most important lessons emerge not from unencumbered progress, but rather from seemingly aimless pursuits and the frustration of doing things badly? This post is a tribute to all those times we got stuck and emerged with nothing to show for it, because those “unproductive” moments lead to some of the most important lessons we can ever learn.</p>

<p>A lot of this post stems from my own experience, and I hope that it’s useful for you too. (But one of the takeaways here is that <em>sometimes you have to make your own mistakes</em> in order to learn.) Here are a few other blog posts that have impacted my thinking on productivity:</p>
<ul>
  <li><a href="https://www.benkuhn.net/impact/">Impact, agency, and taste</a> by Ben Kuhn</li>
  <li><a href="https://www.alignmentforum.org/s/5GT3yoYM9gRmMEKqL/p/hjMy4ZxS5ogA9cTYK">How I think about my research process</a> by Neel Nanda (note that there are three parts)</li>
  <li><a href="https://danluu.com/productivity-velocity/">Some reasons to work on productivity and velocity</a> by Dan Luu</li>
  <li><a href="https://jvns.ca/blog/2014/03/10/help/">Hacker School’s Secret Strategy for Being Super Productive (or: Help.)</a> by Julia Evans</li>
  <li><a href="https://guzey.com/productivity/">Every productivity thought I’ve ever had, as concisely as possible</a> by Alexey Guzey</li>
  <li><a href="https://www.paulgraham.com/top.html">The top idea in your mind</a> by Paul Graham</li>
</ul>

<p>I’m sure there are more that I’ve internalized, but can’t quite remember right now; feel free to reach out to me if you know of other interesting ones.</p>

<h2 id="the-exploratory-phase">The exploratory phase</h2>

<p>It’s hard to quantify the value in trying out new research directions. Obviously you can’t wander aimlessly <em>all the time</em>, or spend all your free time listening to <a href="https://notebooklm.google.com/">NotebookLM</a> audio overviews of random papers that piqued your interest. But many of my best ideas emerged despite me not beginning with a tangible goal.</p>

<p>Back when I was in grad school,<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> I noticed that a <a href="https://www.kaggle.com/competitions/galaxy-zoo-the-galaxy-challenge">Kaggle Galaxy Zoo challenge</a> was solved using deep convolutional neural networks (CNNs). I was very interested in applying deep learning to galaxies, so it was gratifying to see <a href="https://sander.ai/">Sander Dieleman</a> et al. accurately <a href="https://arxiv.org/abs/1503.07077">predict citizen scientist vote fractions of galaxy morphology</a> purely using image cutouts.</p>

<p>Motivated by this successful application… I decided to proceed by bashing every project I could find with this newfound hammer. After all, I was a curious grad student wielding a powerful method. What else did you expect? Classifying galaxy morphology had already been done, but I recognized that you could tackle all sorts of other tasks, e.g., identifying merging galaxies, separating compact galaxies from stars, predicting the bulge-to-disk ratio, etc.</p>

<p>Along the way, though, I noticed that nearly everyone was interested in <strong>classification</strong> problems, e.g., identifying galaxy morphological type or galaxy mergers, but these seemed to be an incredibly limited class of problems. After all, the cosmos is <em>weakly modal</em>, and although astronomers loved to classify things, these categories are honestly quite arbitrary.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> I was far more interested in <strong>regression</strong> problems, e.g., how does the galaxy’s star formation rate or chemical abundance scale with its appearance? Up until ~2017, very few people had addressed the idea of <em>regression</em> deep learning problems in astronomy.</p>

<p>Anyway, after a few months of going down random rabbit holes, I realized that there were loads of interesting regression problems that hadn’t been addressed with deep learning. I chatted with <a href="https://www.linkedin.com/in/theboada/">Stephen Boada</a>, and later on consulted with <a href="https://www.physics.rutgers.edu/~gawiser/">Eric Gawiser</a> about these ideas; we quickly homed in on the task of predicting galaxy metallicity from images. You can read more about that <a href="https://jwuphysics.github.io/blog/2020/05/exploring-galaxies-with-deep-learning/">here</a>.</p>

<p>These exploratory phases are helpful for letting your mind make free-form connections; diving down rabbit holes is basically feeding that part of your brain. But watch out for the slippery slope: it’s tempting to put out theories without ever figuring out how to evaluate (or invalidate) them. In other words, it’s fine to follow random streams of consciousness, but eventually you’ll need to land on a <em>well-posed</em> research question. Otherwise, you’d never crystallize any kind of problem worth solving!</p>

<p>I think about this transition as going from the <em>exploratory/meandering</em> phase to the <em>mapmaking</em> phase. That transition happens once you have a falsifiable hypothesis, after which you can begin charting out a plan to test it. Let’s talk about the <em>mapmaking</em> phase.</p>

<h2 id="from-meandering-to-mapmaking-distilling-down-to-a-one-sentence-hypothesis">From meandering to mapmaking: distilling down to a one-sentence hypothesis</h2>

<p>One of the most important lessons I’ve learned is this: <strong>Whenever you are in an exploratory phase, look for every opportunity to distill your ideas into a testable, one-sentence hypothesis.</strong></p>

<p>Side note: LLMs are extremely helpful here! As described in a <a href="https://jwuphysics.github.io/blog/2025/04/four-ways-i-use-llms/">previous post</a>, under the heading <em>1. Exploring ideas and surveying prior art</em>, I lean on LLMs to (i) critique my vague thoughts, (ii) decompose promising ideas into atomic concepts, and (iii) survey the literature to see whether these ideas have been implemented before. If you’re interested in critiquing your thoughts, then you must avoid <a href="https://openai.com/index/sycophancy-in-gpt-4o/">LLM sycophancy</a> at all costs! Try a prompt based on something like this:</p>

<blockquote>
  <p>Please critique my thoughts on <em>Topic_X</em> (appended below). Is my hypothesis vague or incomplete? How might it be tested? Has it been done before? Include diverse opinions or parallel ideas from the literature on <em>Topic_X</em> or related concepts.</p>

</blockquote>

<p>If you can’t (eventually) articulate a testable hypothesis, then you should be slightly worried. Either you are still learning the ropes for a new topic (<em>good!</em>), or you are familiar with a topic/method but cannot figure out what you want to do with it (<em>not good!</em>). Give yourself a hard deadline (e.g. a week) to distill a one-sentence hypothesis from all the information you’ve gained while chasing down rabbit holes, and if you still can’t come up with anything concrete, then put those thoughts on the backburner.</p>

<p>As soon as you come across a new idea, rigorously consider the following:</p>
<ul>
  <li>Do I understand the method well enough to sketch out the pseudo code?</li>
  <li>Do I understand the prior art and potential, e.g., is it sufficiently novel and/or impactful?</li>
  <li>Can I write down a testable hypothesis?</li>
</ul>

<p>If you can’t address these questions, then put the idea on the backburner. In fact, for more experienced researchers, you’ll want a much tighter feedback loop; each of these questions should be answerable within a few minutes. I come up with a dozen nebulous ideas every day, so it’s imperative that I set a five minute deadline for constructing a hypothesis, and if it fails to meet that bar, then I let it sink back into my subconscious.</p>

<p>Alternatively, there are cases in which it’s better to have a <strong>strong</strong> rather than a quick feedback loop. I’ll touch on that in the next section.</p>

<p>But once you <strong>do</strong> find a testable hypothesis, then try to write it down into a single sentence. This can be tricky, but the point of the exercise is to practice conveying the essence of the hypothesis, and winnowing out extraneous details. Once you have something specific enough to actually disprove, and you’re satisfied that it captures the core research question you’d like to solve, then congratulations! You’re done exploring (for now) — it’s time for mapmaking.</p>

<p>Here’s a concrete example of when I failed. Around 2020, I got interested in generative models for galaxies. “I want to apply VAEs/GANs/diffusion models to astronomy” sounds great when you’re reading the DDPM paper, but it’s also a completely vague and unfalsifiable thought — there’s no scientific question in here. You could spend months on that without it amounting to anything. But instead, we could start thinking about more testable hypotheses:</p>
<ul>
  <li>Can generative models construct galaxy images at high enough fidelity to forecast new survey observations? (<em>This is still too vague, but we’re getting closer.</em>)</li>
  <li>Can generative models predict JWST-like mid-infrared images from HST imaging to a reduced chi-squared value of about 1, thereby confirming a <a href="https://arxiv.org/abs/2503.03816">tight connection between galaxies’ optical and infrared light</a>?</li>
  <li>If we generate mock survey images based on <em>N</em>-body simulations with realistic galaxy morphologies and brightnesses, and select galaxies via aperture photometry based on rest-frame optical colors, then does it result in a biased matter power spectrum relative to doing the same with halo occupation distribution models that <em>don’t</em> include morphology?</li>
</ul>

<p>I’m not saying these are particularly good research ideas. But the second and third are definitely more testable than the first one, and each of those three is far more useful than “can we train VAEs/GANs/diffusion models over my favorite galaxy dataset?”</p>

<p>Specifying the hypothesis in this way makes it obvious that the hypothesis could be true or false. Better yet, it implies <em>how</em> the hypothesis might be validated or falsified. Still, we might have a vague but interesting idea, and we can think of multiple tests that could (in)validate parts of this unclear idea. In that case, we can hierarchically break down the target idea (e.g., <em>the latent space of generative models trained on galaxy images has a bijective map to the space of astrophysical properties</em>) into more specific hypotheses, like:</p>
<ul>
  <li>A generative model trained on SDSS galaxy image cutouts will have some linear combination of latent vectors that has a high Pearson correlation coefficient with HI velocity width.</li>
  <li>A generative model’s latent vector that is positively correlated with inclination angle will also be anticorrelated with star formation rate from optical tracers.</li>
</ul>
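
<p>To make the first of these concrete, here is a minimal sketch of how such a test could be set up. It assumes you already have a table of latent vectors (one row per galaxy) and a measured velocity width for each; the function name <code>heldout_pearson</code>, the 50/50 split, and the synthetic data are all made up for illustration and are not taken from any real model or survey. The linear combination is fit on one half of the sample and scored on the other half, so a high correlation cannot come from overfitting alone.</p>

```python
import numpy as np

def heldout_pearson(latents, target, train_fraction=0.5, seed=0):
    """Fit a linear combination of latent dimensions on one split of the
    data and return its Pearson r with the target on the held-out split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(target))
    n_train = int(train_fraction * len(target))
    train, test = order[:n_train], order[n_train:]
    # Append a constant column so the fit includes an intercept.
    design = np.column_stack([latents, np.ones(len(target))])
    weights, *_ = np.linalg.lstsq(design[train], target[train], rcond=None)
    prediction = design[test] @ weights
    return np.corrcoef(prediction, target[test])[0, 1]

# Synthetic stand-ins (assumptions, not real data): 1000 "galaxies" with
# 16 latent dimensions, and a mock velocity width that really is a noisy
# linear function of the latents.
rng = np.random.default_rng(42)
latents = rng.normal(size=(1000, 16))
mock_width = latents @ np.linspace(-1.0, 1.0, 16) + 0.5 * rng.normal(size=1000)
unrelated = rng.normal(size=1000)  # control: no relation to the latents

print("r (mock width):", heldout_pearson(latents, mock_width))
print("r (control):   ", heldout_pearson(latents, unrelated))
```

<p>The control matters as much as the signal: the hypothesis is only falsifiable if you decide beforehand what correlation counts as “high” and check that an unrelated quantity does not clear the same bar.</p>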

<p>I want to admit that I’ve often gotten stuck with a tantalizing idea, but couldn’t (or didn’t) find a way to test a concrete hypothesis. For example, I wanted to make “superresolution” work in astronomy, and even wrote some <a href="https://jwuphysics.github.io/blog/2021/01/galaxy-unets/">blog</a> <a href="https://jwuphysics.github.io/blog/2021/12/galaxy-gans/">posts</a> about it. Whereas smarter researchers like Francois Lanusse came up with <a href="https://arxiv.org/abs/2008.03833">genuinely useful applications for generative modeling</a>, I was just messing around. <strong>I hadn’t ever left the exploratory phase, and although I believed I was mapping out a trajectory, I had no destination in mind!</strong></p>

<h2 id="the-irreplaceable-experience-of-doing-a-really-really-bad-job">The irreplaceable experience of doing a really, really bad job</h2>

<p>Let’s actually zoom out for a moment, because there’s an important lesson to be learned here.</p>

<p>I mentioned that it’s useful to have tight or quick feedback loops: arrive at testable hypotheses so that you can begin the real work of validating or falsifying them. This is the actual learning process. Aimless exploration often has the <em>appearance</em> of learning, but carries little substance… except for one critical meta-lesson. <strong>Sometimes you simply need to experience the feeling of aimlessness in order to learn how to overcome it.</strong></p>

<p>Perhaps you don’t know what kind of statistical methods are needed to confirm or falsify a hypothesis. Or maybe you need to collect a lot more data, and that’ll take a year. Or perhaps your PhD supervisor really thinks that the conjecture is true, but you can’t figure out how to confirm it. In all these cases, you’re left without any useful feedback, or an obvious way to proceed, so you just flounder about and watch weeks, months, maybe even years go by. <strong>You’re absolutely failing at your task. Keep it up!</strong></p>

<p>I can give a personal anecdote: I got totally stuck during my first research project in graduate school. And by stuck, I mean <em>I was absolutely going down random wrong rabbit holes and wasting my time</em>. I was tasked with “stacking far-infrared, dust continuum, and gas spectral lines from high-redshift cluster galaxies.” Stacking just means that, since we know the locations (and recession velocities) of a sample of galaxies, we can <em>average</em> how bright those galaxies are using some other measurements. Simple enough, right? But at the time, I didn’t know what I didn’t know:</p>
<ul>
  <li>At far-infrared wavelengths, there is a considerable background emission from high-redshift galaxies, and especially so in galaxy cluster fields where gravitational lensing can amplify the background flux.</li>
  <li>Far-infrared emission detected via bolometers generally has very low spatial resolution, so that if you are stacking lots of galaxies within a small angular size, then you’ll accidentally double count sources (that are well-separated at optical wavelengths, but completely “blended” or overlapping at long wavelengths).</li>
  <li>I was relying on a sample of galaxy clusters spanning redshifts between 0.3 &lt; <em>z</em> &lt; 1.1, meaning that my <em>flux-limited samples</em> would result in completely heterogeneous constraints spanning nearly two orders of magnitude in luminosity or mass.</li>
  <li>Ultimately, stacking would not have detected anything in the highest-redshift clusters unless the <em>average cluster member</em> had lots of dust-obscured star formation — a very bold assumption.</li>
  <li>There were a few galaxies that were actually detected in long-wavelength emission, but there were so few that it was impossible to make any kind of statistical statement about them.</li>
  <li>The whole project was extremely challenging, and probably could have been framed as “we expected to find nothing in these extremely massive galaxy clusters, and yet we found some rare dusty, gas-rich, star-forming galaxies!” (A member of my qualifying exam committee actually said this, but I didn’t take it to heart.)</li>
</ul>
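
<p>For readers who haven’t seen stacking before, the basic operation can be sketched in a few lines. This is a toy version under strong assumptions: a single 2D map, integer pixel positions, white noise, and point sources that each sit well below the noise level. The function name <code>stack_cutouts</code> and all of the numbers are invented for this example. It deliberately ignores every complication in the list above (background emission, blending, heterogeneous flux limits), which is exactly where the real difficulty lies.</p>

```python
import numpy as np

def stack_cutouts(image, positions, half_size):
    """Mean-stack square cutouts of `image` centered on (row, col) positions."""
    size = 2 * half_size + 1
    cutouts = []
    for row, col in positions:
        r0, c0 = max(row - half_size, 0), max(col - half_size, 0)
        cutout = image[r0:row + half_size + 1, c0:col + half_size + 1]
        # Sources too close to the edge give truncated cutouts; skip them.
        if cutout.shape == (size, size):
            cutouts.append(cutout)
    if not cutouts:
        raise ValueError("no position yields a full cutout")
    return np.mean(cutouts, axis=0)

# Toy map: unit-variance noise plus 400 sources at known positions,
# each only half as bright as the per-pixel noise.
rng = np.random.default_rng(0)
image = rng.normal(scale=1.0, size=(200, 200))
positions = rng.integers(5, 195, size=(400, 2))
for row, col in positions:
    image[row, col] += 0.5

stacked = stack_cutouts(image, positions, half_size=5)
print("stacked central pixel:", stacked[5, 5])
print("stacked corner pixel: ", stacked[0, 0])
```

<p>Averaging N cutouts suppresses uncorrelated noise by roughly the square root of N, which is why the central pixel stands out in the stack even though no individual source is detectable.</p>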

<p>To be clear, my PhD advisor was extremely supportive and helpful in providing high-level guidance. I just happened to be pushing in a challenging research direction, and I was too inexperienced in astronomy to have salient intuitions on how to interpret my (null) results. After navigating these unexpected twists and turns over three years,<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> I finally had some results and a paper draft to circulate among my co-authors.</p>

<p>One of the co-authors (I won’t say who) remarked, “<em>Whew. That was a slog to read through. It really reads like a student’s first paper… you gotta make it more interesting than that.</em>”</p>

<p>I had dutifully reported the results from every misguided rabbit hole, mentioned our incorrect motivations, and carefully explained the statistical procedures that were far too weak to result in anything. But my co-author’s message cut through all the nonsense: <strong>Give your audience a reason to read this. Nobody cares about your hard work.</strong></p>

<h2 id="dont-skip-the-struggle-dont-repeat-the-struggle">Don’t skip the struggle, don’t repeat the struggle</h2>

<p>This was the lesson I needed to learn, and I’m not sure there were any shortcuts. Instead of just learning what <em>to do</em>, I had to suffer through and internalize what <em>not to do</em>. I was embarrassed at how inefficient I was and completely sick of this project.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> But I had to see it through.</p>

<p>Ultimately, it took over <strong>four years</strong> to get my <a href="https://arxiv.org/abs/1712.04540"><strong>first paper</strong></a> published. The end result wasn’t great; in hindsight, I could have written it much better, and I probably could have made it sound far more interesting.</p>

<p>And yet, I think this was one of the best things to happen to me as a graduate student. <strong>I truly believe that there’s no substitute for the experience of struggling to conclude a project, and then pushing through anyway</strong>. Each and every fruitless endeavor builds intuition for future projects. Critiques and criticisms hone your research taste. Slow learning by trial and error ensures that you have an airtight understanding of those topics.</p>

<p>The greatest part of this story is that I’ve already gone through this journey once, so I don’t need to repeat it! I’m sure that I’ll run into new obstacles and challenges, and that I’ll be frustrated with my lack of productivity, but — with the benefit of hindsight — I can appreciate the slow learning process for what it is.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’ve previously written about my journey, including my serendipitous grad school years, <a href="https://jwuphysics.github.io/blog/2024/01/two-years-in-the-tenure-track">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>There’s a whole blog post in here, but in essence: many astrophysical phenomena exist along a continuum. That continuum <em>might</em> be bimodal, like elliptical vs disk galaxies, or broad-line vs narrow-line active galactic nuclei, or Type O/B/A/etc stars, but there is rarely a firm separation of classes. Sure, there are obvious class distinctions like star–galaxy separation, but you know I’m not talking about that. If you want to hear more, stay tuned for a future post, or check out this <a href="https://youtu.be/kpMXCcGyydU&amp;t=1529">MLClub debate</a> back in 2021. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>I pulled <em>so many</em> all-nighters during my first year of graduate school. Probably chasing down loads of random rabbit holes on that first project. And for what? None of it was useful for writing my first paper. And yet it <em>was</em> useful, because now I know that I don’t need to repeat the experience! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://jegpeek.space/">Josh Peek</a> (and I’m sure there’s earlier attribution) often says, “hating your paper is a necessary but not sufficient condition to getting it published.” <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="academia" /><category term="personal" /><category term="productivity" /><summary type="html"><![CDATA[I am a self-confessed productivity junkie. I hate wasting time. And if you scroll through social media, or even my blog posts, you might think that the typical research or learning process is just a happy, monotonic hill climb, capped off with regular announcements of new discoveries or gained expertise. But what if the most important lessons emerge not from unencumbered progress, but rather from seemingly aimless pursuits and the frustration of doing things badly? This post is a tribute to all those times we got stuck and emerged with nothing to show for it, because those “unproductive” moments lead to some of the most important lessons we can ever learn.]]></summary></entry><entry><title type="html">Foundation Models in Astronomy</title><link href="https://jwuphysics.github.io/blog/2025/05/foundation-models-in-astronomy/" rel="alternate" type="text/html" title="Foundation Models in Astronomy" /><published>2025-05-16T00:00:00+00:00</published><updated>2025-05-16T00:00:00+00:00</updated><id>https://jwuphysics.github.io/blog/2025/05/overview-foundation-models-astronomy</id><content type="html" xml:base="https://jwuphysics.github.io/blog/2025/05/foundation-models-in-astronomy/"><![CDATA[<p>Here’s a casual introduction to foundation models and how they might impact astronomy research in the coming years. I’m writing this on the train back from New York to Baltimore, having just wrapped up the <a href="https://events.simonsfoundation.org/event/0aff2690-f1cb-485f-833a-429b6c7eb7ef/summary?tm=8eYg1qvbYoaoB-i3qiSGVDkdnLEYU8RX4tCGRKsTY_w">Foundation Models in Astronomy</a> workshop at the Flatiron Institute Center for Computational Astrophysics. 
My co-organizers and I are planning to write up a more comprehensive blog post based on our workshop discussions; in the meantime, you’ll just have to settle for this.</p>

<h2 id="foundation-models-are-here-to-stay">Foundation models are here to stay</h2>

<p><a href="https://crfm.stanford.edu/report.html">Foundation models</a> are the base pre-trained neural networks for large language models (LLMs) like ChatGPT or Claude, vision models like DALL·E, and even automatic speech recognition (ASR) models like the ones that automatically caption your YouTube videos.</p>

<p>These models learn representations of data that can distinguish between examples in the training dataset. However, they’re not really trained in the usual supervised fashion; instead, foundation models undergo <em>self-supervised</em> learning by optimizing a contrastive or generative objective.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Foundation models seek to learn <em>how</em> your data can be represented or generated. By minimizing a contrastive loss, you task your model with creating similar representations for the same example “viewed” differently (or transformed differently under a data augmentation procedure), and different representations for different data examples. If, instead, you minimize a generative loss, then you task your model with figuring out whatever representations are useful for generating another patch of the image or the next word in a text corpus. I’d wager that contrastive losses lead to stronger discriminative power, and that generative losses lead to better generative power, but I don’t actually have any data to support this intuition. Oh well.</p>
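
<p>As a rough illustration of the contrastive case, here is a NumPy sketch of an InfoNCE-style loss, which is one common choice of contrastive objective (the models discussed in this post may use different variants). The function name <code>info_nce_loss</code>, the temperature of 0.1, and the random “embeddings” are assumptions made for the example. Row <em>i</em> of each input is taken to be the same example under two different augmentations, and every other row in the batch serves as a negative.</p>

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style contrastive loss for two batches of embeddings.

    Row i of view_a and row i of view_b embed the same example under two
    different augmentations; all other rows act as negatives.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # scaled cosine similarities, (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct match for row i is column i.
    return -np.mean(np.diag(log_prob))

# Random vectors standing in for encoder outputs (not real embeddings).
rng = np.random.default_rng(0)
examples = rng.normal(size=(64, 32))
augmented = examples + 0.05 * rng.normal(size=(64, 32))  # same examples, perturbed
unrelated = rng.normal(size=(64, 32))                    # different examples

print("matched views:  ", info_nce_loss(examples, augmented))
print("unmatched views:", info_nce_loss(examples, unrelated))
```

<p>The loss is low when each example’s two views are closer to each other than to anything else in the batch, and high when they are not, which is the behavior described above.</p>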

<p>The real power of foundation models is that they can (1) map your data into semantically meaningful embedding representations and (2) help catalyze specific downstream tasks.</p>

<h3 id="1-the-power-of-embedding-representations">(1) The power of embedding representations</h3>

<p>Why should you care about latent representations of your data (i.e. your embedding space)? By converting data into embedding vectors, you can use that embedding space to perform comparisons. Concretely, if your embedding space captures the semantic meanings of your dataset, then you’ll be able to measure the semantic similarity of two objects (e.g. by using a cosine similarity or some other distance measure). You can even learn a joint representation across multiple “modalities” such as text and audio.</p>

<p>For example, we used text embeddings to compare astronomer queries against arXiv paper abstracts when we sought to <a href="https://arxiv.org/abs/2405.20389"><em>evaluate LLMs for astronomy research</em></a>. By mapping both the user query and the paper abstracts into this embedding space, and storing the latter in a vector database, we could retrieve (hopefully) relevant papers on the basis of the user query. Over the course of the JHU CLSP <a href="https://www.clsp.jhu.edu/workshops/2024-jelinek-summer-workshop-on-speech-and-language-technology/">2024 JSALT workshop</a>, we dramatically improved the semantic similarity search pipeline and retrieval engine, which was published alongside many other cool results in the <a href="https://arxiv.org/abs/2408.01556">Pathfinder paper</a> by <a href="https://kartheikiyer.github.io/">Kartheik Iyer</a>. <a href="https://charlesponeill.com/">Charlie O’Neill</a> and <a href="https://christine8888.github.io/">Christine Ye</a> were also able to extract, disentangle, and interpret semantic concepts in the astronomy and ML literature by training sparse autoencoders over these <a href="https://arxiv.org/abs/2408.00657">paper embeddings</a>!</p>
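
<p>The embed-then-retrieve pattern described above can be sketched with nothing more than NumPy. This is not the Pathfinder implementation: the three-dimensional vectors below are hand-written placeholders for the output of a real text embedding model, a plain matrix product stands in for the vector database, and the names <code>normalize</code> and <code>top_k</code> are invented for the example.</p>

```python
import numpy as np

def normalize(vectors):
    """Scale vectors to unit length along the last axis."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

def top_k(query_embedding, document_embeddings, k=3):
    """Return indices and cosine similarities of the k closest documents."""
    scores = normalize(document_embeddings) @ normalize(query_embedding)
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Placeholder "abstract embeddings": documents 0 and 1 point in nearly the
# same direction, while documents 2 and 3 are about something else.
documents = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

indices, similarities = top_k(query, documents, k=2)
print("closest documents:", indices)
print("cosine similarities:", similarities)
```

<p>Because every vector is normalized first, the dot product is exactly the cosine similarity, so the ranking depends only on direction in embedding space and not on vector length.</p>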

<h3 id="2-all-purpose-base-models-for-any-task">(2) All-purpose base models for any task</h3>

<p>Building up this semantically rich representation of your dataset also provides an excellent starting point for any other machine learning task. We can view this pre-trained foundation model as a <em>base</em> for some later <em>downstream</em> task. For example, if a foundation model has seen all kinds of real-world images, and learned to produce self-consistent representations of the semantic content within those images, then it should be able to classify bird species or segment cars and pedestrians and roads in a much more data-efficient way.</p>

<h2 id="foundation-models-in-astronomy">Foundation models in astronomy</h2>

<p>Foundation models are also becoming common across astronomy! In the past few years, we’ve seen foundation models trained on galaxy image cutouts (e.g., by <a href="https://arxiv.org/abs/2012.13083">Hayat et al. 2020</a>, <a href="https://arxiv.org/abs/2110.00023">Stein et al. 2021</a>, and <a href="https://arxiv.org/abs/2405.14930">Smith et al. 2024</a>), stellar spectra (<a href="https://arxiv.org/abs/2411.04750">Koblischke &amp; Bovy 2024</a>), and even multiple modalities like images and spectra (<a href="https://arxiv.org/abs/2310.03024">Parker &amp; Lanusse et al. 2024</a>) or photometric and spectroscopic time series (<a href="https://arxiv.org/abs/2408.16829">Zhang et al. 2024</a>). And there are many more coming soon!</p>

<p>A critical question remains: Are people actually using foundation models to make new discoveries? In general, the answer is no. Most citations are simply from other papers that are also releasing their own ML models. A notable exception is from Galaxy Zoo,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> whose Zoobot model by <a href="https://arxiv.org/abs/2102.08414">Walmsley et al. 2021</a> has amassed ~200 citations leading to actual science! It remains to be seen whether current and next-generation foundation models will deliver real scientific value.</p>

<p>As I mentioned at the top, the workshop organizers will be writing up another blog post focusing on our discussions and how we might guide our community of astronomical ML practitioners. Stay on the lookout for that!</p>

<p><strong>Edit (2025-05-19)</strong>: I’m including a list of foundation models in astronomy that I currently know about. There are arguably more, e.g. autoencoder variants such as <a href="https://arxiv.org/abs/2211.07890">spender</a>, but I’m trying to focus on large-scale foundation models that will (hopefully) be able to generalize well to many tasks. Feel free to reach out if you think I’ve made an egregious omission.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<table>
  <thead>
    <tr>
      <th>Foundation Model</th>
      <th>Domain</th>
      <th>Method</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AstroCLIP (<a href="https://arxiv.org/abs/2310.03024">Parker et al. 2023</a>; <a href="https://github.com/PolymathicAI/AstroCLIP">GitHub</a>)</td>
      <td>Multi-modal (images and spectra)</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>Maven (<a href="https://arxiv.org/abs/2408.16829">Zhang et al. 2024</a>)</td>
      <td>Multi-modal time series (photometry and spectra)</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>AstroM³ (<a href="https://arxiv.org/abs/2411.08842">Rizhko &amp; Bloom 2024</a>)</td>
      <td>Multi-modal time series (photometry, spectra, and metadata)</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>*AstroPT-Euclid (<a href="https://arxiv.org/abs/2503.15312">Siudek et al. 2025</a>)</td>
      <td>Multi-modal (images and photometry)</td>
      <td>Generative</td>
    </tr>
    <tr>
      <td>FALCO (<a href="https://arxiv.org/abs/2504.20290">Zuo et al. 2025</a>)</td>
      <td>Kepler time-series</td>
      <td>Generative</td>
    </tr>
    <tr>
      <td>SpectraFM (<a href="https://arxiv.org/abs/2411.04750">Koblischke &amp; Bovy 2024</a>)</td>
      <td>Stellar spectra (synthetic &amp; real)</td>
      <td>Generative</td>
    </tr>
    <tr>
      <td>*Gaia spectra (<a href="https://arxiv.org/abs/2410.16081">Buck &amp; Schwarz 2024</a>)</td>
      <td>Stellar spectra (Gaia XP and RVS)</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>SSL for DESI Legacy Survey (<a href="https://arxiv.org/abs/2110.00023">Stein et al. 2021</a>; <a href="https://github.com/georgestein/ssl-legacysurvey">GitHub</a>)</td>
      <td>DESI Legacy Survey galaxy images</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>GZ-Evo (<a href="https://arxiv.org/abs/2206.11927">Walmsley et al. 2022</a>; <a href="https://github.com/mwalmsley/galaxy-datasets">GitHub</a>)</td>
      <td>Galaxy images (multiple observatories)</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>AstroPT (<a href="https://arxiv.org/abs/2405.14930">Smith et al. 2024</a>; <a href="https://github.com/Smith42/astroPT">GitHub</a>)</td>
      <td>DESI Legacy Survey galaxy images</td>
      <td>Generative</td>
    </tr>
    <tr>
      <td>Radio Galaxy Zoo (<a href="https://arxiv.org/abs/2204.08816">Slijepcevic et al. 2022</a>)</td>
      <td>Radio-wavelength galaxy images</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>SSL for Radio Interferometric Images (<a href="https://arxiv.org/abs/2411.14078">Cecconello et al. 2024</a>; <a href="https://github.com/dr4thmos/solo-learn-radio">GitHub</a>)</td>
      <td>Radio interferometric images</td>
      <td>Contrastive</td>
    </tr>
    <tr>
      <td>SSL for LOFAR (<a href="https://arxiv.org/abs/2503.19111">Baron Perez et al. 2025</a>)</td>
      <td>Radio galaxy images (LoTSS-DR2)</td>
      <td>Contrastive</td>
    </tr>
  </tbody>
</table>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>You shouldn’t be surprised to find that Lilian Weng has incredibly comprehensive blog posts on <a href="https://lilianweng.github.io/posts/2019-11-10-self-supervised/">self-supervised learning</a> and specifically <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Arguably, this is expanding the definition of a foundation model because it is being trained via <em>supervised</em> learning. Zoobot learns to predict vote fractions of citizen scientists’ morphological classifications. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>But if you send me your paper/method and I add it to this post, then I’ll add an asterisk so everybody will know ;) <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>John F. Wu</name><email>jowu@stsci.edu</email></author><category term="astronomy" /><category term="computer-vision" /><category term="foundation-models" /><category term="llms" /><summary type="html"><![CDATA[Here’s a casual introduction to foundation models and how they might impact astronomy research in the coming years. I’m writing this on the train back from New York to Baltimore, having just wrapped up the Foundation Models in Astronomy workshop at the Flatiron Institute Center for Computational Astrophysics. My co-organizers and I are planning to write up a more comprehensive blog post based on our workshop discussions; in the meantime, you’ll just have to settle for this.]]></summary></entry></feed>