I decided to spend two weeks hopping from continent to continent to take part in back-to-back astro-statistics-tech events: the COIN Residence Program and AstroHackWeek. A year after having left the field, formally speaking, I’ve chosen to make astronomy my hobby, taking “leave” to do research. It’s maybe not entirely sensible, but I’m doing this on my own terms. This post is a report on the things I learned during that sleep-deprived, mostly-barefoot fortnight.
First, a little background about the events.
The Cosmostatistics Initiative (COIN) is a collaboration that began in 2014 as a section of the International Astrostatistics Association (IAA) and brings together people across the Astronomer–Statistician spectrum to do some left-of-field research introducing new data analytic, statistical, and visualisation techniques to the astronomy community. The Residence Program happens once a year: we hang out in an apartment for a week, do some intense work on 2-3 projects well into the wee hours, write up half the papers, and still get some sun. This year we found ourselves in the lovely, warm city of Budapest.
Some of COIN on our day off to go sightseeing around Budapest. Credit: Pierre-Yves LaBlanche
AstroHackWeek (AHW), on the other hand, is a free-form event with elements of a workshop (pre-defined lectures) and a lot more making-it-up-as-we-go-along. Early on, 50 participants suggest topics they would like to learn about, identify one expert amongst the group and allow them to become teacher for an hour to a class of 10-20 (learning collectives are a brilliant idea!). Hack projects are the highlight, and are proposed both before and throughout the event; many of us will work on 2-4 at once. AHW also started in 2014, and was held this year at the Berkeley Institute for Data Science (BIDS).
AstroHackWeek getting settled in at GitHub HQ, San Francisco.
For completeness, I’m also going to mention dotAstronomy, a similar out-of-the-box unconference that started way back in 2008/9. It has evolved over the years, but by the time I attended dotAstro7 in Sydney in 2015, it had become a combination of idea-lectures, just one day of hack-projects, and a lot of unconference group discussions. More of the emphasis is on software/tech and education/communication.
OK, so here’s my brain-dump:
Mixture models are the result of combining models for different sub-populations or classes. This makes them relevant both to clustering and classification routines and to dealing with outliers. You can never really tease the sub-populations apart; the point is to model the combined dataset, and maybe provide a probability for each data-point that it belongs to a specific class.
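To make that concrete, here is a minimal sketch of a two-component, one-dimensional Gaussian mixture fitted with plain expectation-maximisation. Everything in it (the function names, the toy data, the choice of two Gaussians) is an assumption for illustration, not something taken from the projects at either event.

```python
import math
import random


def normal_pdf(x, mean, sigma):
    """Density of a 1-D Gaussian evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(data, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, sigmas, memberships), where memberships[i]
    is the probability that data[i] belongs to component 1.
    """
    # Crude initial guess: put the components at the extremes of the data.
    means = [min(data), max(data)]
    spread = (max(data) - min(data)) / 4.0 or 1.0
    sigmas = [spread, spread]
    weights = [0.5, 0.5]
    memberships = [0.5] * len(data)

    for _ in range(n_iter):
        # E-step: membership probability of each point in component 1.
        memberships = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, means[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, means[1], sigmas[1])
            memberships.append(p1 / (p0 + p1))

        # M-step: re-estimate each component from its weighted points.
        for k in (0, 1):
            resp = [m if k == 1 else 1.0 - m for m in memberships]
            total = sum(resp)
            means[k] = sum(r * x for r, x in zip(resp, data)) / total
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, data)) / total
            sigmas[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
            weights[k] = total / len(data)

    return weights, means, sigmas, memberships


# Toy data (made up): 300 points around 0 and 200 points around 6.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(6.0, 1.5) for _ in range(200)]

weights, means, sigmas, memberships = fit_two_gaussians(data)
```

The fit describes the combined dataset, and `memberships` gives each point a probability of belonging to the second population instead of a hard label. Points that sit between the two bumps end up with intermediate values, which is the "can never really tease them apart" part.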
In a hierarchical model, different parameters are relevant to different subsets of the data, arranged in layers. For example, for supernova data one needs to model individual light-curves (layer 1), properties of supernovae type Ia (layer 2), and cosmology (layer 3). I’m now convinced that at least half of all models are actually hierarchical, just not recognised and named as such.
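A toy version of the layering, as a sketch only: two layers instead of three, with a population mean on top and each object's true value underneath, observed with its own measurement error. The function names and numbers are made up, and the population scatter `pop_sigma` is assumed known, which a real analysis would infer along with everything else.

```python
def population_mean(observed, obs_sigma, pop_sigma):
    """Top layer: estimate the population mean from noisy measurements.

    Each measurement is weighted by the inverse of its total variance
    (measurement error plus intrinsic population scatter).
    """
    weights = [1.0 / (s * s + pop_sigma * pop_sigma) for s in obs_sigma]
    return sum(w * y for w, y in zip(weights, observed)) / sum(weights)


def shrunk_estimates(observed, obs_sigma, pop_sigma):
    """Bottom layer: estimate each object's true value.

    Each estimate is a precision-weighted average of the object's own
    measurement and the population mean, so noisy objects are pulled
    more strongly towards the population.
    """
    mu = population_mean(observed, obs_sigma, pop_sigma)
    estimates = []
    for y, s in zip(observed, obs_sigma):
        w_data = 1.0 / (s * s)
        w_pop = 1.0 / (pop_sigma * pop_sigma)
        estimates.append((w_data * y + w_pop * mu) / (w_data + w_pop))
    return mu, estimates


# Made-up measurements: the third object is far noisier than the others.
observed = [1.0, 2.0, 9.0]
obs_sigma = [0.5, 0.5, 3.0]
mu, estimates = shrunk_estimates(observed, obs_sigma, pop_sigma=1.0)
```

The two layers inform each other: the individual measurements set the population mean, and the population mean in turn tightens the estimate for the poorly measured object.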
Probabilistic Graphical Models
Probabilistic Graphical Models (PGMs) are diagrams that are very helpful for communicating parametrizations of models. You have to learn the “notation”, but once you do, they make great visual aids (see an example in this paper). Each quantity is marked as a distribution, data or a constant, and the relationships between them are drawn in. This is particularly good for describing hierarchical models.
Describing your covariance matrix with a Gaussian Process (GP) is the first step to modelling correlated errors. This is a complicated subject, and GPs certainly have limitations (maybe Gaussian isn’t appropriate!) but it’s better than just a diagonal matrix, and besides, GPs have useful properties that make things easier to calculate.
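As a rough illustration of the difference from a diagonal matrix, here is a sketch that builds a covariance matrix from a squared-exponential kernel plus independent measurement errors. The kernel choice, the function names and the numbers are all assumptions for the example; other kernels may suit real data better.

```python
import math


def squared_exponential(t1, t2, amplitude, length_scale):
    """Covariance between two points: large when they are close together."""
    d = (t1 - t2) / length_scale
    return amplitude ** 2 * math.exp(-0.5 * d * d)


def covariance_matrix(times, errors, amplitude, length_scale):
    """Kernel covariance plus independent per-point errors on the diagonal."""
    n = len(times)
    cov = [
        [squared_exponential(times[i], times[j], amplitude, length_scale)
         for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        cov[i][i] += errors[i] ** 2
    return cov


# Made-up observation times; the last one is far from the rest.
times = [0.0, 0.5, 1.0, 5.0]
errors = [0.1, 0.1, 0.1, 0.1]
cov = covariance_matrix(times, errors, amplitude=1.0, length_scale=1.0)

# Setting the amplitude to zero recovers the usual diagonal matrix.
diagonal_only = covariance_matrix(times, errors, amplitude=0.0, length_scale=1.0)
```

Neighbouring points get large off-diagonal terms and distant ones get almost none, which is exactly the information a diagonal matrix throws away.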
Jupyter (IPython) Notebooks
This was the first time I actively used Jupyter Notebooks for writing Python code, and I was pleasantly surprised by the interactive features and formatted commenting. Perfect for small pieces of code and teaching/demonstration. However, I do have some questions/gripes (please let me know if there are solutions):
- can you import a package/module written in a notebook? Sometimes we end up with a notebook version for development, and then a standard python file for importing.
- can’t use all emacs commands, meaning I have to do more clicking with the mouse, which is why I tend to avoid interactive editors in general.
- how does one work collaboratively on the same notebook? Can git handle that?
To be fair, I have an old version of ipython notebook, so maybe these gripes no longer apply. I should talk to the Jupyter crew, one of whom I met at AHW.
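On the first gripe, one possible workaround: a notebook file in the current format (nbformat 4) is just JSON with a list of cells, so the code cells can be pulled out and run as ordinary Python. The sketch below assumes that format; the function name and the toy notebook are made up for illustration. The `jupyter nbconvert --to script` command does a similar conversion from the command line.

```python
import json


def notebook_to_source(notebook_json):
    """Join the code cells of a notebook (given as a JSON string) into one script."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not plain Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


# A toy notebook built in memory; a real one would be read from an .ipynb file.
demo_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A toy notebook"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["%matplotlib inline\n", "def double(x):\n", "    return 2 * x\n"]},
    ],
})

script = notebook_to_source(demo_notebook)
namespace = {}
exec(script, namespace)  # the notebook's functions now live in `namespace`
```

This only makes sense for notebooks whose cells define functions and classes; cells that do heavy work or plotting at the top level would run on every import.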
Parallel programming in Python
I had thought that parallel programming wasn’t really possible in Python: you could run code on multiple threads, yes, but not really on multiple cores. It turns out people do use multiprocessing for this, and now I need to look into mpipool as well. It could be useful if you have the mpiexec job launcher set up on your cluster.
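For the single-machine case, a minimal sketch with the standard library's `multiprocessing.Pool` looks something like this. The prime-counting task is just a made-up stand-in for any CPU-heavy function, and the worker count of 4 is arbitrary.

```python
import multiprocessing


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    limits = [20000, 30000, 40000, 50000]
    # Each task runs in a separate worker process, so the tasks can
    # occupy separate cores instead of sharing one interpreter.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(dict(zip(limits, results)))
```

The `if __name__ == "__main__":` guard matters because the worker processes may re-import the script, and without it they could try to start pools of their own.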
Natural Language Processing & Web-scraping
Despite being astronomers-by-trade, you’ll often find us talking excitedly about everything fascinating from outside our field. At a hack-week, we’re happy to give anything a shot. So after free dinner and drinks at GitHub HQ, we dreamt up the Happiness Hack (under a different name) and within 2 hours, created this.
It was going to end there, but the next day, we drummed up interest from the group and ended up extending the hack to grab** and analyse participants’ commit messages, as a bit of a joke, I guess, but here you go.
**beautiful-soup : holy crap!! So powerful, so beautiful…
Mock Turtle sings “beautiful soup”. Snippet of the drawing by Sir John Tenniel
Pair coding has been part of my life for the last few months, and I totally appreciate how it can really be more efficient despite the extra person investment. Just enough cooks. The small collaborations formed at both events worked wonderfully together, and several papers have been spawned. But really the big lesson, particularly from hacking at AHW, is that we benefit from learning to fail efficiently, because that sets us free to explore high risk projects. One person could hack away for weeks or months at an idea, while two or three people could declare it a lost cause in a mere day or two. Besides efficiency, this system prevents frustration and burn-out. Trying and failing was actively encouraged at AHW, and, better yet, demonstrated by senior participants.
Career transitions & Imposter Syndrome
Every time I meet with astronomers these days, the discussion turns to the process of leaving astronomy and imposter syndrome. The global community only really started talking about these on open forums about three years ago, and now it’s a recurring theme. At hack days/weeks, in particular, imposter syndrome is rife. Trying to prove your skills and worth and produce something spectacular on a short timescale is a recipe for mental health disaster. The pressure to dazzle with our hacking skillz certainly got to me back at dotAstro, but not as much this time, partly because the organisers made it a point to tackle the problem head-on (thank you!) and make the most of everyone’s diverse skill-sets, and partly because this time I knew better and put more emphasis on play and fun, and less on achieving goals.
So yeah, amongst the astronomy, statistics, computing, collaborating, hacking, and playing, I managed to learn a ton of stuff, see lovely places, and make new friends, which made the trip very worthwhile. My most important lesson, however, was:
Try not to doze off while on your laptop on the sofa near your colleagues, otherwise you end up with photos of creepy teddy bears watching you sleeping…