Polymers

This looks nasty, because the most obvious solution here is to build a map function that takes each pairwise letter and produces a new combination, and then combines all the resultant combinations, but I have a feeling that this is going to expand into memory hogging space really quickly. There's something about the question here that makes me wonder if there's an easier way to calculate teh totals, but I can't think of it yet.

So here we go with the version that is going to grow a lot over just 10 times.

Start with a string, and use a lookup table.

The example lookup table says something like "CN->C", but I suspect that we should think about the substitutions we want to make, so CN->CCN. That's a complete substitution.

However, as we run over something like NNCB, if we turn NN->NCN and NC->NBC and CB->CHB then we're going to end up with NCNNBCCHB with duplicated letters. Instead if we crop the first letter, we end up with NN->CN, NC->BC and CB->HB, we end up with "N" plus CNBCHB.

That's a much simpler mapping, and we can turn the initial table into that lookup form fairly easily I hope. Let's give that a try

Ok, let's try some expansion, so each iteration is equal to the pairwise mapping of letters, looked up in the substituion chart. Luckily python has us already covered for pairwise iteration of a collection, in itertools with the well named "pairwise" function

Let's do that on production data... gulp

More steps

As I thought, 30 more steps. I suspect this doubles in complexity with each step taken.

More importantly, if there are in fact around 2 trillion "B"'s in the string, then we're going to need around 2TB of ram to just hold the "B"'s in bytes, so that's not going to work.

We're going to need a different data structure and different algorithm, and I've not had the week to think about that I'm afraid.

Some overnight thinking

I've had some sleep, I've had a chance to think and more importantly, I've had a look at Day 15 and I'm excited to get to it, but I think I can solve this.

I was helped by contact in a slack I share, Joe, who said "You don't need to store the whole string to know how many letters there are".

This made me think that I've been thinking of this, somewhat deliberately as long strings like NBBNBNBBCCNBCNCCNBBNBBNBBBNBBNBBCBHCBHHNHCBBCBHCB. But as I said right back at the beginning NNCB is actually 3 pairs, NN, NC and CB. We do this already with the pairwise function, but the mapping I had was to turn NN-> NCN, and I optimised it because I didn't want to catonate all the strings together and have repeating polymers. But in fact if I think of the mapping as going from NN to ("NC", "CN"), what I've got is a list of pairs. These pairs don't matter about order, so I can just keep a count of the pairs at each iteration, and that will enable me to calculate the next iteration just by applying the mapping.

What gets hard now though is working out the maths for the duplicate atoms. If we take the 3 pairs NN, NC, CB that came from NNCB, we'd count 3 N's and 2'C and 1 B, which is too many because the N in NN and NC is shared and the C in NC and CB is shared.

Before I think about that, let's try the pairwise atom mapping and see if it give us the right answer for iteration 1 and 2. NNCB -> [(NN), (NC), (CB)] pair[NN] -> [(NC, CN)] pair[NC] -> [(NB, BC)] pair[CB] -> [(CH, HB)] So [NN, NC, CB] -> [NC, CN, NB, BC, CH, HB]

Our next iteration does the same mapping and turns us into [NC, CN, NB, BC, CH, HB] -> [NB, BC, CC, CN, NB, BB, BB, BC, CB, HB, HC, CB] that result can also be described slightly more succinctly as [NB:2, BC:2, CC:1, CN:1...]. That starts to compress fairly well because we're just keeping counts, and we know that the 2 NBs in the next round will result in 2 more NB's and 2 more BNs. Now our total number of pairs in iteration 2 is 12, but the actual length of the polymer is 13. Once we lose the order of the pairs, we completely lose the ability to reconstruct the polymer chain. But we don't care about the chain, what we care about is the actual counts.

Hmm, I'm going to have to cheat at this point. I cannot reason out a simple way in which I can keep sensible count of the duplicates as we go. Especially once you start getting to the fact we have 2 bc's in the second chain with no context as to what they are next to.

Off I go to one of my coworkers, Gerald Benischke who is doing the advent of code this year as an opportunity to flex his Haskell. What I like about Gerald's blogposts is that like me, he's trying to explain his thinking as well as the code. This days one is no different: https://beny23.github.io/posts/advent_of_code_2021_day_14/. Gerald mostly covers the how and why of currying and the haskell maps and some neat recursion tricks, but the most important part for me is this bit

So now that we’ve got pairs and their occurrences, we can combine them to count the letters. The idea was that now we’ve got the pairs, I would look at the first letter of each pair and count that.

To take this onto the small scale, if we look at NNCB, there are three pairs (NN, NC, CB) and if we look at the first letter of each pair AND add the very last letter, then we get back to NNCB. So similarly, when we count up all the pairs and occurrences, we can use this logic (first letter of each pair, plus 1 occurrence of the last letter) to create a count without needed exponential computing power and memory.

That's a really smart idea. I was stuck for a moment there on "How do I know what the last latter is", and then it struck me. The polymerisation process always inserts the new atom in the middle of the polymer, so NNCB can also be seen as N**B, and each of the stars will endlessly expand, but it will always start with an N and end in a B. A quick check of the test inputs shows that to be true for the test (although my day14 input has different letters there, so we can't hardcode N and B, but need the actual value from the template). Python lets us do template[-1] to get the last item from a string, so that's easy.

A word on "cheating"

I can feel some people asking "Surely it's cheating to look at someone elses solution". I don't see it that way, but let me explain why I don't and how Advent of Code can help us in our software development careers.

Advent of Code is a programming puzzle, for someone like me who spends 8 hours a day in meetings coordinating things, it's an opportunity to write code and make sure that I haven't forgotten what the experience is like. It's also a chance to try some logical and mathmatical puzzles that I'm not familiar with.

But it's also a lot like working as a software dveloper, and very few of us work well in isolation. I was pretty lucky that after about 5 years of my first software development jobs (touchscreen bingo, PS Vita and Xbox games, some finance trading software) I fell into one of the best and biggest agile development projects happening in the UK. I'd been a fan of agile software development from my first job and I'd read the books, but I got to practice it when I joined the Guardian newspaper. And joining a team of around 100 software developers who pair-programmed day every day taught me huge amounts about learning from others.

But what it taught me most of all is that there is no cheating when you rely on the knowledge of others. The business wants the code to work, and the best way you can understand a problem is to talk to someone who has dealt with it before.

For something like the Advent of Code, what constitutes cheating will vary for different people. I'm not doing it for time or points, I'm doing it to exercise my ability to implement code. If I copied and pasted code from another python person and said "don't know why it works, but it does", I'd be disappoionted in myself. But even if I'd copied code from another developer, if I had reasoned trhough how it works, looked at their implementation and thought "Yes, I understand how this works and I could write it myself now", then I've learned something.

All this to say, don't put artificial barriers in the way, especially for hobbies like Advent of Code. If you look at a problem and think I don't know how to solve that, you can look at wikipedia, at chat rooms or at other peoples stuff. You also don't need to continue in a language or method that isn't fun anymore just because you started that way. You might even implement something one way or in one language, and then go back and change it once you've got a wqorking solution to match you desired approach.

You do you.

Anyway, back onto the code. Let's make maps of the results, and then count the first letter of each pair, plus the final letter to get our counts

Ok, so we've got a way to turn a string into our counting pairs to start the whole process, and then we can iterate those based on our mapping.

Let's sort the counting problem now

Ok, that works, now lets give it a run with 10 iterations on the test data

Massive, lets give that a go on the real data and then for 40 counts as well.