TezTalks Radio - Tezos Ecosystem Podcast
118: Inside the Role of Chief Baker at Tezos Foundation, Chris Pinnock
This week on TezTalks Radio, host Brandon Langston is joined by Chris Pinnock, Chief Baker at the Tezos Foundation.
Baking is often described simply, but in practice it sits at the center of everything: consensus, signing, security, coordination, and infrastructure.
🔍 In this episode, we explore:
- What a Chief Baker actually does beyond the title
- Where pressure shows up first in the baking layer
- The kinds of risks and failure modes that matter most
- What separates a smooth week from a difficult one
- How BLS signature aggregation works in practice
- What changed when BLS moved from theory to live infrastructure
- How the Tezos Foundation structures and operates multiple bakeries
- The trade-offs between simplicity and resilience
- Real moments where the system was under pressure and what was learned
- How incident response works when speed and caution both matter
- The biggest shifts in baking over the past few years
- What challenges lie ahead as Tezos continues to scale
Welcome back to TezTalks. Today we're joined by Chris Pinnock, Chief Baker at the Tezos Foundation. Chris works close to one of the most important parts of the chain, the baking layer. That means consensus, signing, operations, security, and the infrastructure that has to keep working when the stakes are real. What I wanted to do with this conversation was get past the usual surface-level intro stuff and get into the actual work: what this role looks like day to day, where the pressure shows up, how the foundation's own bakeries are run, what BLS changes in practice, and what all this looks like from the inside when things aren't just theory anymore. Chris, thanks for taking the time.
SPEAKER_02: No, no, no problem. No problem. But unfortunately, though, the licensing hours meant that you were having a Diet Coke, probably.
SPEAKER_00: Well, you know, sometimes I used to work nights, so first call isn't uh too out of the question some days. Now, before we get into the deeper stuff, I want to start at the ground floor. You've got one of those titles that instantly makes people curious. Chief Baker. Once the novelty of the title wears off, what does that actually mean day-to-day? What does your world look like in practice?
SPEAKER_02: Yeah, so it's not, it's not my real title, of course. Um, my real title is head of IT, which is very boring. Um, it's much more fun to go around wearing a baker's hat. Um, I'm beginning to get confused for pastry chefs, as we've discovered many times at TezDev last year. I don't know what's going to happen next year, but last year the uh the catering staff came over and asked me about the pastries. Um so yeah, it's uh it's a silly title, really. Um what it means is I run the foundation bakers. We have eight bakers, and uh we have about 15% of the stake, I think, off the top of my head, and we're securing the network by running the bakers and um keeping that stake staked. So that's what I do. That's part of what I do. I mean, uh joking aside, I normally have about four or five jobs depending on the season, but um I also run security audits uh for Tezos Foundation, uh including for our partner companies. Um I run the IT, so basic file and print in the good old days. Um, and then the baker job, and then also I've done all sorts of things over the years. I was the Twitter handle for about four months for both Tezos and Tezos Foundation. I was the guy that did all the @s and hashtags, stuff like that. But yeah, just generally, you know, whatever they give me, I don't mind. I love it, I love it.
Where Validator Pressure Really Shows
SPEAKER_00: So I like starting there because people hear baking and they can easily reduce that to a simple label, as you've said. I mean, how are your mince pies? I bet. When in reality this is live infra, right? There's a lot riding on it. Yeah, so when you think about the baking layer as something you're responsible for in the real world, not just on paper, where does the real pressure tend to show up first?
SPEAKER_02: Well, okay, so baking is our name for validating, right? And if you were talking to us, what, five years ago, we'd probably call it mining. Uh but we're on a proof of stake chain, so the term becomes validator, right? Um I'll be honest, the pressure isn't particularly high actually. You install the software and it kind of works most of the time. Uh I upgrade the systems when I need to, uh things keep running, and it's relatively fault tolerant. So if something falls over, um you can just restart it at worst case. Um, I've had one or two nodes get themselves into a little bit of a panic over the years, but you just restore them and away you go. So it is a pressure-free environment, to be quite frank. I mean, uh you probably didn't notice, but half an hour before this call, I updated the DAL nodes for the foundation bakers.
SPEAKER_00: So well done. I didn't. These things happen magically, apparently. Yes. So the baking layer, it secures real value and has to hold up under real conditions. Can you tell us about a specific challenge or moment where the baking layer was actually under pressure and how you responded?
SPEAKER_02: Yeah, we had um, we did have a few um times about two years ago when the bakers just started to miss. And it turned out that, I'm not blaming him because he probably did it for a very good reason. But the previous Chief Baker, it's a bit like Doctor Who, you know, after six, you can get anyone to do it. Um but basically yeah, the previous Chief Baker had somewhere set the number of connections on the node to 500. And so you can restrict the number of connections on the node, and you're building up a table of peers, and so the default number is usually good, unless the network's small, like on a testnet. And actually, if you want to go, you could probably choose 20, something like that. But this number got set to 500, and so it was maintaining quite a large table of every other node. So that started to be a problem. That was about it. Um we haven't had any signing issues, we haven't had any speed issues with that. Um, and we did have quite a complicated setup to sign a block because it has to go through a number of layers, it has to go through a network load balancer and uh an API, and then it hits a hardware security module, but it does all that within milliseconds. Um, so no, we've never had any particular issues with that. We've had one or two software issues, but normally fixed straight away. But yeah, nothing serious.
Double Baking And Key Risk
SPEAKER_00: Touch wood or wood. Every consensus mechanism has failure modes people worry about. From your seat, what kind of threat or class of risk do you respect the most when it comes to Tezos baking?
SPEAKER_02: Um well I suppose someone could come along and, if there were enough people, they could come and run their own software and fork it, but our governance process kind of prevents that because our bakers have a way to vote for what they want. I mean, the biggest problem we've got, uh the biggest problem is if you double bake, of course, and that's where you produce two blocks. So every baker is scheduled to bake a slot, uh unlike say Bitcoin, where everyone piles in and tries to mine the block, right? And the winner is the one that gets the reward. But um you have a slot, and so if by some infrastructure failure you had produced two blocks at that slot, you might produce two different blocks, and one of them might have a transaction in it, which doesn't get added to the chain straight away. So, and there's all sorts of other problems as well, you can get into all sorts of other issues there. So, double baking is the worst thing you can do, and the way you protect yourself from that is you have your keys in one place and you protect them with a remote signer that understands the block number. So it says, I've signed this block before, I'm not signing it again. So that's the biggest, um, key management is the biggest thing that a baker faces. It's got nothing to do with infrastructure. Everyone goes, oh, you know, if you go and talk to a cloud provider, maybe the search engine provider, maybe the bookstore, um, maybe the seller of operating systems, they all go, we can put an AMI in our repository, and everyone can have your software. That's not the problem. You can install the software in under 10 minutes. I did a video of that a couple of weeks ago. It's key management, keeping your key safe, uh, preventing double baking, and also preventing people taking your key. So that's really the bread and butter of baking.
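The "I've signed this block before, I'm not signing it again" behaviour described above is often called a high-watermark check. What follows is a minimal sketch of that idea only, not the Foundation's signer or any real signer's code. The class name `WatermarkSigner`, the `kind` labels, and the use of a (level, round) pair to identify a slot are all assumptions made for illustration, and the returned string is a placeholder rather than a signature.

```python
from dataclasses import dataclass, field


class DoubleSignRefused(Exception):
    """Raised when a request is at or below the stored watermark."""


@dataclass
class WatermarkSigner:
    """Toy remote signer that refuses to sign the same slot twice.

    Hypothetical model: `kind` is a label such as "block" or "attestation",
    and a slot is identified by (level, round). A real signer would parse
    these values out of the bytes it is asked to sign.
    """

    watermarks: dict = field(default_factory=dict)  # kind -> (level, round)

    def sign(self, kind: str, level: int, round_: int, payload: bytes) -> str:
        last = self.watermarks.get(kind)
        # Tuples compare lexicographically: level first, then round.
        if last is not None and (level, round_) <= last:
            raise DoubleSignRefused(
                f"{kind} already signed at {last}, refusing {(level, round_)}"
            )
        # Record the watermark before releasing anything. A production signer
        # would persist this durably so that a restart cannot forget it.
        self.watermarks[kind] = (level, round_)
        # Placeholder result, not a real cryptographic signature.
        return f"signed:{kind}:{level}:{round_}:{len(payload)}"
```

With one such signer holding the key, two bakers that both believe they own the same slot cannot both obtain a signature: the second request for the same level and round raises `DoubleSignRefused`.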
SPEAKER_00: Now, one of the reasons this topic matters is BLS. People hear that term here, but most listeners probably don't have a clean mental model for what it means in practice. So let's start there. BLS signature aggregation. It's a big part of what makes large-scale participation more workable, right?
SPEAKER_02: Well, so the explanations I can give you is the bloke down the pub explanations, despite, you know, as we were just saying earlier, I'm a chartered mathematician as of half an hour ago, but the detail here is something that's underneath what I can tell you about. But uh we have this roadmap called Tezos X, and Tezos X is really about making the layer one um fast and simple, like small block times, fast settlement, um highly secure, so making sure all the blocks are what they say they are, and then moving up to a layer two where you've got the intelligence. And this is not a new idea. If you look at computer systems all across, you know, since time began, people simplify and they do these things. And the classic from my background, because I'm a bit old, is MPLS networks. So around about 2000, I worked for a company that was rolling out probably one of the first MPLS networks in Europe for the IP layer, and what that was doing was making a very fast core switching packets and putting all the intelligence on the edge. If you think now when you do your Netflix stream to your home, you're actually talking to an edge location, it's got all the content on it, and you know, the core is switching it all around and pushing it everywhere, but you're dealing with something close to you. So that's kind of Tezos X in a nutshell. I'll probably get told off for that. But that is one of the ways of looking at it. So, you know, the smart contracts going forward are more likely to live in the layer two and execute there, but the settlement will always be in the layer one, as will the staking and the baking. So um we need better ways to improve our speed, and one of the ways to do that is to use the BLS keys uh to sign attestations, and these have all sorts of properties that allow you to do lots of fast stuff, uh right, in layman's terms.
So you've got the aggregated attestations, you've also got ways of squashing down the amount of information that you have to transfer around the network, which obviously then makes it faster. So by switching to these addresses, we will be able to go faster and simpler. So that's the good news. Do you want the bad news?
SPEAKER_00: If it is indeed bad, if you're allowed to share, at least you know.
SPEAKER_02: No, no, it's not really bad news, it's a slight um, I suppose not a disadvantage, is I suppose the BLS standard is quite, I don't think there is a standard actually, I think it's quite cloudy, so you've got slight implementation differences. So at the moment there's not actually a hardware security module, as far as I know. There might have been one popped up in the past three months. There's not a hardware security module that uh will sign with a BLS key. I think a Ledger is capable of doing it, but it's quite slow at doing it. So you can't use that as a backing store for a baker that needs to be fast, because that's what you're trying to do. Um, so you've got to change your strategy about how you store the key. And so rather than using a hardware security module, you're now using a secure compute environment where you're holding the key and it's encrypted and no one can get in, and the only door in is via a remote signer that answers on, say, one port. So that's the slight disadvantage of it. But then the advantage of that is, because it's on a compute instance, it's faster. So you haven't got to go all the way into an HSM now, you're actually just going in and out of a um secure compute environment, which is going to be a lot faster. So some good news, some bad news, some indifferent news.
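For readers who want one step beyond the pub explanation, the aggregation property can be written down in a few lines. This is a sketch of the textbook same-message case only. It assumes the common arrangement where public keys live in a group G1 with generator g_1, signatures live in G2, H hashes a message into G2, and e is a bilinear pairing; it is not a description of the exact scheme Tezos deploys. Plain summation of keys like this is only safe when signers also prove possession of their secret keys.

```latex
\begin{align*}
pk_i &= sk_i \cdot g_1, & \sigma_i &= sk_i \cdot H(m) \\
\sigma &= \sum_{i=1}^{n} \sigma_i, & pk &= \sum_{i=1}^{n} pk_i \\
e(g_1, \sigma) &= e\Big(g_1, \sum_{i=1}^{n} sk_i \cdot H(m)\Big)
  = e\big(g_1, H(m)\big)^{\sum_{i} sk_i}
  = e\big(pk, H(m)\big)
\end{align*}
```

The last line is why aggregation saves work and bandwidth: n attestations of the same message collapse into one signature, checked with a single pairing equation against the summed public key.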
Foundation Baker Setup In Practice
SPEAKER_00: Okay. So now we've got the cryptographic piece on the table. The other side is the actual operational footprint. Because the foundation's not thinking about baking as one machine sitting in a room, right? No, so I want to talk about the foundation's own bakeries, if we can. Oh, it's top 10. Okay. We'll just move on. No, I'm kidding. When people hear that the foundation runs multiple bakers, what are they usually missing about how that setup actually works?
SPEAKER_02: Um, well, I mean, so we have eight bakers, we have four sites. Uh, so at those four sites, we have an HSM cluster. So Amazon have a nice single tenanted uh hardware security module product that passes all sorts of security things, that means it's single tenant and no one else can get on it. As opposed to perhaps some of the other key management solutions where you're 99.9999999% sure that no one else can get on it unless there's a massive software failure, and then everyone can get on it. Um, but if you're trying to pass an audit like we are, because we're a Swiss foundation, uh, we want something where you've got absolute certainty. So in those four sites, we have hardware security module clusters, they're spread across different availability zones, which essentially means they've got a geographic resilience to them. They're in the same area, but they're apart, they're all backed up to each other. Um, the system backs them up. You can restore that backup in any location in the world. It's a lovely product, CloudHSM. And um, then in front of that we have a remote signer and we're using Signatory. We did use our own, but we switched to Signatory about uh about a year ago um to simplify our life. And the team at ECAD did a good job helping us with that, um, put a few features in for us. Um, and then that all lives in a very secure Amazon account, and anyone that's done an Amazon um associate examination will go, yay, because there's actually an exam question on how you connect a very similar setup. So the baker is living in a different account to the keys and it's connected over the AWS backplane, it doesn't go over the internet and it's highly secure, and um all it can do is sign blocks. Um, or if we do planned work, we can go in uh in a four-eyes fashion.
So basically we have a supervision layer and someone supplies a second code to break the glass to get in, and we can do things like transferring funds, but no one person can get at the funds, which is a key, key thing. Um so yeah, it is complicated, and also in front of the baker we have front nodes. It was a feature that Nomadic did especially for us, I think, back in 2018. So you have a node, and then our baker can live in a private network uh and not have an, you know, internet connection at all, really, other than a private one to the nodes. We did a similar setup for an Etherlink operator as well. So it's possible to run an Etherlink operator in the same way in a private network and just talking to nodes that are on the network, and there's nothing special about those nodes.
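The four-eyes rule described above, where no one person can reach the funds, can be expressed as a very small check. This is a sketch under stated assumptions: the class `BreakGlassRequest` and its fields are hypothetical names, and a real supervision layer would authenticate each person and log every step rather than trust plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class BreakGlassRequest:
    """Toy model of a privileged action that needs a second person."""

    requester: str
    action: str
    approvers: set = field(default_factory=set)

    def approve(self, person: str) -> None:
        # The requester cannot act as their own second pair of eyes.
        if person == self.requester:
            raise ValueError("requester cannot approve their own request")
        self.approvers.add(person)

    def is_authorised(self) -> bool:
        # Authorised only once at least one other person has signed off.
        return len(self.approvers) >= 1
```

The design choice worth noticing is that self-approval is rejected outright rather than silently ignored, so an attempt to bypass the second person is visible as an error.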
Upgrade Night Incident In Miami
SPEAKER_00: Well, you've been close to this for a while now, so I want to ask this in the clearest way possible. Tell me about a time when the system was under like I don't know, like was there was there any real issue with like I mean that you can talk about?
SPEAKER_02: Um there was only one real issue that I ever had, um where it was a massive problem, and I was in Miami. So timing. Yes, so um there is a little joke internally about um protocol upgrades and when I go on holiday. So first of all, I've got my holiday, then Jan's team inject a protocol, and the date is always on my holiday. The last two actually we've missed. We've missed holidays, so there's no bad omens here. Um I was in Miami and we'd had the upgrade, and the day before I'd sat on Miami Beach and watched the sunrise, and that evening the upgrade had gone really smoothly, and then I moved house from the beach into the town to go and watch some baseball, and I was just sitting there, and all of a sudden, all these alerts start coming in, and everything's wedged. Um and I had to redeploy everything. I really had to, you know, I couldn't get into anything to restart it. I had to just sort of brute-force stop it. And it turned out to be um a bug in the accuser. So the accuser sits there running, uh looking at the network and looking for double baking, looking for double attestations, and if it sees them, it puts out an operation to tell the baker to register that, and then things happen, slashing happens. And the accuser from the previous protocol had just started to sort of go like this, loop on the CPU, and the memory footprint was going like that, and it just grew. And in quite a lot of compute environments now, like a Docker environment or something where you might not have swap or enough memory, it just sort of can wedge, and that's what happened. So fortunately, it was a simple fix uh to turn the accuser off for the previous protocol, um, which you should really do, but you know, you want to avoid restarting and stuff around that time, so we left it. But yeah, that was probably the worst thing that happened.
SPEAKER_00: No, to me anyway. There are always situations that don't become disasters, but absolutely could have been bigger problems. Was there ever a near miss or an issue you caught early that taught you a lot?
SPEAKER_02: No. I mean, what a clean ship you guys run over there.
SPEAKER_00: I know I know this is not interesting, but it's it is though. It is because I mean it you expect you expect there to be a story, you know, like oh just what every time Jeff down the road, he was late, so he went and got donuts because he was late. No, no, no.
SPEAKER_02: We don't need donuts. Um no, I mean the Octez design is relatively um resilient. I mean, so you have your node that runs, you have your DAL node now, which runs, talks to the node, does stuff, and then you have your baker, and that normally has a key attached to it, right? And a wallet attached to it, and that's a client of the node. So the baker can go down temporarily, come back again, connect to the node. Um and similarly, the smart rollup daemon as well, which we have for things like Etherlink. It's the same, it can go down for a bit, come back up. If you've got three front nodes, you know, two of them can go down and one can carry on helping your baker. So yeah, I haven't had too much catastrophe. As I say, there was this 500 connection problem, which was a bit of a surprise. Um and then there was this problem where, you know, there was something memory leaking because it was spinning basically. But nothing else I can remember really.
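The point about three front nodes, where two can go down and one carries on, amounts to a simple failover rule on the client side. The sketch below shows only that rule. The function name `pick_front_node` and the `is_healthy` callback are hypothetical; how a real baker selects or reconnects to a node is not described in the conversation.

```python
from typing import Callable, Iterable, Optional


def pick_front_node(
    nodes: Iterable[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first node that passes the health check, or None."""
    for node in nodes:
        try:
            if is_healthy(node):
                return node
        except Exception:
            # A health check that blows up counts as an unhealthy node.
            continue
    return None
```

As long as at least one node in the list answers its health check, the caller gets a usable endpoint; only when all of them fail does it get `None` and have to wait and retry.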
SPEAKER_00: Well, uh despite, you know, the stability there, baking has changed a lot over time. Looking back, what's the biggest shift that surprised you either technically or operationally?
Tenderbake Shift And Signer Work
SPEAKER_02: Well, um the biggest upgrade I did was probably the second one that I had during the job, I think it was Ithaca, uh where we switched from the Emmy consensus mechanism to Tenderbake. So again, high-level pub explanations of Emmy. So Emmy is, I think we were using Emmy+ actually, but it's similar to the uh algorithm in Bitcoin. So you need five blocks to get finality, which is why on your Bitcoin you've got a time of what, 10 minutes, and that means 50 minutes, you've got a settled transaction, you know it's settled. So we switched to Tenderbake. Now Tenderbake allows you to do all that complexity, but with just two blocks. So if you think about that situation, you've got a block that's been baked but still sort of uh under a bit of contention, maybe there's another one that will come in, and then the next one comes along and it's fixed the previous one. So um if you've got a block time of 10 seconds, then your settlement time is 20 seconds for Tenderbake. Uh our block time, I think we're heading towards, you know, six seconds, round about that. So your settlement time is 12 seconds for a transaction. Um, when I go to the supermarket near me to buy a bottle of wine, which is very rare these days, you know, it takes me longer than 12 seconds to buy it. Right? So transactionally, we're very fast, and that's going to continue to improve with the um BLS stuff. I digress, but the biggest problem with that was we had to change our remote signer software to understand the new messages because this is when um pre-attestations came in as well. So you had pre-attestation, attestation, baking, and I had a guy working for me called Roland, um, who went through and fixed all that code up. So we had a period of time where our signer would work with Emmy+ and also with Tenderbake.
So that's probably the biggest change that we made because we had to do a lot of tests, you know, double baking testing and stuff like that to check the signer was resilient. Um, but yeah, it's probably the most horrific change we've had to make, other than BLS. Talking of which, that's a PagerDuty.
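The settlement arithmetic in this answer is just block time multiplied by the number of blocks needed for finality. The short sketch below reproduces the figures as quoted in the conversation; they are not independently verified here. The function name `settlement_seconds` is a hypothetical helper introduced for illustration.

```python
def settlement_seconds(block_time_s: float, blocks_to_finality: int) -> float:
    """Time until a transaction is considered settled, in seconds."""
    if block_time_s <= 0 or blocks_to_finality <= 0:
        raise ValueError("block time and block count must be positive")
    return block_time_s * blocks_to_finality


# Figures as quoted in the conversation, used here purely as inputs.
examples = {
    "Tenderbake, 10 s blocks, 2 blocks": settlement_seconds(10, 2),
    "Tenderbake, 6 s blocks, 2 blocks": settlement_seconds(6, 2),
    "Bitcoin-style, 10 min blocks, 5 blocks": settlement_seconds(600, 5),
}

for label, seconds in examples.items():
    print(f"{label}: {seconds:g} s")
```

The three cases come out at 20 seconds, 12 seconds, and 3000 seconds (50 minutes), matching the numbers Chris gives.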
Trade Offs Between Uptime And Safety
SPEAKER_00: Do you want to take care of that real quick? Feel free, sir. Well, sir, by this point, you've seen the job from inside the machine under pressure over time. I mean, that's where the more personal side of the conversation starts to matter. There's always some kind of trade-off in systems like this. When you're thinking through difficult calls in this role, what trade-off do you find yourself wrestling with most often?
SPEAKER_02: Well, the classic trade-off is uh with the Tezos setup is the um the avoiding double baking. So the first thing, I mean when I joined the um foundation in July 2021, um I went to meet Arthur and we played, I've told this story 500 times. The first thing Arthur said to me was don't double bake, right? Um and he's seen strings, as have I actually, seen strings of people come in, set up this lovely resilient setup, you know, seven bakers and you know 52 remote signers and 4,000 nodes. The problem is, if at any point those bakers become isolated and they think they're the primary one, and you haven't got a good remote signer set up, you double bake. So you don't actually need to be that resilient because in some sense it's better to go down for 10 minutes than to double bake. You might miss some blocks and rewards, but it's better for you if you don't double bake. So the only trade-off is you might not have the resilience you might expect at a different uh a different site where you can, now you can run two bakers, to be absolutely clear. My setup, I can run two bakers, and my remote signer will prevent double signing, but in practice, the cost, yeah, I'm doubling the cost there with two machines, it's not giving me much.
SPEAKER_00: Well, Chris, listen. I gotcha. Well, looking back over the last few years of baking infrastructure, if you could go back and change one thing earlier, what would it be and why?
SPEAKER_02: Elastic Block Store. So we were running, since 2018, with directly attached NVMe disks, which are the fastest. And I think back in 2018, that was the way to do it. That was the best way to get a fast disk on an AWS system. Now Elastic Block Store is horrifically fast. The NVMe disks are still faster than what we use, but we're using gp3, and it's fast enough to run a node. Now, by doing that, that meant I could switch from a um an M6D instance to something a bit cheaper, and use the EBS storage. So that saved a load of cost as well. And um, I sort of wish I'd done that a bit earlier, but no regrets.
SPEAKER_00: Now, was there one thing that it just might have taken you a little too long to learn?
SPEAKER_02: Um that's a very pointed question, sir. I've probably taken too long implementing BLS keys, actually, because I've been talking about doing it for about a year. Um, we haven't needed to do it straight away, um, but it's taken us a little while, I say us, it's taken me a little while to figure out um a good setup for it. And I've been working with a guy from inference, uh company inference, that used to work for me at TF uh a couple of years ago. So we've come up with something where we can secure our BLS key using an HSM. And basically we encrypt the keys and then we can pull them out using the HSM. So you get the benefits of having the HSM protection. Uh, and then we pull them into a secure compute environment that no one can touch. So that's taken a bit longer than I would like, but it's not straightforward, it's just picky work to get right.
Closing Thoughts On Real Responsibility
SPEAKER_00: Yeah. Well, Chris, this is great. I appreciate you taking the time to walk through not just the technical side, but you know, how you kind of felt through all this. Now, for people listening, this is one of those conversations that helps you understand that baking's not just about passive background processes. There's a lot of real decision making, real coordination, and real responsibility under the surface. Chris Pinnock, Chief Baker at the Tezos Foundation. Thanks again for joining us.
SPEAKER_02: Thanks for having me. All the best.