WEBVTT 00:00.000 --> 00:16.560 All right, for those of you who may know me, and I know that there are a few in the audience 00:16.560 --> 00:20.960 who do you know that I am, in fact, not Alfonso. 00:20.960 --> 00:25.240 Unfortunately, Alfonso couldn't make it, so I'm jumping in for him. 00:25.240 --> 00:30.520 So today, we're going to be talking about random seats and state machines, and basically 00:30.520 --> 00:34.880 I'm going to show you how we employ in something called deterministic simulation testing 00:34.880 --> 00:37.120 at polar signals in rust. 00:37.120 --> 00:39.160 We've actually done this before and go as well. 00:39.160 --> 00:42.640 If you're interested in that, you can come and talk to me afterwards. 00:42.640 --> 00:48.120 But long story short, we ended up building a new database. 00:48.120 --> 00:52.640 And if you're interested in that, you can check out Tours Talk from yesterday in the database 00:52.640 --> 00:54.760 room where he talked about, you know. 00:54.760 --> 00:58.800 The decisions that went into that, the technologies, and so on. 00:58.800 --> 01:02.760 If you don't know at all what deterministic simulation testing is, that's okay. 01:02.760 --> 01:06.360 We'll go over everything. 01:06.360 --> 01:11.080 But before we dive into the details, I want to kind of set the setting why this is interesting, 01:11.080 --> 01:13.200 why this is useful. 01:13.200 --> 01:17.800 And for that, we do actually have Alfonso joining us. 01:17.800 --> 01:23.760 So basically, a couple of years ago, Alfonso worked at a different database company. 01:23.760 --> 01:27.600 We're not a database company, but we happen to build one. 01:27.600 --> 01:36.080 And there is a bug that he had been assigned to that was incredibly hard to reproduce. 01:36.080 --> 01:40.920 It only ever produced and very, very large machines over a very long period of time. 01:40.920 --> 01:47.920 So, you know, we probably all have been in this kind of situation in our careers that we had 01:47.920 --> 01:51.200 bug reports that were very, very hard to reproduce. 01:51.240 --> 01:56.080 So, that's ultimately kind of the setting of all of this. 01:56.080 --> 02:03.680 And Alfonso was trying to reproduce this with really expensive machines. 02:03.680 --> 02:10.200 And he thought, you know, I can just add a little bit of latency here and there and I'll 02:10.200 --> 02:11.800 tickle out the bug. 02:11.800 --> 02:15.840 And he thought he reproduced it, but it turns out that was actually not it. 02:15.840 --> 02:20.760 It was just a fluke, and it could not really reproduce it. 02:20.760 --> 02:25.200 And so, you know, time goes by, you can reproduce this bug. 02:25.200 --> 02:28.800 And, you know, other priorities come up. 02:28.800 --> 02:33.600 You can still not reproduce this, so eventually, you know, you just kind of kind of give up. 02:33.600 --> 02:36.760 And your organization says, nobody can reproduce this. 02:36.760 --> 02:40.480 We have other work here, so I'd rather have you work on this other thing. 02:40.480 --> 02:44.680 So, ultimately, you know, the way we can think of this is, there's this bug down here, 02:44.680 --> 02:49.880 the red dot, and there's this, you know, very high dimensional state space that can 02:49.880 --> 02:54.960 possibly reach this bug, or possibly not, right? 02:54.960 --> 03:03.240 And so, there are a gazillion combinations of things that may cost this bug, but ultimately 03:03.240 --> 03:10.120 what we want is one reproducible path that will always deterministically lead to this reproduction, 03:10.120 --> 03:11.120 right? 03:11.120 --> 03:15.480 That is the wish that we all have that we can actually reproduce these really difficult 03:15.480 --> 03:18.240 bugs reliably. 03:18.240 --> 03:23.200 And so, the true problem about all of this is randomness. 03:23.200 --> 03:24.600 The world is random, right? 03:24.600 --> 03:27.200 Like hardware fails randomly. 03:27.200 --> 03:30.440 Users behave in random ways. 03:30.440 --> 03:36.640 There's a bunch of things that is completely out of our control, but it is a reality that 03:36.640 --> 03:38.200 we have to work with. 03:38.200 --> 03:44.280 And so, this is the reason why these things are so freaking hard to reproduce, right? 03:44.280 --> 03:50.400 And then even if we do reproduce, sorry, if we do, if we do manage to find the fix for 03:50.400 --> 03:58.120 this bug, we still can definitively say that we have actually, in fact, this bug, right? 03:58.120 --> 04:03.040 If we can't actually reproduce it, how can we ever say for sure that we have, in fact, 04:03.040 --> 04:04.480 fixed it. 04:04.480 --> 04:07.320 And these bugs can be really expensive, right? 04:07.320 --> 04:12.680 We saw a funzer earlier, he spun up, you know, 16 of these really expensive machines, 04:12.680 --> 04:17.200 and the company happened to allow him to do this because it was a very high priority bug 04:17.200 --> 04:18.200 at the time. 04:18.200 --> 04:23.320 And so, companies and, you know, individuals end up spending a lot of money on bugs as 04:23.320 --> 04:26.480 well, potentially unnecessarily. 04:26.480 --> 04:31.800 And eventually, this can cause, you know, unhappy customers, unhappy users of your software, 04:31.800 --> 04:38.040 if you simply cannot, can never fix these bugs because you can reproduce them, and ultimately 04:38.040 --> 04:39.040 fix them. 04:39.600 --> 04:48.400 So, deterministic simulation testing is kind of the, one of the answers that we believe 04:48.400 --> 04:51.960 can help with these problems. 04:51.960 --> 04:57.280 And this was originally kind of popularized by FoundationDB, and the way that FoundationDB 04:57.280 --> 05:03.080 kind of went about this is they built this actor framework in C++, where they, and they built 05:03.160 --> 05:09.480 this framework specifically such that they built and accompanying simulator that they could 05:09.480 --> 05:14.760 simulate all of these failures and, and, and ran them way, it's, right? 05:14.760 --> 05:21.760 And so, that spark kind of, a number of companies to try and go up and do this themselves, 05:21.760 --> 05:27.000 Tiger Beatles, fairly popular for doing this, but there are a number of companies that maybe 05:27.000 --> 05:31.960 we haven't even heard of the, the storage team at Dropbox actually did this in the early 2010s 05:32.040 --> 05:33.360 already as well. 05:33.360 --> 05:39.760 A bunch of other companies have started started to do this, and a quick shout out to 05:39.760 --> 05:41.200 businesses as well. 05:41.200 --> 05:46.360 They basically, when one step further, they're the folks that originally did this with 05:46.360 --> 05:50.240 FoundationDB, and they've built a deterministic hypervisor so that you don't actually 05:50.240 --> 05:56.160 have to care about any of this in your application, and you can kind of try and explore 05:56.240 --> 06:01.800 this state space, the deterministically using their hypervisor and then also replay it 06:01.800 --> 06:02.800 deterministically. 06:02.800 --> 06:06.040 It's mind-blowing technology. 06:06.040 --> 06:13.400 Anyways, we're far too poor to pay for, for, for businesses, so we thought, you know, if 06:13.400 --> 06:20.840 we're already building something from scratch, we think we can build a simulator for ourselves. 06:20.840 --> 06:26.920 And so ultimately, there are four ingredients of randomness that we end up needing to control. 06:26.920 --> 06:32.800 The first one is event ordering, and ultimately what this means for us, and this is also 06:32.800 --> 06:38.960 one of the reasons why we ended up choosing Rust for our new database, is we need to be 06:38.960 --> 06:48.040 able to deterministically genuinely, deterministically execute an order of events in our system. 06:48.040 --> 06:54.840 And what that means is concretely on hardware is that for our simulator and our simulation 06:54.840 --> 06:59.320 test suite, they only to run on a single thread, because the moment that we go multi-threaded, 06:59.320 --> 07:03.120 we're already not deterministic anymore. 07:03.120 --> 07:07.240 And so, you know, in the very concretely what this means for our Rust application is that 07:07.240 --> 07:17.600 we use the like single-threaded Tokyo executor to make that happen. 07:17.600 --> 07:22.360 And then randomness, you know, very concretely, this can be random number generators, 07:22.360 --> 07:25.440 but also I'll go into this a little bit later. 07:25.440 --> 07:29.560 They're like implicit things that, you know, cause randomness as well. 07:29.560 --> 07:35.720 Time is a really difficult one, and this is one that we found was particularly difficult 07:35.720 --> 07:40.960 to control if it's not something that you think about from day one of your application, 07:40.960 --> 07:49.120 because, you know, if you end up calling, you know, time, time now somewhere in your program, 07:49.120 --> 07:50.560 you've kind of already lost, right? 07:50.560 --> 07:55.480 Like you need to be able to kind of see the time throughout the entire program. 07:55.480 --> 08:00.520 And then the very last thing, and this is ultimately how you end up really tickling out 08:00.520 --> 08:07.720 all the bugs of your code, you can randomly inject failures into in this simulator at any 08:07.720 --> 08:09.520 point in time. 08:09.520 --> 08:14.640 So once we can control all of these, we can actually build a test suite that we run 08:14.640 --> 08:19.520 against our system, we check some invariants about our system, and we can, you know, randomly 08:19.520 --> 08:21.960 control what happens. 08:21.960 --> 08:25.920 And so how do we structure our code to make that happen? 08:25.920 --> 08:31.320 So ultimately what we do is we have to relatively simple trade, the state machine, that 08:31.320 --> 08:36.080 implements the receive and the tick function. 08:36.080 --> 08:41.840 And ultimately, receive can be, you know, a response to any request that any other state 08:41.840 --> 08:49.800 machine may have produced, or a response to a request that a state machine has done. 08:49.800 --> 08:54.600 So, you know, let's say the state machine wants to do a request to some external system 08:54.600 --> 08:57.240 or something like that. 08:57.240 --> 09:05.480 So that way we can kind of control everything that happens external to the particular component. 09:05.480 --> 09:10.960 And tick, you know, either is there for periodic work generation. 09:10.960 --> 09:16.640 So, like, you know, some component does something on a schedule, but also, you can use it 09:16.640 --> 09:20.240 to kind of trigger timeouts and stuff like that. 09:20.240 --> 09:25.280 And the reason why we want this to be kind of separate is so that we can actually control 09:25.280 --> 09:27.200 this time component, right? 09:27.200 --> 09:29.680 And we need to always think about that these are state machines. 09:29.680 --> 09:33.680 So, like, every time that we call any of these functions, we mutate the state machine in 09:33.680 --> 09:38.760 some way, and output some messages to be consumed by others. 09:38.760 --> 09:47.040 And so that is actually something that Tyler Nealy of this led project, originally kind 09:47.040 --> 09:52.040 of came up with this trait for sled as well. 09:52.040 --> 09:57.280 And so we kind of just tried this out and happened to work quite well for us as well. 09:57.280 --> 10:00.720 So again, kind of taking a quick step back, why do we do this? 10:00.720 --> 10:07.920 So, we had two talks, earlier this day, today, in the rest room, we basically build a 10:07.920 --> 10:13.720 profiler and we built this database as a back end storage for this profiler. 10:13.720 --> 10:16.720 And then yesterday, we also had a talk in the database room, which happened to be in this 10:16.720 --> 10:21.960 room as well, that, you know, talks a little bit more about the architecture. 10:21.960 --> 10:25.120 But this talk is really just about how do we test this database, right? 10:25.120 --> 10:29.560 So, I just wanted to give you this in case you're interested in why do we do all of this. 10:29.560 --> 10:35.440 But on a very high level, the way that our system looks is that we have clients, clients 10:35.440 --> 10:40.280 push data, we have a component that we call the ingestor, that kind of buffers up data, 10:40.280 --> 10:43.280 and then ultimately flushes it to object storage. 10:43.280 --> 10:49.320 And then synchronously, there is a compaction component that regularly compacts data. 10:49.320 --> 10:53.760 Again, this isn't like the real, they're like a handful more components in the real system, 10:53.760 --> 10:58.480 but conceptually, this is how the system works, right? 10:58.480 --> 11:07.080 And so, the way that this end works in production and in the simulator is as follows. 11:07.080 --> 11:09.840 So, the simulator is essentially just a message bus. 11:09.840 --> 11:18.920 We saw that with our state machine interface, everything consumes and produces messages, right? 11:18.920 --> 11:26.080 And so, these are just messages that then go onto this, are cute into this message bus. 11:26.080 --> 11:30.000 And all of the components that we see here are just state machines. 11:30.000 --> 11:37.200 And in production, we have a separate driver for these state machines that actually take 11:37.200 --> 11:43.760 on a real interval, actually work with real networks, et cetera. 11:43.760 --> 11:50.680 And what's kind of nice about this is also that we can use a state machine to represent 11:50.680 --> 11:51.680 external components. 11:51.680 --> 11:57.760 So, obviously, we're not in writing the code for a GCS or S3 or something like that, 11:57.760 --> 12:05.040 but we can write a state machine that kind of acts as if we did, and we can control 12:05.040 --> 12:12.680 various failures and behaviors of those components, even if we don't fully control them. 12:12.680 --> 12:20.560 And so, like I said, work is produced on a tick and then transformed on receive. 12:20.560 --> 12:26.440 So, let's do this, I'll give you a real example of what this would look like in our testing 12:26.440 --> 12:27.440 harness. 12:27.440 --> 12:34.640 So, here we can see we've kind of initialized our queue with a bunch of ticks of our client. 12:34.640 --> 12:38.080 And again, our client is also just a state machine. 12:38.080 --> 12:45.880 And so, we tick and the client does its mutation of the state machine and ends up producing 12:45.880 --> 12:47.760 a new message onto the message queue. 12:47.760 --> 12:53.160 So, now the client says, I want the ingester to receive a write. 12:53.160 --> 12:59.920 And so, then the other components actually go about and receive this message, process it, 12:59.920 --> 13:00.920 and so on. 13:00.920 --> 13:10.080 And so, that's how, ultimately, we can produce a set of events in our system. 13:10.080 --> 13:14.680 And we can already see kind of, since we control this entire thing, I did this quite 13:14.720 --> 13:20.920 quickly, but like we can also, because we control all of this, we can reorder events randomly, 13:20.920 --> 13:22.440 we can drop events, right? 13:22.440 --> 13:28.040 Like we can do all sorts of interesting things that would be incredibly difficult to reproduce 13:28.040 --> 13:29.800 in a real environment, right? 13:29.800 --> 13:34.680 Potentially, potentially, we're testing even things that maybe are completely impossible, right? 13:34.680 --> 13:40.200 But maybe it's this one thing where something needs to be shut down for three years, and 13:40.200 --> 13:44.640 then come back online, and we can test all of those things. 13:45.400 --> 13:51.360 So, yeah, randomness, like I said, is something we need to control quite tightly, and something 13:51.360 --> 13:57.400 like this, and more generally, we need to write our entire system to be deterministic, right? 13:57.400 --> 14:03.520 So, this is something that's quite tricky to do, and so, obviously, there are kind 14:03.520 --> 14:09.960 of two kinds of randomness that would need to broadly care about quite deeply. 14:09.960 --> 14:14.120 One is, you know, I think the really obvious one, random number generators that are 14:14.120 --> 14:20.320 actually used by our software, that, you know, choose what next thing to execute or something 14:20.320 --> 14:21.320 like that. 14:21.320 --> 14:26.280 I'm obviously making something up, but, and the other thing is things that are implicit, 14:26.280 --> 14:31.720 so like hash map iterations, maybe different the next time that we run the same code, 14:31.720 --> 14:32.720 then it was the last time. 14:32.760 --> 14:41.160 So, we need to take a lot of care that, you know, every execution is actually deterministic, 14:41.160 --> 14:47.280 and every execution of the same input events will actually produce the same outputs. 14:47.280 --> 14:55.280 And so, one kind of neat trick that we found to ensure this is that we actually can test 14:55.280 --> 15:00.080 determinism, because what I've just said is, for the same inputs, we always must be getting 15:00.080 --> 15:06.960 the same outputs, and so, what we just do is we run the test harness with the same seed 15:06.960 --> 15:14.320 multiple times, and if that ends up generating a different set of events produced by those 15:14.320 --> 15:20.520 state machines again, then we've just proven that our system is in fact not deterministic, 15:20.520 --> 15:25.240 and then we can go and try to fix that determinism, right, so that we can make sure that 15:25.280 --> 15:30.720 when we do get a failure, it is actually deterministically reproducible. 15:30.720 --> 15:38.280 So, like I said, some of the, some things that we can do with deterministic simulation 15:38.280 --> 15:44.840 testing, because we control all of those four variables, including time, is we can say, 15:44.840 --> 15:50.280 you know, an ingester is supposed to tick, but it's actually only going to tick in six 15:50.280 --> 15:54.840 years, or I guess in five years, right, something that would be incredibly hard to do 15:54.840 --> 16:00.440 in real life, right, like how, who in their right mind would wait for actual five years of 16:00.440 --> 16:06.440 a production machine being online to test something, right, whereas for us it's just saying 16:06.440 --> 16:12.760 this tick has happened in five years in the future, and so this is really cool, because all 16:12.760 --> 16:22.120 the sudden we're no longer constrained by, like, real time, right, we're only constrained 16:22.120 --> 16:28.040 by how many CPU cycles can we actually run on a machine, right, and so what that means is we 16:28.040 --> 16:41.480 can simulate, you know, tens or hundreds of years of production sequences of events, every 16:41.480 --> 16:47.960 single day, right, and so this allows us to then find if there are any sequences that are, you 16:48.280 --> 16:57.160 potentially harmful for our system. So once again, let's look at an example of what could be 16:57.160 --> 17:01.720 interesting to simulate here, or what are the interesting things that are simulator might 17:01.720 --> 17:09.720 end up simulating, because, again, we originally randomly seed our simulator to try and explore 17:09.720 --> 17:16.360 random sequences of events, right, so one of the things that we could have is that, you know, 17:16.440 --> 17:24.680 we have on our message buzz, this message that our object storage is supposed to persist something, 17:24.680 --> 17:30.920 right, so one of the things that we can do is actually just say, this never happens, so ultimately 17:30.920 --> 17:36.440 what that means is the ingester should eventually retry, right, but another interesting that could 17:36.440 --> 17:45.880 happen is the right actually succeeds, but the acknowledgement is dropped, right, so all of these 17:46.200 --> 17:51.320 things, at any point in time, our simulator can decide, you know, now I'm not going to deliver 17:51.320 --> 17:55.880 this message, or now I'm going to restart this component, it's going to lose all of its state, 17:57.400 --> 18:02.440 and our system needs to be able to ultimately recover from this, and at all points in time, 18:02.440 --> 18:08.040 the invariance of our system might help must hold true, like snapshot isolation and what have 18:08.040 --> 18:18.200 you that we want to make sure are always true about our system, so I've talked a lot now about 18:18.200 --> 18:25.800 kind of the result or the ingredients that go into it, so what are some of the things that we found 18:25.800 --> 18:32.680 with this, so we've found some pretty devastating bugs using this methodology, so we found 18:33.000 --> 18:39.400 duplicate data being persisted in our system, we found cases of data loss, we found partial 18:39.400 --> 18:45.480 commits, we found random crashes of our system that shouldn't have happened, right, just about everything 18:45.480 --> 18:51.480 that you could possibly think of, and just so that you can actually visualize what this looks like 18:51.480 --> 18:58.200 in our CI, all that happens is that we generate a random seed at the start, we run our test 18:58.200 --> 19:05.560 suite with this random seed, it originally seeds our message queue with a number of actions, 19:05.560 --> 19:10.760 and then we just kind of let it go and do a thing, and like I said, we test a bunch of 19:12.920 --> 19:18.200 invariance about it the entire time, and if ever one of those variants don't hold true, 19:18.200 --> 19:24.920 we've found a bug, and one of the things that has been really interesting about this is that we 19:24.920 --> 19:31.000 actually have this true sequence of events, right, but it turns out this can be a lot of things, 19:31.000 --> 19:36.520 right, like we originally seeded with 64 events, but you know potentially hundreds of things can happen, 19:36.520 --> 19:43.240 and it can be quite difficult to understand what is the thing that has actually led to this bug, 19:43.240 --> 19:47.880 right, like yes we can now we're able to reproduce it, but still understanding what led to the 19:47.880 --> 19:53.640 bug is yet another thing, so one of the really cool things that we found to be really useful 19:54.200 --> 20:00.440 is that we have a sequence of events, right, so we can actually visualize this for you know 20:00.440 --> 20:08.360 our small tool humans to actually understand what has happened, so I'm just have a random example here, 20:08.360 --> 20:12.440 where we can see you know we have random sets of clients, random sets of ingestors, 20:13.160 --> 20:17.080 that all end up talking to each other, we have a random restart in the middle that happened 20:17.640 --> 20:24.200 and so yeah, this is just a random example of what you know a sequence like that may 20:24.200 --> 20:31.080 may look like, I'm going to have to go relatively quick, but like I said we've kind of tried this 20:31.080 --> 20:38.040 before and go, and with go what we had to do we kind of had to do you know unspeakable things 20:38.040 --> 20:45.960 to the runtime to make it deterministic, and that that did work, and it is kind of easier to do 20:46.040 --> 20:51.960 actually because once you make the runtime the deterministic you actually sort of get all the benefits 20:51.960 --> 21:00.360 out of that quite easily, however debugging with these state machines ends up being a little bit 21:00.360 --> 21:08.680 simpler, but one thing that definitely can be understated is changing your mental model about 21:08.760 --> 21:15.880 your system to always have to be within state machines is very different from the kind of you know 21:15.880 --> 21:23.080 suffering engineering that we've been exposed to before. I personally do think that this is a good 21:23.080 --> 21:28.120 thing because we always do think about the ultimate mutations in our system, but it can be quite 21:28.120 --> 21:34.680 difficult to kind of map what we want to do onto these state machines, and the simulator itself 21:34.680 --> 21:41.320 is actually really hard work, like it is software that needs as much care as any other test 21:41.320 --> 21:47.960 suite, any other software, so like if we start to neglect this it you know starts to start to show 21:47.960 --> 21:54.040 just like any other software. So yeah, the ST is not magic we need to put a lot of thought into it 21:54.840 --> 21:59.720 we thought about you know is there something better that we can do than just random seed that we start 22:00.200 --> 22:05.480 start with maybe we can start at some interesting state and then start to explore from there 22:05.480 --> 22:09.320 or we've previously determined this state is interesting and everything from there could be 22:09.320 --> 22:14.440 interesting to explore, right? There are a lot of things that can definitely be done. I said 22:14.440 --> 22:19.240 that I already said that we sometimes have to deal with a huge number of events and figure out 22:19.240 --> 22:26.280 what was the actual sequence, well we actually control the simulator, right? So we can try to shrink 22:27.000 --> 22:32.760 to the minimum amount of events that need to happen to actually reproduce the same 22:32.760 --> 22:39.800 same state, we call this the shrinking. So yeah with all of these ingredients and a lot of hard work 22:40.680 --> 22:48.520 you know we try to get rid of any unpredictable bugs and try to not have you know very critical 22:48.600 --> 22:52.920 bugs like this ever hit production through this testing. Thank you. 23:01.000 --> 23:03.720 Questions? We have a couple of questions. 23:03.720 --> 23:19.400 Thank you for the talk. I have a question when you have a specific seed. Can you speak up? I can't 23:19.400 --> 23:27.640 hear. If you have a seed that triggers a specific issue and you fix your state motions 23:27.720 --> 23:34.200 or can you ensure that the fix is correct because the seed will not match the new state 23:34.200 --> 23:42.200 machines to trigger the bug? Yeah great question. So basically you're saying how do we make sure 23:42.200 --> 23:49.800 that when we change the code that we're actually testing the same state space? So I mean we can 23:49.800 --> 23:54.520 test that by making sure that the log is the same, right? So if we actually do have the same 23:54.520 --> 24:03.080 the same sequence of events then we know that it is in fact the same thing. Obviously only 24:03.080 --> 24:11.000 leading up until the thing where we've proven that the invariant was violated, right? Does that make sense? 24:14.040 --> 24:21.800 I think you can have to talk outside because it's very interesting. Thank you very much. 24:24.520 --> 24:26.520 Thank you very much.