1 00:00:00,260 --> 00:00:04,605 >> Hi, everyone. My name is Tomek Masternak, 2 00:00:04,605 --> 00:00:07,890 I'm an engineer on the particular software, 3 00:00:07,890 --> 00:00:10,335 when I build distributed systems and 4 00:00:10,335 --> 00:00:13,755 platform before building distributed systems. 5 00:00:13,755 --> 00:00:16,770 In this presentation, I will be talking about, 6 00:00:16,770 --> 00:00:19,140 checking safety in Exactly-Once, 7 00:00:19,140 --> 00:00:21,270 which is another library that 8 00:00:21,270 --> 00:00:25,230 I built together with my friend Szymon Pobiega. 9 00:00:25,230 --> 00:00:29,010 Let me start with describing the problem and 10 00:00:29,010 --> 00:00:33,175 the challenge distributed system builders face, 11 00:00:33,175 --> 00:00:37,110 when they use messaging infrastructure. 12 00:00:38,810 --> 00:00:42,570 What is currently at the factor standard, 13 00:00:42,570 --> 00:00:45,230 both when it comes to the infrastructure which is 14 00:00:45,230 --> 00:00:49,070 available On-prem or in the Cloud, 15 00:00:49,070 --> 00:00:53,335 is At-Least-Once message delivery guarantee. 16 00:00:53,335 --> 00:00:58,100 That changed quite sometime in the past, 17 00:00:58,100 --> 00:01:00,320 before we used to have a two-phase commit 18 00:01:00,320 --> 00:01:03,485 available and distributed transactions. 19 00:01:03,485 --> 00:01:06,170 That's pretty much no longer the case 20 00:01:06,170 --> 00:01:08,840 and currently At-Least-Once message delivery 21 00:01:08,840 --> 00:01:13,250 is that reality that everyone has to work with. 22 00:01:13,250 --> 00:01:18,425 Now, At-Least-Once message delivery 23 00:01:18,425 --> 00:01:21,965 can be challenging from the system builder perspective 24 00:01:21,965 --> 00:01:25,910 especially when writing logic for handling messages that 25 00:01:25,910 --> 00:01:28,340 arrive or are being pushed from 26 00:01:28,340 --> 00:01:31,370 the messaging infrastructure to the receivers. 27 00:01:31,370 --> 00:01:34,730 Let us go through our hypothetical situation just to 28 00:01:34,730 --> 00:01:38,865 see what kind of situations I'm talking about. 29 00:01:38,865 --> 00:01:41,720 Let's assume that there are some messages 30 00:01:41,720 --> 00:01:46,190 in flight one, two and three and that is the order 31 00:01:46,190 --> 00:01:47,570 in which they are stored in the 32 00:01:47,570 --> 00:01:50,950 messaging infrastructure or in the queue. 33 00:01:50,950 --> 00:01:53,070 Because of the way, 34 00:01:53,070 --> 00:01:55,095 how the At-Least-Once message delivery 35 00:01:55,095 --> 00:01:58,145 works is perfectly possible for 36 00:01:58,145 --> 00:02:00,500 a message to be read, delivered, or 37 00:02:00,500 --> 00:02:04,380 delivered multiple times to the receiver. 38 00:02:06,170 --> 00:02:09,500 This is usually a simple consequence of 39 00:02:09,500 --> 00:02:12,710 the fact that messaging infrastructure will 40 00:02:12,710 --> 00:02:15,540 not consider a message to be delivered 41 00:02:15,540 --> 00:02:18,600 until it gets an acknowledgment from the receiver. 42 00:02:18,600 --> 00:02:21,410 Whenever the acknowledgment does not arrive 43 00:02:21,410 --> 00:02:24,215 or does not arrive in a given period of time, 44 00:02:24,215 --> 00:02:26,475 the message will reappear in the queue and 45 00:02:26,475 --> 00:02:29,365 will be processed or delivered once again. 46 00:02:29,365 --> 00:02:31,610 From the processing perspective, 47 00:02:31,610 --> 00:02:34,100 what is possible is that some messages 48 00:02:34,100 --> 00:02:37,570 get duplicated in our situation, it's message two. 49 00:02:37,570 --> 00:02:42,470 What's also possible, is reordering and this usually 50 00:02:42,470 --> 00:02:46,115 happens when a messaging infrastructure has 51 00:02:46,115 --> 00:02:50,925 some delivery or processing timeout. 52 00:02:50,925 --> 00:02:54,065 Whenever our messages beat out by the receiver, 53 00:02:54,065 --> 00:02:58,595 the timeout starts and if it reaches a threshold, 54 00:02:58,595 --> 00:03:01,340 that message will reappear in the queue 55 00:03:01,340 --> 00:03:04,625 one more time and potentially in the meantime, 56 00:03:04,625 --> 00:03:07,210 other messages might be successfully processed, 57 00:03:07,210 --> 00:03:10,610 so the effective ordering of processing that the receiver 58 00:03:10,610 --> 00:03:14,890 had will be different than the original ordering. 59 00:03:14,890 --> 00:03:18,440 Finally, both of those situations can happen 60 00:03:18,440 --> 00:03:21,590 independently at the same time, 61 00:03:21,590 --> 00:03:27,710 so via BN situation is perfectly fine for the messages 62 00:03:27,710 --> 00:03:31,250 to reorder and it's perfectly fine for messages to 63 00:03:31,250 --> 00:03:34,130 be duplicated and now that can be 64 00:03:34,130 --> 00:03:38,010 a challenge from the system builder perspective. 65 00:03:38,240 --> 00:03:41,825 A very common and standard solution to 66 00:03:41,825 --> 00:03:45,220 that situation is item potency, 67 00:03:45,220 --> 00:03:47,450 which basically means that whenever 68 00:03:47,450 --> 00:03:49,670 a duplicate of the message is received, 69 00:03:49,670 --> 00:03:52,790 the end result of processing that message should be the 70 00:03:52,790 --> 00:03:57,335 same as for the first processing of that message. 71 00:03:57,335 --> 00:04:01,565 Very often this is understood or this boils down 72 00:04:01,565 --> 00:04:05,905 to the message processing logic being deterministic. 73 00:04:05,905 --> 00:04:12,080 This is often the case in remote procedure 74 00:04:12,080 --> 00:04:14,690 call type of systems when there is 75 00:04:14,690 --> 00:04:17,690 only a single message in flight 76 00:04:17,690 --> 00:04:20,780 or a single call in flight and that call will be 77 00:04:20,780 --> 00:04:23,450 repeated over and over again until 78 00:04:23,450 --> 00:04:25,580 a successful acknowledgment comes in 79 00:04:25,580 --> 00:04:28,785 from the remoting location. 80 00:04:28,785 --> 00:04:32,120 Now, that might be not enough if there are 81 00:04:32,120 --> 00:04:35,420 more than one messaging flight and this is 82 00:04:35,420 --> 00:04:37,460 very often the case when we are building 83 00:04:37,460 --> 00:04:42,090 and modeling business processes. 84 00:04:42,090 --> 00:04:44,660 This is a very simple example which 85 00:04:44,660 --> 00:04:47,395 shows worldwide what might be the problem. 86 00:04:47,395 --> 00:04:50,630 What we are looking at here is a simple 87 00:04:50,630 --> 00:04:54,010 sequence diagram showing the behavior in 88 00:04:54,010 --> 00:04:57,220 a system which models 89 00:04:57,220 --> 00:05:01,025 or implements shooting range game. 90 00:05:01,025 --> 00:05:03,110 There is a handlers shooting range 91 00:05:03,110 --> 00:05:04,850 which is responsible for holding, 92 00:05:04,850 --> 00:05:05,990 a piece of spade which is 93 00:05:05,990 --> 00:05:12,600 the current target position of the target. 94 00:05:13,460 --> 00:05:18,450 The players can attempt to fire 95 00:05:18,450 --> 00:05:20,970 at the given position and the shooting range 96 00:05:20,970 --> 00:05:23,960 is responsible for sending back a response 97 00:05:23,960 --> 00:05:28,270 saying whether the attempt was successful or fail. 98 00:05:28,270 --> 00:05:31,550 So what we can see here is a situation 99 00:05:31,550 --> 00:05:33,260 in which we have a shooting range 100 00:05:33,260 --> 00:05:36,650 and the shooting range is set to target position 42 101 00:05:36,650 --> 00:05:41,500 and then we have a fire that fires at position 42. 102 00:05:41,500 --> 00:05:46,170 The response from that call is a hit. 103 00:05:46,170 --> 00:05:49,670 After that, the target is being moved to position one 104 00:05:49,670 --> 00:05:52,910 and finally, an interesting situation happens 105 00:05:52,910 --> 00:05:55,810 because we get a duplicate and duplicated fire 106 00:05:55,810 --> 00:05:58,525 at message which was already processed once, 107 00:05:58,525 --> 00:06:01,205 but then it's being reprocessed one more time. 108 00:06:01,205 --> 00:06:05,450 Logic is deterministic but the response that we've had is a miss. 109 00:06:05,450 --> 00:06:08,090 What happened in that scenario is that we 110 00:06:08,090 --> 00:06:13,790 have a single logical message being delivered twice 111 00:06:13,790 --> 00:06:16,370 and each of the processing of this 112 00:06:16,370 --> 00:06:22,220 message results in a contradictory result, 113 00:06:22,220 --> 00:06:25,970 contradictory side effect, which is usually 114 00:06:25,970 --> 00:06:27,770 far from what we are expecting 115 00:06:27,770 --> 00:06:30,900 from the systems that we're building. 116 00:06:31,780 --> 00:06:36,320 The obvious problem that we have here is that, 117 00:06:36,320 --> 00:06:38,840 even though the logical is deterministic, 118 00:06:38,840 --> 00:06:42,440 the state changed significantly in between 119 00:06:42,440 --> 00:06:44,300 the first processing and the duplicate 120 00:06:44,300 --> 00:06:46,945 which resulted in that behavior. 121 00:06:46,945 --> 00:06:51,590 What it shows is that the deterministic logic is not enough, 122 00:06:51,590 --> 00:06:55,230 we also need to think about the state. 123 00:06:55,230 --> 00:06:58,295 Either we need to capture the side effects 124 00:06:58,295 --> 00:07:00,980 and basically we'll do them whenever a duplicate 125 00:07:00,980 --> 00:07:03,680 comes in or we need to be 126 00:07:03,680 --> 00:07:06,920 able to get a handle at the historical state, 127 00:07:06,920 --> 00:07:10,890 as it was when the message first arrived. 128 00:07:12,450 --> 00:07:16,180 What we are assuming in the library is that 129 00:07:16,180 --> 00:07:18,370 the distributed system that we are talking 130 00:07:18,370 --> 00:07:20,980 about is field of bunch of handlers, 131 00:07:20,980 --> 00:07:25,645 each handler has a separate dedicated piece of state. 132 00:07:25,645 --> 00:07:29,080 That handler is sole holder and 133 00:07:29,080 --> 00:07:31,750 the only possibility to communicate 134 00:07:31,750 --> 00:07:35,140 between the handlers is via sending messages. 135 00:07:35,140 --> 00:07:37,705 What each handler is doing is that it 136 00:07:37,705 --> 00:07:40,285 picks up the message from the input queue, 137 00:07:40,285 --> 00:07:42,040 executes the business logic, 138 00:07:42,040 --> 00:07:43,930 which results in state update 139 00:07:43,930 --> 00:07:47,090 and puts a message in the output queue. 140 00:07:47,400 --> 00:07:56,305 The main idea or goal of the library was, as stated here, 141 00:07:56,305 --> 00:07:59,245 to make sure that the observable side effects 142 00:07:59,245 --> 00:08:01,705 correspond to some serial execution of 143 00:08:01,705 --> 00:08:06,130 input messages with atomic commit guarantee 144 00:08:06,130 --> 00:08:08,545 between the business storage and the output queue. 145 00:08:08,545 --> 00:08:10,810 Basically what we wanted to make sure is 146 00:08:10,810 --> 00:08:13,930 that whenever a duplicate comes in, 147 00:08:13,930 --> 00:08:17,170 the side effects that we'll produce on the state 148 00:08:17,170 --> 00:08:20,875 and in terms of sending of the output messages, 149 00:08:20,875 --> 00:08:29,380 is the same as if there was a single processing and 150 00:08:29,380 --> 00:08:34,100 the ordering is corresponding to some possible serial execution. 151 00:08:34,170 --> 00:08:40,730 The implementation of the library is based on the Azure Stack. 152 00:08:41,280 --> 00:08:44,935 If they run in Azure Function host, 153 00:08:44,935 --> 00:08:48,310 which basically means now the whole will be messaging 154 00:08:48,310 --> 00:08:50,845 infrastructure available for Azure functions 155 00:08:50,845 --> 00:08:52,885 is available there as well. 156 00:08:52,885 --> 00:08:56,635 The state is stored in Azure Cosmos DB. 157 00:08:56,635 --> 00:09:01,150 Specifically, what do we rely on in terms of 158 00:09:01,150 --> 00:09:03,520 Azure Cosmos DB is that it 159 00:09:03,520 --> 00:09:06,130 provides optimistic concurrency control, 160 00:09:06,130 --> 00:09:08,650 which is pretty important or the key part 161 00:09:08,650 --> 00:09:11,710 of the approach that we are taking. 162 00:09:11,710 --> 00:09:18,250 We store our infrastructural library-driven data 163 00:09:18,250 --> 00:09:20,860 in a separate logical partition, 164 00:09:20,860 --> 00:09:23,470 which also comes with a consequence which is that 165 00:09:23,470 --> 00:09:25,810 there is no atomic transactions between 166 00:09:25,810 --> 00:09:28,735 the business state and the infrastructural state. 167 00:09:28,735 --> 00:09:31,930 We assume that the users are using 168 00:09:31,930 --> 00:09:35,830 Azure Cosmos DB with session consistency model. 169 00:09:35,830 --> 00:09:39,730 Finally, high performance scenarios are out of scope. 170 00:09:39,730 --> 00:09:41,845 It's not something that we focused on. 171 00:09:41,845 --> 00:09:46,225 Good enough performance, whatever that means, was our goal. 172 00:09:46,225 --> 00:09:50,590 Now, the general idea of the algorithm is as follows. 173 00:09:50,590 --> 00:09:52,240 Whenever a message comes in, 174 00:09:52,240 --> 00:09:54,385 we check whether we already processed it. 175 00:09:54,385 --> 00:09:56,305 If not, we load the state. 176 00:09:56,305 --> 00:09:58,270 We run the business logic 177 00:09:58,270 --> 00:10:01,675 and we do not send out the output messages, 178 00:10:01,675 --> 00:10:04,735 but we capture them and store in this form. 179 00:10:04,735 --> 00:10:08,515 Now, we attach a single correlation ID 180 00:10:08,515 --> 00:10:10,945 with the business piece of data, 181 00:10:10,945 --> 00:10:17,050 and then we commit it to the business logic or partition. 182 00:10:17,050 --> 00:10:20,950 This is where we do the optimistic concurrency trade. 183 00:10:20,950 --> 00:10:24,970 We do that only if that entity did not 184 00:10:24,970 --> 00:10:29,500 change since we loaded it from the store. 185 00:10:29,500 --> 00:10:31,270 If that is successful, 186 00:10:31,270 --> 00:10:36,970 this is basically the point of no return. 187 00:10:36,970 --> 00:10:41,065 The atomic commit protocol, once that is done, 188 00:10:41,065 --> 00:10:45,010 the both parts, which is the business state which is 189 00:10:45,010 --> 00:10:47,515 already committed and the output messages 190 00:10:47,515 --> 00:10:49,510 will be committed eventual. 191 00:10:49,510 --> 00:10:52,840 Now, an important thing here to 192 00:10:52,840 --> 00:10:56,200 notice is that during some failure situations, 193 00:10:56,200 --> 00:10:58,810 what can happen is that the output messages 194 00:10:58,810 --> 00:11:01,375 might have been sent out multiple times. 195 00:11:01,375 --> 00:11:04,390 However, because the duplicates can 196 00:11:04,390 --> 00:11:09,145 happen with At-Least-Once message delivery already, 197 00:11:09,145 --> 00:11:12,955 we can say that we are hiding behind that failure mode. 198 00:11:12,955 --> 00:11:16,165 Basically, because the duplicates can be already there, 199 00:11:16,165 --> 00:11:17,650 we can produce duplicates because 200 00:11:17,650 --> 00:11:20,380 whoever is processing them should be 201 00:11:20,380 --> 00:11:24,350 and is expected to cope with that situation. 202 00:11:25,230 --> 00:11:29,335 That was the general idea 203 00:11:29,335 --> 00:11:32,140 about the problem and the solution 204 00:11:32,140 --> 00:11:35,005 that we seek and wanted to implement. 205 00:11:35,005 --> 00:11:42,100 Now I wanted to talk about the TLA+ Model 206 00:11:42,100 --> 00:11:47,540 and how we use that model to validate the safety. 207 00:11:48,210 --> 00:11:52,000 The goals of the model was to make sure that 208 00:11:52,000 --> 00:11:56,050 our algorithm works not only in 209 00:11:56,050 --> 00:11:59,990 the happy path but also on in the things to implement. 210 00:12:00,360 --> 00:12:04,195 As we all know, it's very easy to 211 00:12:04,195 --> 00:12:07,735 show that the algorithm sometimes works, 212 00:12:07,735 --> 00:12:10,390 but it's actually very tricky to figure 213 00:12:10,390 --> 00:12:13,705 out all the further case scenario. 214 00:12:13,705 --> 00:12:19,810 We wanted to make sure that we have some validation, 215 00:12:19,810 --> 00:12:24,415 some model checking down to make sure that 216 00:12:24,415 --> 00:12:28,180 we actually are convinced ourselves that 217 00:12:28,180 --> 00:12:31,480 the algorithm does what we wanted it to do. 218 00:12:31,480 --> 00:12:35,885 Specifically what we wanted to check was the safety properties. 219 00:12:35,885 --> 00:12:41,880 Those safety properties were described as atomic commit. 220 00:12:41,880 --> 00:12:43,710 Basically we wanted to make sure 221 00:12:43,710 --> 00:12:55,450 that when we commit to the business logic store we 222 00:12:55,450 --> 00:12:58,840 eventually push out the changes to the output queue, 223 00:12:58,840 --> 00:13:04,090 and also made changes to be the input queue, 224 00:13:04,090 --> 00:13:06,610 and also we wanted to make sure that 225 00:13:06,610 --> 00:13:13,045 the side effects are consistent in a way that, 226 00:13:13,045 --> 00:13:18,820 even for duplicates, there is only one change 227 00:13:18,820 --> 00:13:23,740 in the business state and the output messages 228 00:13:23,740 --> 00:13:27,280 and the business state change were made over 229 00:13:27,280 --> 00:13:30,880 the same version of the business state, which I think 230 00:13:30,880 --> 00:13:35,210 will be a bit clearer when we get to the formulas. 231 00:13:35,250 --> 00:13:39,130 Finally, we wanted to make sure that everything 232 00:13:39,130 --> 00:13:43,420 works as expected even without atomic transaction 233 00:13:43,420 --> 00:13:45,490 between different logical partitions. 234 00:13:45,490 --> 00:13:46,780 Because as I stated, 235 00:13:46,780 --> 00:13:49,480 we start the business logic and 236 00:13:49,480 --> 00:13:55,130 the infrastructural data in a separate partitions. 237 00:13:56,130 --> 00:14:00,760 Obviously, there were most of the things that were 238 00:14:00,760 --> 00:14:03,415 in the implementation did not make it to the model, 239 00:14:03,415 --> 00:14:05,200 and specifically we did not model 240 00:14:05,200 --> 00:14:07,419 the cleanup logic which is responsible 241 00:14:07,419 --> 00:14:09,970 for clearing the infrastructural data 242 00:14:09,970 --> 00:14:11,920 that is stored by the library. 243 00:14:11,920 --> 00:14:14,875 We did not model the model resistance that are used 244 00:14:14,875 --> 00:14:18,249 to lower the chance of concurrency and collisions, 245 00:14:18,249 --> 00:14:21,550 optimistic concurrency check failures basically. 246 00:14:21,550 --> 00:14:24,320 We also did not model the exponential backlogs 247 00:14:24,320 --> 00:14:26,360 and processing retries which are there 248 00:14:26,360 --> 00:14:28,685 to make the library a bit 249 00:14:28,685 --> 00:14:32,985 more performing in the failure scenarios. 250 00:14:32,985 --> 00:14:36,650 Let's look at the specification. 251 00:14:36,650 --> 00:14:39,440 This is the main part of 252 00:14:39,440 --> 00:14:43,440 the process that models the single handler. 253 00:14:43,440 --> 00:14:47,900 Most of the labels are folded 254 00:14:47,900 --> 00:14:52,250 and that shows the main structural specification. 255 00:14:52,250 --> 00:14:55,130 What we have here is the infinite while loop 256 00:14:55,130 --> 00:14:58,060 and in that loop the process is trying to 257 00:14:58,060 --> 00:15:00,565 input message forms the input queue. 258 00:15:00,565 --> 00:15:06,890 Then it checks by looking at the transaction field on the state, 259 00:15:06,890 --> 00:15:12,890 whether it needs to redo some of the transactional steps to 260 00:15:12,890 --> 00:15:16,220 basically push out the changes that 261 00:15:16,220 --> 00:15:22,345 are still there and not pushed to the output queue. 262 00:15:22,345 --> 00:15:24,470 After doing that, we check 263 00:15:24,470 --> 00:15:26,530 whether we already processed the message. 264 00:15:26,530 --> 00:15:28,685 If we did not process the message, 265 00:15:28,685 --> 00:15:31,250 we model the business logic execution 266 00:15:31,250 --> 00:15:34,025 capturing the side effects storing them, 267 00:15:34,025 --> 00:15:36,305 the optimistic concurrency commits 268 00:15:36,305 --> 00:15:38,270 to the business store and then pushing 269 00:15:38,270 --> 00:15:44,940 out the transaction until it's finished. 270 00:15:46,020 --> 00:15:49,900 The main reason why we use plus calc was that 271 00:15:49,900 --> 00:15:55,220 this syntax was more familiar for us and secondly, 272 00:15:55,220 --> 00:15:57,860 that it actually corresponds pretty well 273 00:15:57,860 --> 00:16:02,100 when it comes to the implementation that was done in C#. 274 00:16:02,100 --> 00:16:05,600 We were able to look at the code side-by-side 275 00:16:05,600 --> 00:16:08,590 and the specification corresponded 276 00:16:08,590 --> 00:16:10,940 pretty well with what we could see on 277 00:16:10,940 --> 00:16:15,815 the screen looking at the real implementation. 278 00:16:15,815 --> 00:16:19,200 When it comes to state modeling, 279 00:16:19,200 --> 00:16:22,205 there wasn't that much happening. 280 00:16:22,205 --> 00:16:24,590 Two important notices is 281 00:16:24,590 --> 00:16:27,080 that the input queue was modeled as a set, 282 00:16:27,080 --> 00:16:34,030 but that set had a records with two fields, 283 00:16:34,030 --> 00:16:36,280 One which was the logical message ID 284 00:16:36,280 --> 00:16:37,960 and duplicate message ID. 285 00:16:37,960 --> 00:16:41,780 The way how we model the duplicates was 286 00:16:41,780 --> 00:16:46,705 basically by starting off with duplicates already put queue. 287 00:16:46,705 --> 00:16:49,665 When it comes to the business stack modeling, 288 00:16:49,665 --> 00:16:53,960 we were a starting a sequence which was a history of 289 00:16:53,960 --> 00:16:55,940 snapshots of the states as they 290 00:16:55,940 --> 00:16:58,980 were throughout the execution of the algorithm, 291 00:16:58,980 --> 00:17:03,270 and that was pretty useful when writing the invariants. 292 00:17:03,600 --> 00:17:07,320 We also modeled the diversion 293 00:17:07,320 --> 00:17:09,600 for the optimistic concurrency check. 294 00:17:09,600 --> 00:17:12,150 All the other bits were used to model 295 00:17:12,150 --> 00:17:14,699 the infrastructural storage 296 00:17:14,699 --> 00:17:19,330 and the output to which was also set. 297 00:17:19,350 --> 00:17:22,360 Another interesting thing is that we did not use is 298 00:17:22,360 --> 00:17:26,185 a default termination formula. 299 00:17:26,185 --> 00:17:28,435 What we said was that 300 00:17:28,435 --> 00:17:32,650 the processes were running in the infinite loops. 301 00:17:32,650 --> 00:17:34,960 And the power termination of 302 00:17:34,960 --> 00:17:37,750 the algorithm was a situation in which 303 00:17:37,750 --> 00:17:40,930 all the processes were in the lock-in 304 00:17:40,930 --> 00:17:44,230 message and the input queue was empty. 305 00:17:44,230 --> 00:17:46,660 Basically, the termination was modeling 306 00:17:46,660 --> 00:17:49,600 the situation in which the input cue is trained 307 00:17:49,600 --> 00:17:54,715 and none of the handler is doing anything in specific. 308 00:17:54,715 --> 00:17:59,000 But waiting for anything to appear in that input. 309 00:17:59,400 --> 00:18:03,100 Finally, the safety invariants. 310 00:18:03,100 --> 00:18:06,025 This is what we ended up when it comes to safety. 311 00:18:06,025 --> 00:18:13,150 That basically models or expresses the safety property, 312 00:18:13,150 --> 00:18:16,765 which makes sure that for any logical message 313 00:18:16,765 --> 00:18:20,520 there is AtMostOneStateChange happening to 314 00:18:20,520 --> 00:18:23,910 the business store and most once message 315 00:18:23,910 --> 00:18:28,635 being produced and sent to the output cue. 316 00:18:28,635 --> 00:18:31,780 Those are the first two parts, 317 00:18:32,040 --> 00:18:35,875 one on the top and one in the middle. 318 00:18:35,875 --> 00:18:39,670 Finally, the one which is called consistent state 319 00:18:39,670 --> 00:18:44,335 and output says that if the message is fully processed, 320 00:18:44,335 --> 00:18:51,670 the version on which the diversion for the business state 321 00:18:51,670 --> 00:18:54,550 and the output message should be the same. 322 00:18:54,550 --> 00:19:00,459 Basically we make sure that those two pieces of side-effects 323 00:19:00,459 --> 00:19:04,435 are consistent in a way that they operated and were 324 00:19:04,435 --> 00:19:06,910 used by a business logic operating on 325 00:19:06,910 --> 00:19:09,830 the same version of the business set. 326 00:19:11,310 --> 00:19:16,225 Another interesting bit is how we modeled failures. 327 00:19:16,225 --> 00:19:19,225 What we ended up was a combat role, 328 00:19:19,225 --> 00:19:21,790 but basically was a single instruction 329 00:19:21,790 --> 00:19:24,160 which was filled two main loop, 330 00:19:24,160 --> 00:19:27,760 which basically meant that model 331 00:19:27,760 --> 00:19:29,980 the situation in which either the process 332 00:19:29,980 --> 00:19:32,500 restarts or some exception is being 333 00:19:32,500 --> 00:19:36,310 thrown and we basically should have the top of 334 00:19:36,310 --> 00:19:41,605 the stack and go back looking in that infinite fail loop. 335 00:19:41,605 --> 00:19:44,755 The way that we used it was that whenever 336 00:19:44,755 --> 00:19:48,235 a failure was possible or we wanted to model at failure, 337 00:19:48,235 --> 00:19:51,820 we would do either or expression. 338 00:19:51,820 --> 00:19:55,750 For instance, in the commit state macro we can see that if 339 00:19:55,750 --> 00:20:01,795 the concurrent optimistic concurrency check results in intro, 340 00:20:01,795 --> 00:20:05,875 we either commit right to the store or we fail. 341 00:20:05,875 --> 00:20:07,900 That was a general approach 342 00:20:07,900 --> 00:20:11,665 to monitoring those kinds of failures. 343 00:20:11,665 --> 00:20:15,790 Okay, so now a few words about the results. 344 00:20:15,790 --> 00:20:18,895 There were expected and unexpected results. 345 00:20:18,895 --> 00:20:20,905 When it comes to the expected results, 346 00:20:20,905 --> 00:20:23,020 those were obviously bugs to be 347 00:20:23,020 --> 00:20:25,690 found in various parts of the algorithm. 348 00:20:25,690 --> 00:20:29,380 Specifically in the post-failure commits and also 349 00:20:29,380 --> 00:20:33,700 in the logic that is signs, the transaction id. 350 00:20:33,700 --> 00:20:36,985 As I said, that was something that we 351 00:20:36,985 --> 00:20:40,135 wanted to get from the model check-ins. 352 00:20:40,135 --> 00:20:42,790 So that's why I called it expected. 353 00:20:42,790 --> 00:20:46,270 I don't want to deprecate the value of that 354 00:20:46,270 --> 00:20:49,510 because that was exactly what we want. 355 00:20:49,510 --> 00:20:52,810 However, there were also unexpected bits. 356 00:20:52,810 --> 00:20:56,110 I would say that the main unexpected bit was 357 00:20:56,110 --> 00:21:00,295 the fact that when we started off writing the model, 358 00:21:00,295 --> 00:21:02,530 we realized that we don't really 359 00:21:02,530 --> 00:21:06,190 understand what is it that we are actually doing. 360 00:21:06,190 --> 00:21:11,020 So it actually forced us to distill the algorithm to 361 00:21:11,020 --> 00:21:13,930 its essentials and to make sure 362 00:21:13,930 --> 00:21:15,850 whether the assumptions that 363 00:21:15,850 --> 00:21:18,340 we are making which are pretty important and 364 00:21:18,340 --> 00:21:24,745 also what are the most important steps in that algorithm. 365 00:21:24,745 --> 00:21:28,690 As a follow-up of doubt, we realized that there are 366 00:21:28,690 --> 00:21:31,960 actually some extensions that we could also use. 367 00:21:31,960 --> 00:21:34,660 For instance, I already said that one of 368 00:21:34,660 --> 00:21:37,860 the main assumptions that we are making is 369 00:21:37,860 --> 00:21:40,140 that we don't need any kind of 370 00:21:40,140 --> 00:21:43,845 a preparer phase for the output queue, 371 00:21:43,845 --> 00:21:46,565 because we can always commit to that output queue. 372 00:21:46,565 --> 00:21:49,285 We can always send a message there. 373 00:21:49,285 --> 00:21:52,060 The duplicates can happen there because 374 00:21:52,060 --> 00:21:54,100 there's already that failure mode, 375 00:21:54,100 --> 00:21:56,230 which we need to cope with. 376 00:21:56,230 --> 00:21:58,840 But what we realized is that there might be 377 00:21:58,840 --> 00:22:02,485 some other resources that are always able to commit. 378 00:22:02,485 --> 00:22:04,540 So for instance, if we have 379 00:22:04,540 --> 00:22:09,730 some write once store, so for instance, 380 00:22:09,730 --> 00:22:15,625 a blobs store with a [inaudible] blobs that can be 381 00:22:15,625 --> 00:22:17,530 written only once that could be 382 00:22:17,530 --> 00:22:19,645 an expression to that algorithm as well. 383 00:22:19,645 --> 00:22:23,210 That could be a supported side effect 384 00:22:24,080 --> 00:22:28,240 that we could apply the same way 385 00:22:28,240 --> 00:22:30,310 as we are sending out the messages. 386 00:22:30,310 --> 00:22:35,440 Finally, what we realized is that we did not know 387 00:22:35,440 --> 00:22:37,750 the cost of [inaudible] well enough and all 388 00:22:37,750 --> 00:22:40,600 the memory model implications. 389 00:22:40,600 --> 00:22:42,880 So we had to go back to the drawing board, 390 00:22:42,880 --> 00:22:45,400 and understand what doesn't mean that we are 391 00:22:45,400 --> 00:22:52,120 running in a session memory model. 392 00:22:52,120 --> 00:22:55,960 Also that there are no atomic transactions 393 00:22:55,960 --> 00:22:58,045 between the logical preparations. 394 00:22:58,045 --> 00:23:02,890 Finally, another unexpected thing 395 00:23:02,890 --> 00:23:07,240 was that in order to create a meaningful model, 396 00:23:07,240 --> 00:23:08,980 we had to understand what are 397 00:23:08,980 --> 00:23:12,070 the concurrency characteristics of the algorithm. 398 00:23:12,070 --> 00:23:13,630 What are the parts that can happen 399 00:23:13,630 --> 00:23:16,165 independently and concurrently? 400 00:23:16,165 --> 00:23:18,820 But even more importantly, we had to learn 401 00:23:18,820 --> 00:23:21,910 what are the failure modes or what are 402 00:23:21,910 --> 00:23:24,070 the failure situations that may happen in 403 00:23:24,070 --> 00:23:26,500 different parts of the system as well. 404 00:23:26,500 --> 00:23:29,980 So it actually forced us to understand the technology 405 00:23:29,980 --> 00:23:35,840 that we're using are better than we knew it beforehand. 406 00:23:36,870 --> 00:23:42,070 So we did the validation, we did the model checking. 407 00:23:42,070 --> 00:23:48,700 An interesting note here is that whenever there was a bug 408 00:23:48,700 --> 00:23:51,895 in a specific location or in the algorithm, 409 00:23:51,895 --> 00:23:56,425 we were able to find it on very small models. 410 00:23:56,425 --> 00:23:57,880 And by very small models, 411 00:23:57,880 --> 00:24:01,270 I mean a single message with a single duplicate 412 00:24:01,270 --> 00:24:06,970 and a single process. Almost that. 413 00:24:06,970 --> 00:24:12,370 In that sense, the running time of model checking 414 00:24:12,370 --> 00:24:14,770 was never a problem for us because whenever 415 00:24:14,770 --> 00:24:19,450 there were bugs, we would get a feedback immediately. 416 00:24:19,450 --> 00:24:22,270 Now that being said, after we settled on 417 00:24:22,270 --> 00:24:27,070 an algorithm and we checked it with a smaller model. 418 00:24:27,070 --> 00:24:30,235 We also went with a bigger one. 419 00:24:30,235 --> 00:24:32,665 The bigger that we did was 420 00:24:32,665 --> 00:24:37,120 two processes with six messages in the input cue, 421 00:24:37,120 --> 00:24:39,430 those are numbers in terms of 422 00:24:39,430 --> 00:24:42,160 number of space that we ended up with 423 00:24:42,160 --> 00:24:46,360 and that was running roughly two-and-a-half hours 424 00:24:46,360 --> 00:24:50,200 on a laptop, my personal laptop, that I would consider 425 00:24:50,200 --> 00:24:55,490 a high-end laptop based on current standards. 426 00:24:56,220 --> 00:25:01,210 That's it that I prepared for you in this presentation. 427 00:25:01,210 --> 00:25:04,330 If anyone wants to know more about 428 00:25:04,330 --> 00:25:07,855 the work that we did around TLA class and the library. 429 00:25:07,855 --> 00:25:11,620 We are posting about that at exactly-once.github.io. 430 00:25:11,620 --> 00:25:15,250 I'm also available on Twitter at Masternak. 431 00:25:15,250 --> 00:25:18,625 If anyone wants to chat, please reach out to me. 432 00:25:18,625 --> 00:25:23,120 I'd really like to share ideas. Thank you very much.