1
00:00:00,260 --> 00:00:04,605
>> Hi, everyone. My name
is Tomek Masternak,

2
00:00:04,605 --> 00:00:07,890
I'm an engineer on the
particular software,

3
00:00:07,890 --> 00:00:10,335
when I build distributed systems and

4
00:00:10,335 --> 00:00:13,755
platform before building
distributed systems.

5
00:00:13,755 --> 00:00:16,770
In this presentation, I
will be talking about,

6
00:00:16,770 --> 00:00:19,140
checking safety in Exactly-Once,

7
00:00:19,140 --> 00:00:21,270
which is another library that

8
00:00:21,270 --> 00:00:25,230
I built together with my
friend Szymon Pobiega.

9
00:00:25,230 --> 00:00:29,010
Let me start with
describing the problem and

10
00:00:29,010 --> 00:00:33,175
the challenge distributed
system builders face,

11
00:00:33,175 --> 00:00:37,110
when they use messaging
infrastructure.

12
00:00:38,810 --> 00:00:42,570
What is currently at
the factor standard,

13
00:00:42,570 --> 00:00:45,230
both when it comes to the
infrastructure which is

14
00:00:45,230 --> 00:00:49,070
available On-prem or in the Cloud,

15
00:00:49,070 --> 00:00:53,335
is At-Least-Once message
delivery guarantee.

16
00:00:53,335 --> 00:00:58,100
That changed quite
sometime in the past,

17
00:00:58,100 --> 00:01:00,320
before we used to have
a two-phase commit

18
00:01:00,320 --> 00:01:03,485
available and distributed
transactions.

19
00:01:03,485 --> 00:01:06,170
That's pretty much no longer the case

20
00:01:06,170 --> 00:01:08,840
and currently At-Least-Once
message delivery

21
00:01:08,840 --> 00:01:13,250
is that reality that
everyone has to work with.

22
00:01:13,250 --> 00:01:18,425
Now, At-Least-Once message delivery

23
00:01:18,425 --> 00:01:21,965
can be challenging from the
system builder perspective

24
00:01:21,965 --> 00:01:25,910
especially when writing logic
for handling messages that

25
00:01:25,910 --> 00:01:28,340
arrive or are being pushed from

26
00:01:28,340 --> 00:01:31,370
the messaging infrastructure
to the receivers.

27
00:01:31,370 --> 00:01:34,730
Let us go through our
hypothetical situation just to

28
00:01:34,730 --> 00:01:38,865
see what kind of situations
I'm talking about.

29
00:01:38,865 --> 00:01:41,720
Let's assume that there
are some messages

30
00:01:41,720 --> 00:01:46,190
in flight one, two and
three and that is the order

31
00:01:46,190 --> 00:01:47,570
in which they are stored in the

32
00:01:47,570 --> 00:01:50,950
messaging infrastructure
or in the queue.

33
00:01:50,950 --> 00:01:53,070
Because of the way,

34
00:01:53,070 --> 00:01:55,095
how the At-Least-Once
message delivery

35
00:01:55,095 --> 00:01:58,145
works is perfectly possible for

36
00:01:58,145 --> 00:02:00,500
a message to be read, delivered, or

37
00:02:00,500 --> 00:02:04,380
delivered multiple
times to the receiver.

38
00:02:06,170 --> 00:02:09,500
This is usually a
simple consequence of

39
00:02:09,500 --> 00:02:12,710
the fact that messaging
infrastructure will

40
00:02:12,710 --> 00:02:15,540
not consider a message
to be delivered

41
00:02:15,540 --> 00:02:18,600
until it gets an acknowledgment
from the receiver.

42
00:02:18,600 --> 00:02:21,410
Whenever the acknowledgment
does not arrive

43
00:02:21,410 --> 00:02:24,215
or does not arrive in a
given period of time,

44
00:02:24,215 --> 00:02:26,475
the message will reappear
in the queue and

45
00:02:26,475 --> 00:02:29,365
will be processed or
delivered once again.

46
00:02:29,365 --> 00:02:31,610
From the processing perspective,

47
00:02:31,610 --> 00:02:34,100
what is possible is
that some messages

48
00:02:34,100 --> 00:02:37,570
get duplicated in our
situation, it's message two.

49
00:02:37,570 --> 00:02:42,470
What's also possible, is
reordering and this usually

50
00:02:42,470 --> 00:02:46,115
happens when a messaging
infrastructure has

51
00:02:46,115 --> 00:02:50,925
some delivery or processing timeout.

52
00:02:50,925 --> 00:02:54,065
Whenever our messages
beat out by the receiver,

53
00:02:54,065 --> 00:02:58,595
the timeout starts and if
it reaches a threshold,

54
00:02:58,595 --> 00:03:01,340
that message will
reappear in the queue

55
00:03:01,340 --> 00:03:04,625
one more time and
potentially in the meantime,

56
00:03:04,625 --> 00:03:07,210
other messages might be
successfully processed,

57
00:03:07,210 --> 00:03:10,610
so the effective ordering of
processing that the receiver

58
00:03:10,610 --> 00:03:14,890
had will be different than
the original ordering.

59
00:03:14,890 --> 00:03:18,440
Finally, both of those
situations can happen

60
00:03:18,440 --> 00:03:21,590
independently at the same time,

61
00:03:21,590 --> 00:03:27,710
so via BN situation is
perfectly fine for the messages

62
00:03:27,710 --> 00:03:31,250
to reorder and it's perfectly
fine for messages to

63
00:03:31,250 --> 00:03:34,130
be duplicated and now that can be

64
00:03:34,130 --> 00:03:38,010
a challenge from the system
builder perspective.

65
00:03:38,240 --> 00:03:41,825
A very common and
standard solution to

66
00:03:41,825 --> 00:03:45,220
that situation is item potency,

67
00:03:45,220 --> 00:03:47,450
which basically means that whenever

68
00:03:47,450 --> 00:03:49,670
a duplicate of the
message is received,

69
00:03:49,670 --> 00:03:52,790
the end result of processing
that message should be the

70
00:03:52,790 --> 00:03:57,335
same as for the first
processing of that message.

71
00:03:57,335 --> 00:04:01,565
Very often this is understood
or this boils down

72
00:04:01,565 --> 00:04:05,905
to the message processing
logic being deterministic.

73
00:04:05,905 --> 00:04:12,080
This is often the case
in remote procedure

74
00:04:12,080 --> 00:04:14,690
call type of systems when there is

75
00:04:14,690 --> 00:04:17,690
only a single message in flight

76
00:04:17,690 --> 00:04:20,780
or a single call in flight
and that call will be

77
00:04:20,780 --> 00:04:23,450
repeated over and over again until

78
00:04:23,450 --> 00:04:25,580
a successful acknowledgment comes in

79
00:04:25,580 --> 00:04:28,785
from the remoting location.

80
00:04:28,785 --> 00:04:32,120
Now, that might be not
enough if there are

81
00:04:32,120 --> 00:04:35,420
more than one messaging
flight and this is

82
00:04:35,420 --> 00:04:37,460
very often the case
when we are building

83
00:04:37,460 --> 00:04:42,090
and modeling business processes.

84
00:04:42,090 --> 00:04:44,660
This is a very simple example which

85
00:04:44,660 --> 00:04:47,395
shows worldwide what
might be the problem.

86
00:04:47,395 --> 00:04:50,630
What we are looking
at here is a simple

87
00:04:50,630 --> 00:04:54,010
sequence diagram
showing the behavior in

88
00:04:54,010 --> 00:04:57,220
a system which models

89
00:04:57,220 --> 00:05:01,025
or implements shooting range game.

90
00:05:01,025 --> 00:05:03,110
There is a handlers shooting range

91
00:05:03,110 --> 00:05:04,850
which is responsible for holding,

92
00:05:04,850 --> 00:05:05,990
a piece of spade which is

93
00:05:05,990 --> 00:05:12,600
the current target
position of the target.

94
00:05:13,460 --> 00:05:18,450
The players can attempt to fire

95
00:05:18,450 --> 00:05:20,970
at the given position
and the shooting range

96
00:05:20,970 --> 00:05:23,960
is responsible for
sending back a response

97
00:05:23,960 --> 00:05:28,270
saying whether the attempt
was successful or fail.

98
00:05:28,270 --> 00:05:31,550
So what we can see
here is a situation

99
00:05:31,550 --> 00:05:33,260
in which we have a shooting range

100
00:05:33,260 --> 00:05:36,650
and the shooting range is
set to target position 42

101
00:05:36,650 --> 00:05:41,500
and then we have a fire
that fires at position 42.

102
00:05:41,500 --> 00:05:46,170
The response from that call is a hit.

103
00:05:46,170 --> 00:05:49,670
After that, the target is
being moved to position one

104
00:05:49,670 --> 00:05:52,910
and finally, an interesting
situation happens

105
00:05:52,910 --> 00:05:55,810
because we get a duplicate
and duplicated fire

106
00:05:55,810 --> 00:05:58,525
at message which was
already processed once,

107
00:05:58,525 --> 00:06:01,205
but then it's being
reprocessed one more time.

108
00:06:01,205 --> 00:06:05,450
Logic is deterministic but the
response that we've had is a miss.

109
00:06:05,450 --> 00:06:08,090
What happened in that
scenario is that we

110
00:06:08,090 --> 00:06:13,790
have a single logical message
being delivered twice

111
00:06:13,790 --> 00:06:16,370
and each of the processing of this

112
00:06:16,370 --> 00:06:22,220
message results in a
contradictory result,

113
00:06:22,220 --> 00:06:25,970
contradictory side
effect, which is usually

114
00:06:25,970 --> 00:06:27,770
far from what we are expecting

115
00:06:27,770 --> 00:06:30,900
from the systems that we're building.

116
00:06:31,780 --> 00:06:36,320
The obvious problem that
we have here is that,

117
00:06:36,320 --> 00:06:38,840
even though the logical
is deterministic,

118
00:06:38,840 --> 00:06:42,440
the state changed
significantly in between

119
00:06:42,440 --> 00:06:44,300
the first processing
and the duplicate

120
00:06:44,300 --> 00:06:46,945
which resulted in that behavior.

121
00:06:46,945 --> 00:06:51,590
What it shows is that the
deterministic logic is not enough,

122
00:06:51,590 --> 00:06:55,230
we also need to think
about the state.

123
00:06:55,230 --> 00:06:58,295
Either we need to
capture the side effects

124
00:06:58,295 --> 00:07:00,980
and basically we'll do
them whenever a duplicate

125
00:07:00,980 --> 00:07:03,680
comes in or we need to be

126
00:07:03,680 --> 00:07:06,920
able to get a handle at
the historical state,

127
00:07:06,920 --> 00:07:10,890
as it was when the
message first arrived.

128
00:07:12,450 --> 00:07:16,180
What we are assuming
in the library is that

129
00:07:16,180 --> 00:07:18,370
the distributed system
that we are talking

130
00:07:18,370 --> 00:07:20,980
about is field of bunch of handlers,

131
00:07:20,980 --> 00:07:25,645
each handler has a separate
dedicated piece of state.

132
00:07:25,645 --> 00:07:29,080
That handler is sole holder and

133
00:07:29,080 --> 00:07:31,750
the only possibility to communicate

134
00:07:31,750 --> 00:07:35,140
between the handlers is
via sending messages.

135
00:07:35,140 --> 00:07:37,705
What each handler is doing is that it

136
00:07:37,705 --> 00:07:40,285
picks up the message
from the input queue,

137
00:07:40,285 --> 00:07:42,040
executes the business logic,

138
00:07:42,040 --> 00:07:43,930
which results in state update

139
00:07:43,930 --> 00:07:47,090
and puts a message
in the output queue.

140
00:07:47,400 --> 00:07:56,305
The main idea or goal of the
library was, as stated here,

141
00:07:56,305 --> 00:07:59,245
to make sure that the
observable side effects

142
00:07:59,245 --> 00:08:01,705
correspond to some
serial execution of

143
00:08:01,705 --> 00:08:06,130
input messages with
atomic commit guarantee

144
00:08:06,130 --> 00:08:08,545
between the business storage
and the output queue.

145
00:08:08,545 --> 00:08:10,810
Basically what we
wanted to make sure is

146
00:08:10,810 --> 00:08:13,930
that whenever a duplicate comes in,

147
00:08:13,930 --> 00:08:17,170
the side effects that
we'll produce on the state

148
00:08:17,170 --> 00:08:20,875
and in terms of sending
of the output messages,

149
00:08:20,875 --> 00:08:29,380
is the same as if there was
a single processing and

150
00:08:29,380 --> 00:08:34,100
the ordering is corresponding to
some possible serial execution.

151
00:08:34,170 --> 00:08:40,730
The implementation of the library
is based on the Azure Stack.

152
00:08:41,280 --> 00:08:44,935
If they run in Azure Function host,

153
00:08:44,935 --> 00:08:48,310
which basically means now
the whole will be messaging

154
00:08:48,310 --> 00:08:50,845
infrastructure available
for Azure functions

155
00:08:50,845 --> 00:08:52,885
is available there as well.

156
00:08:52,885 --> 00:08:56,635
The state is stored
in Azure Cosmos DB.

157
00:08:56,635 --> 00:09:01,150
Specifically, what do
we rely on in terms of

158
00:09:01,150 --> 00:09:03,520
Azure Cosmos DB is that it

159
00:09:03,520 --> 00:09:06,130
provides optimistic
concurrency control,

160
00:09:06,130 --> 00:09:08,650
which is pretty important
or the key part

161
00:09:08,650 --> 00:09:11,710
of the approach that we are taking.

162
00:09:11,710 --> 00:09:18,250
We store our infrastructural
library-driven data

163
00:09:18,250 --> 00:09:20,860
in a separate logical partition,

164
00:09:20,860 --> 00:09:23,470
which also comes with a
consequence which is that

165
00:09:23,470 --> 00:09:25,810
there is no atomic
transactions between

166
00:09:25,810 --> 00:09:28,735
the business state and the
infrastructural state.

167
00:09:28,735 --> 00:09:31,930
We assume that the users are using

168
00:09:31,930 --> 00:09:35,830
Azure Cosmos DB with
session consistency model.

169
00:09:35,830 --> 00:09:39,730
Finally, high performance
scenarios are out of scope.

170
00:09:39,730 --> 00:09:41,845
It's not something
that we focused on.

171
00:09:41,845 --> 00:09:46,225
Good enough performance, whatever
that means, was our goal.

172
00:09:46,225 --> 00:09:50,590
Now, the general idea of the
algorithm is as follows.

173
00:09:50,590 --> 00:09:52,240
Whenever a message comes in,

174
00:09:52,240 --> 00:09:54,385
we check whether we
already processed it.

175
00:09:54,385 --> 00:09:56,305
If not, we load the state.

176
00:09:56,305 --> 00:09:58,270
We run the business logic

177
00:09:58,270 --> 00:10:01,675
and we do not send out
the output messages,

178
00:10:01,675 --> 00:10:04,735
but we capture them and
store in this form.

179
00:10:04,735 --> 00:10:08,515
Now, we attach a
single correlation ID

180
00:10:08,515 --> 00:10:10,945
with the business piece of data,

181
00:10:10,945 --> 00:10:17,050
and then we commit it to the
business logic or partition.

182
00:10:17,050 --> 00:10:20,950
This is where we do the
optimistic concurrency trade.

183
00:10:20,950 --> 00:10:24,970
We do that only if
that entity did not

184
00:10:24,970 --> 00:10:29,500
change since we loaded
it from the store.

185
00:10:29,500 --> 00:10:31,270
If that is successful,

186
00:10:31,270 --> 00:10:36,970
this is basically the
point of no return.

187
00:10:36,970 --> 00:10:41,065
The atomic commit protocol,
once that is done,

188
00:10:41,065 --> 00:10:45,010
the both parts, which is
the business state which is

189
00:10:45,010 --> 00:10:47,515
already committed and
the output messages

190
00:10:47,515 --> 00:10:49,510
will be committed eventual.

191
00:10:49,510 --> 00:10:52,840
Now, an important thing here to

192
00:10:52,840 --> 00:10:56,200
notice is that during
some failure situations,

193
00:10:56,200 --> 00:10:58,810
what can happen is that
the output messages

194
00:10:58,810 --> 00:11:01,375
might have been sent
out multiple times.

195
00:11:01,375 --> 00:11:04,390
However, because the duplicates can

196
00:11:04,390 --> 00:11:09,145
happen with At-Least-Once
message delivery already,

197
00:11:09,145 --> 00:11:12,955
we can say that we are hiding
behind that failure mode.

198
00:11:12,955 --> 00:11:16,165
Basically, because the
duplicates can be already there,

199
00:11:16,165 --> 00:11:17,650
we can produce duplicates because

200
00:11:17,650 --> 00:11:20,380
whoever is processing them should be

201
00:11:20,380 --> 00:11:24,350
and is expected to cope
with that situation.

202
00:11:25,230 --> 00:11:29,335
That was the general idea

203
00:11:29,335 --> 00:11:32,140
about the problem and the solution

204
00:11:32,140 --> 00:11:35,005
that we seek and wanted to implement.

205
00:11:35,005 --> 00:11:42,100
Now I wanted to talk
about the TLA+ Model

206
00:11:42,100 --> 00:11:47,540
and how we use that model
to validate the safety.

207
00:11:48,210 --> 00:11:52,000
The goals of the model
was to make sure that

208
00:11:52,000 --> 00:11:56,050
our algorithm works not only in

209
00:11:56,050 --> 00:11:59,990
the happy path but also on
in the things to implement.

210
00:12:00,360 --> 00:12:04,195
As we all know, it's very easy to

211
00:12:04,195 --> 00:12:07,735
show that the algorithm
sometimes works,

212
00:12:07,735 --> 00:12:10,390
but it's actually
very tricky to figure

213
00:12:10,390 --> 00:12:13,705
out all the further case scenario.

214
00:12:13,705 --> 00:12:19,810
We wanted to make sure that
we have some validation,

215
00:12:19,810 --> 00:12:24,415
some model checking
down to make sure that

216
00:12:24,415 --> 00:12:28,180
we actually are
convinced ourselves that

217
00:12:28,180 --> 00:12:31,480
the algorithm does what
we wanted it to do.

218
00:12:31,480 --> 00:12:35,885
Specifically what we wanted to
check was the safety properties.

219
00:12:35,885 --> 00:12:41,880
Those safety properties were
described as atomic commit.

220
00:12:41,880 --> 00:12:43,710
Basically we wanted to make sure

221
00:12:43,710 --> 00:12:55,450
that when we commit to the
business logic store we

222
00:12:55,450 --> 00:12:58,840
eventually push out the
changes to the output queue,

223
00:12:58,840 --> 00:13:04,090
and also made changes
to be the input queue,

224
00:13:04,090 --> 00:13:06,610
and also we wanted to make sure that

225
00:13:06,610 --> 00:13:13,045
the side effects are
consistent in a way that,

226
00:13:13,045 --> 00:13:18,820
even for duplicates,
there is only one change

227
00:13:18,820 --> 00:13:23,740
in the business state
and the output messages

228
00:13:23,740 --> 00:13:27,280
and the business state
change were made over

229
00:13:27,280 --> 00:13:30,880
the same version of the
business state, which I think

230
00:13:30,880 --> 00:13:35,210
will be a bit clearer when
we get to the formulas.

231
00:13:35,250 --> 00:13:39,130
Finally, we wanted to
make sure that everything

232
00:13:39,130 --> 00:13:43,420
works as expected even
without atomic transaction

233
00:13:43,420 --> 00:13:45,490
between different logical partitions.

234
00:13:45,490 --> 00:13:46,780
Because as I stated,

235
00:13:46,780 --> 00:13:49,480
we start the business logic and

236
00:13:49,480 --> 00:13:55,130
the infrastructural data
in a separate partitions.

237
00:13:56,130 --> 00:14:00,760
Obviously, there were most
of the things that were

238
00:14:00,760 --> 00:14:03,415
in the implementation did
not make it to the model,

239
00:14:03,415 --> 00:14:05,200
and specifically we did not model

240
00:14:05,200 --> 00:14:07,419
the cleanup logic
which is responsible

241
00:14:07,419 --> 00:14:09,970
for clearing the infrastructural data

242
00:14:09,970 --> 00:14:11,920
that is stored by the library.

243
00:14:11,920 --> 00:14:14,875
We did not model the model
resistance that are used

244
00:14:14,875 --> 00:14:18,249
to lower the chance of
concurrency and collisions,

245
00:14:18,249 --> 00:14:21,550
optimistic concurrency
check failures basically.

246
00:14:21,550 --> 00:14:24,320
We also did not model
the exponential backlogs

247
00:14:24,320 --> 00:14:26,360
and processing retries
which are there

248
00:14:26,360 --> 00:14:28,685
to make the library a bit

249
00:14:28,685 --> 00:14:32,985
more performing in the
failure scenarios.

250
00:14:32,985 --> 00:14:36,650
Let's look at the specification.

251
00:14:36,650 --> 00:14:39,440
This is the main part of

252
00:14:39,440 --> 00:14:43,440
the process that models
the single handler.

253
00:14:43,440 --> 00:14:47,900
Most of the labels are folded

254
00:14:47,900 --> 00:14:52,250
and that shows the main
structural specification.

255
00:14:52,250 --> 00:14:55,130
What we have here is
the infinite while loop

256
00:14:55,130 --> 00:14:58,060
and in that loop the
process is trying to

257
00:14:58,060 --> 00:15:00,565
input message forms the input queue.

258
00:15:00,565 --> 00:15:06,890
Then it checks by looking at the
transaction field on the state,

259
00:15:06,890 --> 00:15:12,890
whether it needs to redo some
of the transactional steps to

260
00:15:12,890 --> 00:15:16,220
basically push out the changes that

261
00:15:16,220 --> 00:15:22,345
are still there and not
pushed to the output queue.

262
00:15:22,345 --> 00:15:24,470
After doing that, we check

263
00:15:24,470 --> 00:15:26,530
whether we already
processed the message.

264
00:15:26,530 --> 00:15:28,685
If we did not process the message,

265
00:15:28,685 --> 00:15:31,250
we model the business logic execution

266
00:15:31,250 --> 00:15:34,025
capturing the side
effects storing them,

267
00:15:34,025 --> 00:15:36,305
the optimistic concurrency commits

268
00:15:36,305 --> 00:15:38,270
to the business store
and then pushing

269
00:15:38,270 --> 00:15:44,940
out the transaction
until it's finished.

270
00:15:46,020 --> 00:15:49,900
The main reason why we
use plus calc was that

271
00:15:49,900 --> 00:15:55,220
this syntax was more familiar
for us and secondly,

272
00:15:55,220 --> 00:15:57,860
that it actually
corresponds pretty well

273
00:15:57,860 --> 00:16:02,100
when it comes to the implementation
that was done in C#.

274
00:16:02,100 --> 00:16:05,600
We were able to look at
the code side-by-side

275
00:16:05,600 --> 00:16:08,590
and the specification corresponded

276
00:16:08,590 --> 00:16:10,940
pretty well with what we could see on

277
00:16:10,940 --> 00:16:15,815
the screen looking at
the real implementation.

278
00:16:15,815 --> 00:16:19,200
When it comes to state modeling,

279
00:16:19,200 --> 00:16:22,205
there wasn't that much happening.

280
00:16:22,205 --> 00:16:24,590
Two important notices is

281
00:16:24,590 --> 00:16:27,080
that the input queue
was modeled as a set,

282
00:16:27,080 --> 00:16:34,030
but that set had a
records with two fields,

283
00:16:34,030 --> 00:16:36,280
One which was the logical message ID

284
00:16:36,280 --> 00:16:37,960
and duplicate message ID.

285
00:16:37,960 --> 00:16:41,780
The way how we model
the duplicates was

286
00:16:41,780 --> 00:16:46,705
basically by starting off with
duplicates already put queue.

287
00:16:46,705 --> 00:16:49,665
When it comes to the
business stack modeling,

288
00:16:49,665 --> 00:16:53,960
we were a starting a sequence
which was a history of

289
00:16:53,960 --> 00:16:55,940
snapshots of the states as they

290
00:16:55,940 --> 00:16:58,980
were throughout the
execution of the algorithm,

291
00:16:58,980 --> 00:17:03,270
and that was pretty useful
when writing the invariants.

292
00:17:03,600 --> 00:17:07,320
We also modeled the diversion

293
00:17:07,320 --> 00:17:09,600
for the optimistic concurrency check.

294
00:17:09,600 --> 00:17:12,150
All the other bits were used to model

295
00:17:12,150 --> 00:17:14,699
the infrastructural storage

296
00:17:14,699 --> 00:17:19,330
and the output to which was also set.

297
00:17:19,350 --> 00:17:22,360
Another interesting thing
is that we did not use is

298
00:17:22,360 --> 00:17:26,185
a default termination formula.

299
00:17:26,185 --> 00:17:28,435
What we said was that

300
00:17:28,435 --> 00:17:32,650
the processes were running
in the infinite loops.

301
00:17:32,650 --> 00:17:34,960
And the power termination of

302
00:17:34,960 --> 00:17:37,750
the algorithm was a
situation in which

303
00:17:37,750 --> 00:17:40,930
all the processes were in the lock-in

304
00:17:40,930 --> 00:17:44,230
message and the input
queue was empty.

305
00:17:44,230 --> 00:17:46,660
Basically, the
termination was modeling

306
00:17:46,660 --> 00:17:49,600
the situation in which
the input cue is trained

307
00:17:49,600 --> 00:17:54,715
and none of the handler is
doing anything in specific.

308
00:17:54,715 --> 00:17:59,000
But waiting for anything
to appear in that input.

309
00:17:59,400 --> 00:18:03,100
Finally, the safety invariants.

310
00:18:03,100 --> 00:18:06,025
This is what we ended up
when it comes to safety.

311
00:18:06,025 --> 00:18:13,150
That basically models or
expresses the safety property,

312
00:18:13,150 --> 00:18:16,765
which makes sure that
for any logical message

313
00:18:16,765 --> 00:18:20,520
there is AtMostOneStateChange
happening to

314
00:18:20,520 --> 00:18:23,910
the business store
and most once message

315
00:18:23,910 --> 00:18:28,635
being produced and sent
to the output cue.

316
00:18:28,635 --> 00:18:31,780
Those are the first two parts,

317
00:18:32,040 --> 00:18:35,875
one on the top and one in the middle.

318
00:18:35,875 --> 00:18:39,670
Finally, the one which is
called consistent state

319
00:18:39,670 --> 00:18:44,335
and output says that if the
message is fully processed,

320
00:18:44,335 --> 00:18:51,670
the version on which the
diversion for the business state

321
00:18:51,670 --> 00:18:54,550
and the output message
should be the same.

322
00:18:54,550 --> 00:19:00,459
Basically we make sure that
those two pieces of side-effects

323
00:19:00,459 --> 00:19:04,435
are consistent in a way
that they operated and were

324
00:19:04,435 --> 00:19:06,910
used by a business logic operating on

325
00:19:06,910 --> 00:19:09,830
the same version of the business set.

326
00:19:11,310 --> 00:19:16,225
Another interesting bit is
how we modeled failures.

327
00:19:16,225 --> 00:19:19,225
What we ended up was a combat role,

328
00:19:19,225 --> 00:19:21,790
but basically was a
single instruction

329
00:19:21,790 --> 00:19:24,160
which was filled two main loop,

330
00:19:24,160 --> 00:19:27,760
which basically meant that model

331
00:19:27,760 --> 00:19:29,980
the situation in which
either the process

332
00:19:29,980 --> 00:19:32,500
restarts or some exception is being

333
00:19:32,500 --> 00:19:36,310
thrown and we basically
should have the top of

334
00:19:36,310 --> 00:19:41,605
the stack and go back looking
in that infinite fail loop.

335
00:19:41,605 --> 00:19:44,755
The way that we used
it was that whenever

336
00:19:44,755 --> 00:19:48,235
a failure was possible or we
wanted to model at failure,

337
00:19:48,235 --> 00:19:51,820
we would do either or expression.

338
00:19:51,820 --> 00:19:55,750
For instance, in the commit
state macro we can see that if

339
00:19:55,750 --> 00:20:01,795
the concurrent optimistic
concurrency check results in intro,

340
00:20:01,795 --> 00:20:05,875
we either commit right
to the store or we fail.

341
00:20:05,875 --> 00:20:07,900
That was a general approach

342
00:20:07,900 --> 00:20:11,665
to monitoring those
kinds of failures.

343
00:20:11,665 --> 00:20:15,790
Okay, so now a few words
about the results.

344
00:20:15,790 --> 00:20:18,895
There were expected and
unexpected results.

345
00:20:18,895 --> 00:20:20,905
When it comes to the
expected results,

346
00:20:20,905 --> 00:20:23,020
those were obviously bugs to be

347
00:20:23,020 --> 00:20:25,690
found in various parts
of the algorithm.

348
00:20:25,690 --> 00:20:29,380
Specifically in the
post-failure commits and also

349
00:20:29,380 --> 00:20:33,700
in the logic that is
signs, the transaction id.

350
00:20:33,700 --> 00:20:36,985
As I said, that was something that we

351
00:20:36,985 --> 00:20:40,135
wanted to get from
the model check-ins.

352
00:20:40,135 --> 00:20:42,790
So that's why I called it expected.

353
00:20:42,790 --> 00:20:46,270
I don't want to deprecate
the value of that

354
00:20:46,270 --> 00:20:49,510
because that was
exactly what we want.

355
00:20:49,510 --> 00:20:52,810
However, there were
also unexpected bits.

356
00:20:52,810 --> 00:20:56,110
I would say that the
main unexpected bit was

357
00:20:56,110 --> 00:21:00,295
the fact that when we started
off writing the model,

358
00:21:00,295 --> 00:21:02,530
we realized that we don't really

359
00:21:02,530 --> 00:21:06,190
understand what is it that
we are actually doing.

360
00:21:06,190 --> 00:21:11,020
So it actually forced us to
distill the algorithm to

361
00:21:11,020 --> 00:21:13,930
its essentials and to make sure

362
00:21:13,930 --> 00:21:15,850
whether the assumptions that

363
00:21:15,850 --> 00:21:18,340
we are making which are
pretty important and

364
00:21:18,340 --> 00:21:24,745
also what are the most important
steps in that algorithm.

365
00:21:24,745 --> 00:21:28,690
As a follow-up of doubt,
we realized that there are

366
00:21:28,690 --> 00:21:31,960
actually some extensions
that we could also use.

367
00:21:31,960 --> 00:21:34,660
For instance, I already
said that one of

368
00:21:34,660 --> 00:21:37,860
the main assumptions
that we are making is

369
00:21:37,860 --> 00:21:40,140
that we don't need any kind of

370
00:21:40,140 --> 00:21:43,845
a preparer phase for
the output queue,

371
00:21:43,845 --> 00:21:46,565
because we can always commit
to that output queue.

372
00:21:46,565 --> 00:21:49,285
We can always send a message there.

373
00:21:49,285 --> 00:21:52,060
The duplicates can
happen there because

374
00:21:52,060 --> 00:21:54,100
there's already that failure mode,

375
00:21:54,100 --> 00:21:56,230
which we need to cope with.

376
00:21:56,230 --> 00:21:58,840
But what we realized
is that there might be

377
00:21:58,840 --> 00:22:02,485
some other resources that
are always able to commit.

378
00:22:02,485 --> 00:22:04,540
So for instance, if we have

379
00:22:04,540 --> 00:22:09,730
some write once store,
so for instance,

380
00:22:09,730 --> 00:22:15,625
a blobs store with a
[inaudible] blobs that can be

381
00:22:15,625 --> 00:22:17,530
written only once that could be

382
00:22:17,530 --> 00:22:19,645
an expression to that
algorithm as well.

383
00:22:19,645 --> 00:22:23,210
That could be a supported side effect

384
00:22:24,080 --> 00:22:28,240
that we could apply the same way

385
00:22:28,240 --> 00:22:30,310
as we are sending out the messages.

386
00:22:30,310 --> 00:22:35,440
Finally, what we realized
is that we did not know

387
00:22:35,440 --> 00:22:37,750
the cost of [inaudible]
well enough and all

388
00:22:37,750 --> 00:22:40,600
the memory model implications.

389
00:22:40,600 --> 00:22:42,880
So we had to go back
to the drawing board,

390
00:22:42,880 --> 00:22:45,400
and understand what
doesn't mean that we are

391
00:22:45,400 --> 00:22:52,120
running in a session memory model.

392
00:22:52,120 --> 00:22:55,960
Also that there are no
atomic transactions

393
00:22:55,960 --> 00:22:58,045
between the logical preparations.

394
00:22:58,045 --> 00:23:02,890
Finally, another unexpected thing

395
00:23:02,890 --> 00:23:07,240
was that in order to
create a meaningful model,

396
00:23:07,240 --> 00:23:08,980
we had to understand what are

397
00:23:08,980 --> 00:23:12,070
the concurrency characteristics
of the algorithm.

398
00:23:12,070 --> 00:23:13,630
What are the parts that can happen

399
00:23:13,630 --> 00:23:16,165
independently and concurrently?

400
00:23:16,165 --> 00:23:18,820
But even more importantly,
we had to learn

401
00:23:18,820 --> 00:23:21,910
what are the failure
modes or what are

402
00:23:21,910 --> 00:23:24,070
the failure situations
that may happen in

403
00:23:24,070 --> 00:23:26,500
different parts of
the system as well.

404
00:23:26,500 --> 00:23:29,980
So it actually forced us to
understand the technology

405
00:23:29,980 --> 00:23:35,840
that we're using are better
than we knew it beforehand.

406
00:23:36,870 --> 00:23:42,070
So we did the validation,
we did the model checking.

407
00:23:42,070 --> 00:23:48,700
An interesting note here is
that whenever there was a bug

408
00:23:48,700 --> 00:23:51,895
in a specific location
or in the algorithm,

409
00:23:51,895 --> 00:23:56,425
we were able to find it
on very small models.

410
00:23:56,425 --> 00:23:57,880
And by very small models,

411
00:23:57,880 --> 00:24:01,270
I mean a single message
with a single duplicate

412
00:24:01,270 --> 00:24:06,970
and a single process. Almost that.

413
00:24:06,970 --> 00:24:12,370
In that sense, the running
time of model checking

414
00:24:12,370 --> 00:24:14,770
was never a problem for
us because whenever

415
00:24:14,770 --> 00:24:19,450
there were bugs, we would
get a feedback immediately.

416
00:24:19,450 --> 00:24:22,270
Now that being said,
after we settled on

417
00:24:22,270 --> 00:24:27,070
an algorithm and we checked
it with a smaller model.

418
00:24:27,070 --> 00:24:30,235
We also went with a bigger one.

419
00:24:30,235 --> 00:24:32,665
The bigger that we did was

420
00:24:32,665 --> 00:24:37,120
two processes with six
messages in the input cue,

421
00:24:37,120 --> 00:24:39,430
those are numbers in terms of

422
00:24:39,430 --> 00:24:42,160
number of space that we ended up with

423
00:24:42,160 --> 00:24:46,360
and that was running roughly
two-and-a-half hours

424
00:24:46,360 --> 00:24:50,200
on a laptop, my personal
laptop, that I would consider

425
00:24:50,200 --> 00:24:55,490
a high-end laptop based
on current standards.

426
00:24:56,220 --> 00:25:01,210
That's it that I prepared for
you in this presentation.

427
00:25:01,210 --> 00:25:04,330
If anyone wants to know more about

428
00:25:04,330 --> 00:25:07,855
the work that we did around
TLA class and the library.

429
00:25:07,855 --> 00:25:11,620
We are posting about that
at exactly-once.github.io.

430
00:25:11,620 --> 00:25:15,250
I'm also available on
Twitter at Masternak.

431
00:25:15,250 --> 00:25:18,625
If anyone wants to chat,
please reach out to me.

432
00:25:18,625 --> 00:25:23,120
I'd really like to share
ideas. Thank you very much.