E3: Nora
Nora injects failure to build reliability - and then went on to build a startup showing companies how to uncomplicate their own systems
The Power Hackers series continues with Nora Jones, an engineer/entrepreneur from the world of chaos, describing her journey from making developers more productive to intentionally injecting faults and failures and studying systems effects to make systems more reliable and even engaging clients on improving the reliability of their own complex applications!
Transcript:
all right hello everybody how are you
doing tonight
we are coming live in the power hackers
series and i have the illustrious
dr jones actually she she hasn't done
her phd yet
but i'm just speaking temporarily in the
future she's going to be doing her phd
and i really want to be able to call her
dr jones so
dr jones are you here with us tonight i
am
that is fantastic i am going to
cut ourselves over to the power hackers
series because
as those of you who've been here before
know this
is about some pretty awesome people that
i have met in my
journeys that they know the code and
in the case of dr jones here they
know a lot about chaos and so we're
going to talk a bit about chaos and how
things
can break massively so there's going to
be
a ton of entropy for those of you who
are upset that i did not
actually finish uh rolling my normal
amount of rolling of d20s in the intro
don't worry
all of the entropy will be coming from
ms jones
alright so let's let's welcome her
to the scene hello nora
okay so here we are we got that up on
the screen we got some people show it up
no no no lord azazel she she she
doesn't she's never told me she had a
phd i just heard her name and i heard dr
jones
in my head and i said it to her so
she always tell me stop that i think
you've gone up to a master's though
already right
i'm in i'm in my graduate program right
now so not quite yet
like a like a phd graduate okay
okay so it's still too early for that
all right well
i'm still i'm hoping everybody everybody
in chat you know you can
you can you can all hope with me so
we're gonna we're gonna hope that we can
someday call her dr jones
um ms jones you have a welcome here from
the audience lord of zazzles welcoming
you to
to the channel so and they have some
pretty cool emotes as well
so let's let's start off with you are
currently on location you're traveling
across the country
you're on a mission of mercy you are
just
heading what what's going on where
you're starting up a company and that
means a lot of movement right
yeah yeah so um separate separate from
that
recently moved out of the bay area so
traveling down to
uh to colorado for the next couple
months oh very nice
yeah get away from the fires kind of a
good time to get out of dodge
yeah exactly just the orange orange sky
in sf
the other day kind of now that's right
you were up
in the city right you you went when i
met you you were living
mostly in san francisco and commuting an
obscene amount of time
south into the bay to netflix world
headquarters
and oh yeah that i mean how what is that
like every day i mean you were doing
like two hours of commute each way and
that's like four hours of commuting
it was a lot but netflix was a really
great place to work
i was not ready to live in south bay so
it was uh
oh and look at that last miles just
showed up with a raid
he has raided us with a party of 38
people welcome raiders
this is uh ms jones we're doing the
power hacker series
and good to see all of you here tonight
thank you last miles for the raid that's
much appreciated so i am not going to do
any guitar tonight they they
they're referencing an accidental guitar
guitar ring that happened on a
programming stream and
it's just this is not that kind of
stream
have you have you done karaoke on your
stream yet we're not gonna
talk about that do we have to talk i
mean we could
uh i i you must be mistaking me with
another guy right that's that's a
different guy who did the karaoke isn't
it
is that are you sure you have no it's
okay last miles actually just played a
clip of it in his channel so there
no one's gonna believe a word of what
i'm saying okay that's
hello fizzly twitching fizzly bit
twitching
visibly twitching i'm actually gonna
learn how to read um and
so welcome welcome and uh emacs oh he
played the vim clip okay i actually was
talking about playing vim until you die
that day so i was i do you have a
like you're usually in like you were
working in
f sharp for a while so you were used to
like
visual studio so it's kind of like
that's a bit of a push on the editor
front in the vim
emacs wars you're not going to come down
on the side right i'm not
no okay okay so i'm not entering that
battle
all right all right raiders everybody
else now now you know
dr jones is not gonna talk about hk
cupcake do not talk about that so
um you're actually aware of hk cupcake
and wuba dub
dub wub dub dub is alex uh you know
hacker from netflix so you've actually i
think met him before
um but uh they they both knew me in hong
kong which is why she's a hong kong
cupcake
um but anyway yeah last miles has
grabbed a drink
all right let's get on with the
interview no no singing no playing funky
music no editor songs we're not doing
any of that nonsense
guys i think i heard you sing sublime
and code and sublime though
i feel like some of these are accurate
wait isn't this your interview
let's let's she's giving all the details
away all right everybody don't don't ask
her any questions okay whatever you do
tonight don't ask her any questions
okay so don't yeah visuals okay so last
miles doesn't approve
visual studio i also i also don't
approve of visual studio
but you do approve of f sharp now when i
first saw
f sharp the first thing that hit me when
you were talking about it was you were a
fortran
programmer and i understand from that
day
that that was completely wrong and f
sharp is not a new version of fortran
that's been enhanced with a sharp by
microsoft so
what is f-sharp can you just is that a
common language and i just happen to be
in a
hole for 10 years and like i missed it
or
why do i want to program in that it's
not a common language
uh it was developed by microsoft it was
like their
their functional answer um to a .net
programming language
um so i was working at a company called
jet.com back in 2015.
um they had been around for a couple
years before i joined
but they weren't in the bay they were
not in the bay this was back when i was
living on the east coast
i was i was in new york at that time and
um
they were trying to build an e-commerce
website and they wanted their pricing
engine
to be really strong they wanted to do
they they wanted to do a lot of
computations um basically very quickly
they wanted to be able to price match
folks like walmart and amazon and then
go a little
bit lower and for that they had to be
very fast at that regard
but most of their early programmers had
only experience in c sharp and so rather
than using
ocaml they decided okay let's meet
them where they are today like they'll
know the library as well and so they
chose f sharp um and
yeah the organization grew a lot and so
every developer that came into the
organization had to
learn f sharp and it ended up being
really fun um
wow that's pretty cool i mean i don't
usually describe programming languages
as fun
and i there don't listen to anything
that they're saying about regular
expressions in the chat that's just
inappropriate conversation and we all we
even have an
emote for that there's a custom-made
emoticon
that you can give expressions yeah you
can actually
drop the no regular expression i will
gift this to you whenever you're ready
to jump on twitch so
this is you will have some of the best
emotes that are available
on twitch oh and by the way dr selzam
thank you for the follow much
appreciated
um so that is kind of
but you didn't did you actually move out
to the west coast as part of jet
or was that after no that was after jet
so um while i was at jet
it was it was a very fast company in a
lot of different ways so i was the first
engineer
hired to um focus primarily on developer
productivity and developer efficiency
which in f sharp it's like i mean in the
organization you know
when you're having a programming problem
at any organization
you usually google it to an extent and f
sharp was so
unused that so many times you'd google
something and it would come up with zero
results
and you're just like i'm just gonna
reinvent the wheel here like i'm going
to try to figure this out
the way i would have figured this out in
this other language that i know and i'm
going to try to transfer it to this and
so
i was the first engineer hired on
developer productivity there which meant
like a lot of
reinventing the wheel there too and at
that time
we were having a ton of incidents like
wait wait don't run past that what what
are incidents tell me more about that
it's a good question because it means a
different thing in every organization
but basically
i would call an incident something that
you're talking about stuff breaking aren't you
yeah something that causes you to stop what you're doing
uh at any given time drop everything and
stop the bleeding it's like you know
someone coming into the emergency room
from a software perspective that that's a very
strong analogy
sorry no no i mean that's i'm sure
you've seen a lot of
emergency room code probably doing this
developer productivity stuff i mean
basically that's what it is someone's
coming into the emergency
room and it's not like you know you just
say oh like you know
take a seat like let's take a second if
something really bad is happening you
address it and you find
the doctor that is strong enough to
address it and that's exactly what
happens in the software world
too and at jet.com we were having
incidents
um a lot like every day so our marketing
team was killing it
you know they were there were a lot of
folks signing up for the website
we people knew about us everywhere but
engineering was really trying to keep up
which was leading to a lot of quick
feature releases
without um a lot of
testing and rigor and so um that's a lot
of code
going to production with no testing
yeah and it is a new language too and so
there were so many nuances to it i mean
it really took every engineer about
three months to ramp up at the company
because
most folks had not used f sharp before
at one point we were like
recruiting folks from australia because
there was a big f sharp community there
wow and
somehow they had convinced them to move
to hoboken new jersey
uh hoboken is a great place if anyone's
from there i did really like it
but it was it's very different than
australia right
so we were doing we were doing something
pretty cool with f sharp
you put the australians in hoboken we
did yeah oh
oh my how did that go
uh i mean they were doing what they
loved they were doing f sharp
but like yeah we were having incidents
and so um
we were it was a really interesting time
to be in like developer productivity and
developer efficiency there because part
of my role was not only to help
developers move
quickly but it was to help developers
move quickly
and safely and so
that was how i got into chaos
engineering but i had stayed in
i had stayed in new jersey while i was at
jet.com i started looking into
a bunch of different things to try to
help with jet's journey
and i made a chaos engineering tool
very simple tool and i made it in f
sharp it was just to
um basically kill nodes it was a version
of chaos monkey and so i'd kill nodes
so that people weren't treating nodes as
um
as pets but they were treating them
rather as cattle
as morbid as that is
sorry if anybody's a vegetarian here
i need i need a better metaphor for that
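The node-killing idea described above can be sketched in a few lines. This is only an illustration in Python, not the F# tool from jet.com: the `Cluster` class, the node names, and `kill_random_node` are all hypothetical, and a real tool would call a cloud or orchestrator API where this sketch just removes a name from a set.

```python
import random
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a pool of interchangeable nodes."""
    nodes: Set[str] = field(default_factory=set)

    def terminate(self, node: str) -> None:
        # A real tool would call the cloud or orchestrator API here.
        self.nodes.discard(node)


def kill_random_node(cluster: Cluster, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen node and return its name (None if empty)."""
    if not cluster.nodes:
        return None
    # Sort first so a seeded rng picks the same victim on every run.
    victim = rng.choice(sorted(cluster.nodes))
    cluster.terminate(victim)
    return victim


cluster = Cluster({"node-a", "node-b", "node-c"})
victim = kill_random_node(cluster, random.Random(42))
print(f"terminated {victim}, remaining: {sorted(cluster.nodes)}")
```

If the service keeps working after any one node disappears, nobody is treating a node as a pet. Starting with a non-production environment and a small pool is a reasonable precaution, given the QA outage mentioned next.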
yeah so i thought i was building a tool
uh it worked
pretty well uh like we found a lot of
vulnerabilities but i also
inadvertently took down qa for a week
with that tool
and slowly but surely i realized i
wasn't actually
just building a tool but it was
instituting a culture change
right to get people to think more about
the failure modes of their code and
their deployments
um when they were kind of in the
development phase and so it kind of
changed the mindset for people to like
not what happens if this fails but what
happens when this fails and so
i blogged a little bit about those
experiences um
and that basically brought you to casey's
attention i'm guessing
and that that's what brought me to
netflix and that's what brought me to
california
right okay so we're gonna we're
definitely gonna talk more about
that because that's that's really where
we're going next but before i
do transition off this f-sharp topic
i have to know you you say that you had
a bit of working experience with a bunch
of
australians in hoboken and i
i can't entirely let that go because how
would
an australian pronounce the word hoboken
i i i have no actually i can't remember
i have something of a comical everything
you should say hoboken no joking you
know
okay okay so no
like no no australian accent kind of
rubbed off you didn't pick that up
around the office
okay it wasn't no reverse cultural
immersion kind of happening
okay all right i was going to ask but
that's okay we're we're good without
accents um
so i don't fully understand this image
i'm looking at this is
f sharp and it it's like a happy pipe
going to i love and
maybe it's a sideways triangle what what
what what is what does this mean for our
viewers that may
not be f sharp people it's it's the
piping function in f sharp it's
probably the most
most used uh it's it's the most used in
f sharp and so the f sharp community got
very creative with their logo
and so so piping you're talking this is
like monad
like the way you can go oh yeah yeah
okay
so we're going to geek out i mentioned
um
the person that was mentioning
functional programming languages earlier
that lord azazel was very interested in
functional programming languages uh and
they brought that up several times to me
and i said okay we're gonna talk about
functional on another stream
but let's let's let's leave basically
it's nice like it you can pass
parameters to a function with with the
pipe and so
it's a cute way of saying okay i love
that that's that's very cool
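For viewers who have not seen the pipe before: in F#, `x |> f` means the same thing as `f x`, so a value flows left to right through a chain of functions. Python has no such operator, so the sketch below imitates it with a small helper. The `pipe` helper and the pricing functions are made up for illustration and are not from the interview.

```python
from functools import reduce
from typing import Any, Callable


def pipe(value: Any, *funcs: Callable[[Any], Any]) -> Any:
    """Feed value through funcs left to right, like value |> f |> g in F#."""
    return reduce(lambda acc, func: func(acc), funcs, value)


def undercut(price_cents: int) -> int:
    # Hypothetical pricing step: go one cent below the competitor.
    return price_cents - 1


def format_price(price_cents: int) -> str:
    # Render integer cents as a dollar string.
    return f"${price_cents // 100}.{price_cents % 100:02d}"


# Reads in the order the data moves: start at 1999, undercut, then format.
label = pipe(1999, undercut, format_price)
print(label)
```

The nested form `format_price(undercut(1999))` computes the same thing, but reads inside out, which is the readability problem the pipe is meant to solve.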
um i you know let's get less functional
for a moment
like i want to sort of tend into the
dysfunction world
for just a minute and we're going to get
into that journey to the west coast but
somewhere in this journey not only did
you know jet
jet got acquired the technology was well
loved
and um but you had moved on you you had
joined netflix you were
pretty much the last you were one of the
last people to dance at netflix
i was i i actually totally forgot about
that i'm not i'm gonna we're gonna bring
that up later
don't you worry i've got that right here
no no no no no no so we're not
no no i would not i would not i would
not but um i think i think that's you
have to pay that price by
going to work for the company if you
want access to those pictures um
so i'm in i'm in a full inflatable
dinosaur
costume i wasn't gonna tell them
all right okay okay so we're we're
moving back
now at that point you were in california
briefly and then
you were on the other side of the world
you you got so into this
dysfunction this sort of like stuff
failing kind of mindset that that this
became
sort of the whole world of chaos
engineering and
this was about not i mean we we used to
like sort of wear lab coats and have
like clipboards and say okay we're gonna
go do computer science with a text
editor you know whatever we would do
and big o notation and all that kind of
there's formulae you know we would
pluralize formulas formulae
that was fine back then but you you you
were watching
all of this breaking apart all around
you and you were going if i'm seeing a
system it's just
i don't know if it's going to break i
mean like you would see
like the chaos that was hiding behind
some of the code
and you thought like i want to be around
some people i want to do a degree in
that
that's that is that
the mindset or like how did how did you
make that cognitive leap because this is
a part that i never understood when i
met you in california you went from
i'm writing code and things are sort of
order and developer productivity to like
let's just look at failure let's let's
study failure let's
study systems breaking apart let's stop
let's kill the cattle
let's let's take the nodes
and just just embrace the fact that
software and
servers break and let's learn from it is
that
how did you make that transition to like
going to a course from
netflix yeah i mean that was about a
year
and a half into uh my journey at netflix
that i signed up for that program and i
think
even when i started netflix i you know
if someone told me i would
enroll in a graduate program midway
through i would have
laughed at them but sure enough i did
and it was it was a pretty interesting
story i mean like
um i mean we were building a really
interesting system at netflix we had
um at netflix we were working in java
right and so
most folks at netflix were working in
java and we had developed something
that took advantage of the fact that
most teams at netflix use the same set
of java libraries and we
created the ability to inject a random
exception
or inject time between function calls uh
that they could um directly use
in whatever services that they wanted to
and so um
every team automatically had this
ability to do
chaos engineering at netflix and because
they had the ability everyone suddenly
started doing it overnight right
sort of yeah okay
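The ability described above, injecting an exception or extra time around a function call through a shared library, can be sketched as a decorator. This is a Python stand-in under stated assumptions, not the Java tooling used at Netflix: `inject_chaos`, `InjectedFault`, and the parameter names are hypothetical.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_chaos(failure_rate=0.0, delay_seconds=0.0, rng=None):
    """Wrap a function so calls may be delayed or fail on purpose."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if delay_seconds > 0:
                time.sleep(delay_seconds)  # injected latency before the call
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_chaos(failure_rate=1.0)
def always_broken():
    return "ok"


@inject_chaos(failure_rate=0.0, delay_seconds=0.01)
def slow_but_fine():
    return "ok"
```

Calling `slow_but_fine()` returns normally after a short pause, while `always_broken()` raises `InjectedFault`. Because the wrapper lives in one shared place, any service that already uses the library gets the capability without writing new code, which is the property described in the interview.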
that's the whole story of course yeah so
we we spent
i mean it was a team like we were four
backend developers and i think all of us
were coding like
90 percent of our time and we had built this
really cool system
okay uh we we like wrote white papers on
it we were able to run it in production
where we would
so netflix had this key business metric
called stream starts per second
and that was what netflix used to
monitor
if things were going wait wait before we
go into the monitoring stream starts
per second for everybody who's here
that's every time you hit
play on netflix that's a stream start
so we we call that a stream start
doesn't matter if you're playing in the
middle of a show or at the beginning of
a show or the end whatever
if post play is not working if auto play
is happening all of that counts as a
stream start including
videos that show up when you don't want
to see them
on the home page right yep yeah all of
that counts as stream starts
so that was like the key business metric
at netflix which was great like
and and you know going back to what guy
said earlier what is an incident
i think at a lot of organizations if you
ask someone what is an incident you'd
get a lot of different answers from like
pr to legal to even different
engineering teams i think if you asked
anyone at netflix like what is an
incident they would probably mention
sps they would probably you know and so
that was a really cool thing about
netflix culture is that everyone was
aligned around
hey this metric matters is very
important
okay um if it was
in the cool thing about netflix is
netflix traffic is very predictable so
we have that benefit right and so
we could predict at certain times like
where the sps was supposed to be where
the stream starts per second where's the
speed
on a tuesday at like 11 a.m or uh after
a soccer game in europe like
um and so um if they got too high
or too low like that you know triggered
paging and a potential incident
department because they all want to be
paged
when we're on the weekend chilling out
yeah i think yeah exactly
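Because traffic is predictable, the paging rule described above amounts to comparing observed stream starts per second against an expected value for that time of day. A minimal sketch, assuming a simple symmetric tolerance band: the function name, the 10 percent default, and the sample numbers are assumptions, not Netflix's actual alerting logic.

```python
def should_page(observed_sps: float, expected_sps: float, tolerance: float = 0.1) -> bool:
    """Return True when observed SPS is outside the expected band.

    tolerance is a fraction of the expected value, so 0.1 allows 10 percent
    above or below before anyone is paged.
    """
    lower = expected_sps * (1 - tolerance)
    upper = expected_sps * (1 + tolerance)
    return not (lower <= observed_sps <= upper)


# Made-up numbers: the forecast for this time slot is 1000 starts per second.
for observed in (950, 850, 1200):
    print(observed, should_page(observed, expected_sps=1000))
```

Note that the check fires in both directions, since the interview says SPS being too high is treated as a signal as well as SPS being too low.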
exactly um and so uh yeah so we
developed a platform
to do uh chaos experiments in production
where you could basically create a
hypothesis like i am going to inject
failure here or i'm going to inject
latency here between these two calls so
i'll give an example like
um if you all uh think of the bookmark
service at netflix so
uh the bookmark too yeah basically
like if you're if you're watching
something on netflix and you leave
to go like make coffee or make dinner or
something and you leave your stream
at like you know a certain time like
33:22
it will remember where you left your
stream right and when it works correctly
right
yeah yeah because i i seem to have a bad
bookmark service because i've been
watching i've been going through
the next generation again and i seem to
like keep getting an episode that i
played through
so that failure would be like the
bookmark service not working
yeah right okay so they didn't do enough
chaos engineering in that team well so i
actually have a really interesting story
about this
oh yeah right and so um everyone that
did a chaos experiment at netflix like
would do calls between services and it
was always under the hypothesis
of if i inject failure here if i inject
latency here
i expect sps not to change like this
shouldn't impact our key business metric
and so the idea was like oh bookmark
shouldn't be a critical service like i
don't expect
sps to change right like if i can't
access my bookmark
it probably falls back to zero zero yeah
on the video right and so people should
still be able to play
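The fallback being described, where a failed bookmark lookup resumes from the start of the video, might look like the sketch below. Every name here is hypothetical; the interview only says the position falls back to zero so playback can still begin.

```python
from typing import Callable


class BookmarkServiceError(Exception):
    """Hypothetical error for a failed or timed-out bookmark lookup."""


def resume_position(fetch_bookmark: Callable[[str], int], title_id: str) -> int:
    """Return the saved offset in seconds, or 0 if the lookup fails."""
    try:
        return fetch_bookmark(title_id)
    except BookmarkServiceError:
        # Fallback: start from the beginning so playback still works.
        return 0


def healthy_lookup(title_id: str) -> int:
    return 33 * 60 + 22  # 33:22 into the stream, as in the example above


def failing_lookup(title_id: str) -> int:
    raise BookmarkServiceError("injected failure")


print(resume_position(healthy_lookup, "some-title"))
print(resume_position(failing_lookup, "some-title"))
```

The fallback keeps the play button working, which is why the service was assumed to be non-critical. What it cannot capture is the cost to the viewer of landing at the wrong place.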
so really interesting thing that is what
happened
but in this particular chaos experiment
we saw
uh sps have an issue
and so we were scratching we were
scratching our heads for so long
and the reason why was not because like
it brought netflix down
it was because when it goes to the
beginning of a stream what are you going
to do
you're going to spend a bunch of time
searching for where you were actually in
that stream
so it did cause sps to go down wow
yeah it was a very and it took us a
really long time to
figure out that was what was happening
and so even though like we had fallback
mechanisms and it wasn't necessarily
bringing down the site it was like
that failing actually really does impact
the user experience and it seems obvious
in hindsight but
you know it had us learn a lot about oh
we're actually causing we might be
causing a user an issue here
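The hypothesis behind each experiment, that injecting the fault should not move SPS, can be written down as an explicit check. The sketch below assumes the experiment compares SPS samples from an unaffected control group with samples from a group receiving the fault; that comparison, the 2 percent threshold, and all the numbers are illustrative assumptions rather than a description of the actual platform.

```python
from statistics import mean
from typing import Sequence


def hypothesis_holds(
    control_sps: Sequence[float],
    treatment_sps: Sequence[float],
    max_relative_change: float = 0.02,
) -> bool:
    """Check the hypothesis 'injecting this fault does not change SPS'."""
    baseline = mean(control_sps)
    observed = mean(treatment_sps)
    relative_change = abs(observed - baseline) / baseline
    return relative_change <= max_relative_change


# Made-up samples: the group with the bookmark fault starts fewer streams.
control = [100, 102, 98]
with_bookmark_fault = [90, 91, 89]
print(hypothesis_holds(control, with_bookmark_fault))
```

A failed check does not say why the metric moved. In the bookmark story the number dropped because people were scrubbing through the video to find their place, and that explanation had to come from investigation rather than from the metric itself.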
and i really like that word you use that
l word because
learn is something that i had never seen
outside of netflix
being referred to as something failing
so we would just
and and this happened when i first
joined netflix they would say you know i
saw like an entire site go down
like you know that was related to like a
launch and they went okay and then we
sort of like you know
worked our way through it and then later
on they said okay we're going to meet
later on for an incident review and
we'll take away some learnings from that
and i was like right that's that seems
very positive that's a very positive way
to look at
because usually the question that i
always get when something major happens
to netflix they're like you must have
fired that dude
yeah i i mean i remember the house of
cards too
like you know like we we launched it
like early by accident
and somebody just changed the date you
know whatever you know and then
that you know it just it got out right
and so people had just watched the whole
thing and then
everyone was asking me at the time like
okay so did you like find out who did
that and like fire them i'm like
no i mean that wasn't that's not how we
operate it's
right and you were involved in a lot of
those
learnings and the ins i mean your team
would you know
sort of consult on some of these things
and how to inject those failures and
sort of
write things up and and there's a lot of
documentation that goes with that too
right
yeah yeah absolutely i mean we we had
spent a lot of time building this
tooling
um and you know in one of the papers we
wrote like we spent a lot of time
explaining the algorithm that we
developed to prioritize the chaos
experiments to create them for people to
analyze the results
and then at the very end of the paper it
was like if you you know the title of
that section was
if you build it they might not come and
so
we have very good
we had spent a ton of time and energy
and like excitement honestly building
these algorithms and building this
platform
to do this and we were the only ones
using the tool it was the four
of us mostly using this tool
getting a lot of learnings we were
getting a ton of learnings
um but like the mental models of the
system the mental models of how
bookmarks worked and how search worked
and how playback worked
we were refining our own mental models
which we weren't on service teams at
netflix we were on a like a reliability
team and so it was
it would behoove us to transfer those
learnings to the reliability
or to the service teams that were
operating bookmarks that were operating
search that were operating playback but
they weren't running the
experiments and so we were trying to get
them to run them more
which is why like i took a lot of time
trying to understand
why they weren't using the tooling and a
lot of that was because
i mean they had other things going on it
wasn't it wasn't their main job and like
right when they i would watch them from
a research perspective use our tool
and i would watch them stare at the
form and
when i like on the form we had a box
that said how much sps do you want to
impact and they'd be like no no no
i don't yeah none please
so do you think that was a ui issue just
the way it was being presented
well so we were we were four back-end
engineers we didn't we
like we didn't have experience in ui and
ux but it was
it was so important and like through
that journey i'm
you know i'm so adamant about the fact
that
these back end teams and these
infrastructure teams
need some sort of designer needs some
sort of
researcher but a lot of time as backend
and infrastructure developers we're
taught that like we can kind of do every
role
and when you're impacting production
systems
and you just give someone a giant form
and you say hey here's the most
important business metric at netflix
just tell us
what number you want to impact people
are going to
like i don't want to
and you'd have to try to explain to them
that like no if we learn something we
can actually build a more reliable
service
and net positive on this
was kind of the goal right okay so there
was kind of a difficulty there with
human
factors with interface with because
humans are
a big part of this and the way the code
and so you basically took all this
learning and you said all right i'm
going
i'm out i'm going to sweden uh i'm going
to go
deal with some of the world experts on
this and we're going to write some
papers and stuff
and what what what is that i mean
you're you're in sweden and you got to
talk about some of these things i mean
you were
what what sort of people were there i
mean these were you i think you
mentioned there were like some
flight engineers and like yeah it was it
was a pretty wild experience so
uh yeah i forgot that was the question
you originally asked but no it's fine
you know
there was so much of a human aspect to
this and i even realized that at jet too
it was like i
had built a simple tool but it was
really about the culture and the people
and how they were using it and how they
were thinking about failure and i think
tools can be catalysts to change in a
lot of organizations like they can get
people thinking in a different way but
they can't teach you how to think
right they can't do the thinking for you
and so you can only do so much and so
i really i felt like i wasn't doing the
full extent of my job unless i was
getting some of that insight about the
people too and helping them think about
failure a little bit differently
and integrating some of those learnings
into my tool and so
i was speaking at a software conference
i heard john allspaw talk about this
program he was in
at lund university in sweden
it was a human factors and system safety
program and he was basically stating
like
software has so much to learn from all
these other fields that have studied
system safety for years longer than
software i mean software really hasn't
been around that long
and you know you see some fields like
medicine and aviation and
all these other fields that have kind of
rockets blowing up
they've they've gone through what we've
gone through and they've like they've
seen it wrong in a lot of cases and so
his talk just kind of blew my mind i
and i wanted to go learn more about it
and so i
um i signed up for they have like a
learning lab for a week and they're like
you don't have to sign up for the full
master's program you can just come for a
week
uh you can learn what it's all about and
i
you know i went with the intention of
going for a week and i when i left i was
i was signed up for a graduate program
but
well done on on jobs i was like i don't
know
it was it blew my mind and i like i was
in so
in the class we it was day one everyone
went around the room and introduced
themselves there were about 30 people
and so somebody
was like wait day one i can cut back to
that frame i i just i had you in front
of the school house here
and i felt bad for the viewers that were
one this is actually the university in
sweden
that we're looking at here so yeah so
people going around the room someone was
on ground zero for 9/11
um someone was a like a construction
engineer
for the tallest building in la someone
was um
uh in the er for one of the busiest
hospitals in germany
there were folks all around the world
and then it gets to me and i'm one of
the last people in the room and i was
like
i'm a software engineer at
netflix and because everyone had
told these pretty wild safety stories
it almost came out like a joke
and so people started laughing and they
go where do you really work
and i was like that is sorry if that's
actually
where i went when your episodes stopped
playing like that
yeah um and so they all kind of laughed
because one i was the only person in
software in the room and two
like it was netflix right and so they
didn't get it they were like why
why would netflix care about something
like this and i was like well you know
software in general needs to care a
little bit more about incidents and
about safety because people are using
the internet for things we don't even
know what they're using them for
every day and they rely on it and so
this mentality of just thinking
hey it can fail and it's fine if it
fails like actually really impacts
people's lives and like
you think about it at netflix like we
used to make jokes all the time at
netflix like because
when netflix was down you know we'd have
screenshots of tweets that said netflix
is down my life is over but
it really did impact people's lives like
especially i mean if you think during
the pandemic
oh my gosh people tuned into netflix
more than ever especially like parents
you know when they're relying on netflix
like at a certain time of day working
every day if that's not working that's
going to really throw off a family's
life
and so i think keeping that in mind and
keeping your users in mind actually
helps you as
a developer um which is i i think we're
evolving as the software industry there
and thinking about failure a little bit
more
and taking it more seriously but
completely agree
and that's why we need to be more
serious about our chaos engineering
which is and so shortly after that you
came back to the us you were doing more
netflix things you were
dealing with people that were not
injecting failure with this beautiful
tool that you'd built
with team of four and you wrote a book
yeah i co-wrote a book with uh my my
netflix colleagues um
just about experiences with chaos
engineering we we put a lot of human
factors
and system safety stuff in there too
i have the actual book
like booth showing here i think with
most of the authors that are
like i see aaron all right everybody's
hanging out and we're
look i actually have a signed copy of
this book with all of the authors
autographs i feel very good about it i'm
not going to show it on stream it's
it's pristine condition so
and by the way there was one thing that
kind of came up while you were talking
about that was
with the aerospace stuff last miles it
turns out has been dealing a bit with
aerospace failure and he does like to
talk about when calculations aren't
right because that was actually the
subject of his stream
last night oh that's pretty cool and i'm
nom nom come on knock it off that's
that's cracking that's not hacking
we're talking about the interesting part
of doing clever stuff with software and
putting things together
so okay yeah the internet was down
we have a newbie yes it is a great book
by the way i've got this book
you can't have my copy because it's
signed but it is available
i guess on amazon is that safe to say
yeah then the newest book is available
on amazon the one you're showing on
screen here
uh is i think that one's
actually available for free
on o'reilly's website but we wrote that
a couple years ago and so we wrote a
a longer one about 200 pages with a
bunch of contributing authors um
uh casey rosenthal and i co-wrote that
this is casey here in the middle
that just came out a couple months ago
um so you can you can buy that on amazon
now but um yeah
it's it's a number of different um a
number of different
folks from from various companies and
how they're implementing chaos
engineering and how they're thinking
about it and so it
was really fun okay so if you wrote a
book i mean now you've
written the book on chaos engineering
and you can
like hit people over the head and say
inject failure in your tool because this
is the book and you know here's
here's my name so come on and use it did
like did you have a little more success
as a result of this or
or how did that kind of go from there
like
at the netflix community i mean were
they more it seemed like the whole
world was getting a little more
interested in chaos engineering you had
done some public speaking about that and
a bunch of other people had been talking
there's a chaos symposium
i understand and and then it was not
cool to call it chaos for a while i got
a little confused there
we we have to call it reliability
because chaos
might be might be a bad thing is that is
that a fair description or
um yeah i mean i think um
i think it had an interesting name and i
but i think the name
confused a lot of people because it was
about engineering the chaos that was
already in your system
it was not about creating chaos it was
also not about chaos theory
i had a lot of people interested in
chaos theory contacting me and i had to
tell them that this was not
that
um but it was um yeah it was an
interesting experience i
you know i got to speak at re:invent
about it which i
think oh wait wait wait wait wait no no
no we're gonna
we have we have a shot of that because
nora actually was on
this wasn't just any conversation at
re:invent
this is not the right picture okay what
what did i oh here it is i have it in
the intro
this is i have your your picture here
from
this is nora you can't you can't really
see in this shot she's she's standing
over here and she's walking over to the
podium
uh doing a keynote at aws re:invent
is that am i describing this picture
right i mean is this the right yeah yeah
yeah that's okay
so she and by the way you're also
getting some mentions here from some of
the
people in the chat brenner dev just said
i applaud
you for having the what if mentality in
addressing worst case scenarios she
actually deals with every case scenario
so they don't happen in production i am
an avid proponent of that philosophy
that's that's well said brenner um but
okay
lord of zazzle might pirate your book uh
let's not
discuss piracy as a means it's free
isn't it
wait a minute why are you pirate that's
right
okay i think pirates of that's free if
you really
if you really need a copy please don't
pirate it i can say okay
yep okay so and last miles likes is
buzzwords
and uh there shouldn't be a presumption
that all software is inherently unsafe
some coding guys so i have my own
opinion on that
all of you here know about my opinion
about regular expressions and how little
i trust them and
how often i've seen them break but we we
could talk about that uh last miles
wants a signed copy
that i don't know if we're gonna be able
to arrange that last miles but you know
we
will see what we can do let's uh we'll
discuss that after the stream
now tell me about this is an insanely
large conference stage that you're
walking on to
i've i've never done it i've never never
spoken at aws
at the re:invent most of netflix's stack
is on aws there's a lot on the line
there's
the what fifty thousand hundred thousand
people watching this event i mean how
how does that feel to like have to do a
keynote
i mean this this to me is like
a scary scene i mean i i'm walking on to
that stage with that amazing backdrop
and like monitors like flying everywhere
and everything going on like
how does that feel it felt i mean it
felt awesome it was
it was a really great experience so at
that year's aws re:invent so that was
2017.
um so it was almost three years ago now
but i was doing a longer session where i
got into the technical details of what
we had built and that was a 40-minute
session
and then the session that you're showing
here was my keynote session which was
about 12 minutes and so
uh at that point a lot of the industry
had not heard of chaos engineering so i
had to convince everybody of the value
of breaking stuff
[Laughter]
that part was really nerve-wracking like
if you give me 40 minutes
sure i can imagine i can do all of it
but
so it's harder to give a short talk than
to give a long talk
totally totally counter-intuitive you
see this stage right here it's like
i don't i i enjoy giving talks i enjoy
connecting with folks in the audience
and seeing them
but in this talk i actually couldn't see
anybody
uh the light the lights were very bright
and so i'm just
that's a good thing for 12 minutes and
trying to convince folks of the business
value
um but i got i got a lot of um messages
afterwards i had a
there were a lot of organizations
afterwards that were like
well we really do need to change our
mentalities and philosophies and so
i think we saw a lot of changes to how
the industry thought about chaos
engineering
and beginning to think about it at all
after that and so it's been really
awesome to see how the industry's kind
of adopted it
um sure and i actually just dropped the
link to that talk
in the chat people can check that out um
i mean we're gonna
we'll put some more links to some of
your other stuff as well um but if you
do want to watch that take a look later
we have the live nora jones here so you
can you could check out the pre-recorded
nora jones later and thank you
by the way for the follows i just missed
from ef wanda nashanth
and uh dr selzam so thank you very much
and there's a bit of an argument i'm just
i'm not gonna
now this is not a paid paid engagement
here but
you know they are offering to pay for
signed copies
that's going on right now in the chat so
apparently
you can handle that business on your own
so um
but anyway apparently there's a lot of
demand now for the chaos engineering i
and so uh how is
yeah they're they're just okay it's
terrible when they have that mindset
nothing could ever go wrong because
nothing could ever go wrong right nora
right right totally yeah so it's like
as soon as you hear those words like
what what if this fails it's like no
it's
it it will fail like how is this how is
this going to fail
it's nice to get in you know kind of the
the pre-mortem mindset although it is
it is pretty much yeah
um before the death of the capital
engineering it's like if this
if this software fails like how is it
going to fail
and giving people an open space to talk
about that is actually
a lot of fun okay uh and it and it can
um
it can honestly ease people before a
launch too it's like oh maybe things
aren't so bad or
oh wow we actually probably need a
couple extra weeks or months on this
device and
getting them talking actually you helped
co-host a
um an event uh that that chaos symposium
i forgot
the name of it but it was something that
uh some of the previous chaos engineers
from netflix and some people that had
some startups going on and
i think you were involved in a couple
organizing committees and
stuff like that so i mean did you see a
lot of chat in the conference scene
about chaos engineering over the years
after the aws talk
yeah oh my gosh i saw it like completely
uptick
okay so every programmer is now trying
to bust their own system
is kind of the new world that all of you
need to break
your stuff okay so that's this is a
great mode to get into oh and we even
have a pro debugger showing up marion
who caught two of my bugs on yesterday's
stream
maybe three i forgot i tried to forget
and
risky banana here who says my software
never fails because i never launched
before getting sidetracked by another
project that's
also a possible solution is never to
launch code because only code that's
launched will break
okay so you never write code it will
never fail this is
literally my github bio coming from
brenner dev
if you are not intentionally breaking
your own software in development
someone else will
in production or nora jones will break
it in production with her tool
which actually leads to our final
major topic in your hacking journey
which was you it wasn't it wasn't enough
just to go
on at netflix continuing with this whole
world of chaos engineering
and tool management i mean obviously
they already had those tools that that
case has already been won
netflix is very interested in destroying
itself in order to make itself more
resilient
so that that to me seems like success
but
you went on from there and now there's
something
a little more stealth mode and i don't
know how much we're allowed to tell the
viewers about it
but is there any detail you can give
them
about what happened in the post netflix
era
yeah i mean well like uh i wasn't on the
chaos engineering team the entire time i
was at netflix i
um i mean i started getting really into
learning from netflix's incidents but i
started looking at netflix's incidents
to help inform the chaos engineering
tools
when i started looking at the incidents
i realized there was so much more data
and patterns to them that could help
just beyond the chaos engineering tools
like
it could help things like different
aspects of the business it could help
with team allocation and team size it
can help with quarterly planning
it could help with a number of different
things just because of the patterns that
we could see in the software like
um guy i'll tell you about one of my
favorites oh please oh please
i don't know if you were there for but
you'll you'll really love this so it was
like um
we had a like deploy freeze right around
the holidays so the holidays are
netflix's like
highest traffic time of year um
yeah and so at that time like we had
kind of like pause on deploys a little
bit
and um so what are developers doing
during that time
they're piling up a bunch of stuff to
deploy later right
and so uh they're not exercising those
muscles and so it had been
it had been a while since folks deployed
something it was a little bit after that
and when you don't release early and
release often what
you when you just pile up code and that
means that all of that code is going to
be well tested and work perfectly when
all of it drops
totally and they're they're so excited
to release it because they've been
working on it for so long
they're polishing it right they're
getting rid of all the faults so that
nothing could go wrong when it does
launch
yeah and so uh one developer had like
you know released something
and um and they saw an issue with it and
so they decided to turn
it off via like a feature flag and the
way they turned it off was they
um like had set the uh
feature flag value to something that was
like arbitrarily high
oh can you explain quickly what a
feature flag is before we
hit that real quick yeah yeah so it was
basically like a numerical value
associated with the
particular feature that they were
developing just to um
because they don't want to launch
everything at once right they want to be
able to kind of like
control it while it's in production
without doing another code push
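A minimal sketch of the feature flag idea described above, assuming the numerical value is a rollout percentage (the conversation only says it was a number tied to a feature). The class names, the flag name, and the in-memory map are hypothetical, not Netflix's tooling.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory flag store. A real system would read values from a
// config service so they can change in production without a deploy.
final class FeatureFlags {
    private final Map<String, Integer> values = new ConcurrentHashMap<>();

    // Called by an operator-facing tool at runtime, not at build time.
    void set(String name, int value) {
        values.put(name, value);
    }

    // Assumption: the value is a rollout percentage, 0 = off, 100 = everyone.
    boolean isEnabledFor(String name, int userId) {
        int percent = values.getOrDefault(name, 0);
        return Math.floorMod(userId, 100) < percent;
    }
}

final class FeatureFlagDemo {
    public static void main(String[] args) {
        FeatureFlags flags = new FeatureFlags();
        flags.set("new-player-ui", 25);  // hypothetical flag name
        System.out.println(flags.isEnabledFor("new-player-ui", 7));
        flags.set("new-player-ui", 0);   // turn the feature off, no code push
        System.out.println(flags.isEnabledFor("new-player-ui", 7));
    }
}
```

The point of the design is that the second `set` call changes behavior for running code, which is exactly why the tool used to change the value needs guard rails.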
okay yeah exactly and so they um
they were trying to get it they were
trying to do it without another code
push they found some issues with it and
so they could turn it off
um safely but the way they turned it off
was they went to this ui where you could
type in your feature flag number
and the way they were turning it off was
they were like oh i'm just gonna set
this to like an arbitrarily
low value i can't remember all the
nuances of this but wait this sounds
familiar
i think i do remember this incident they
set it to negative
3 billion oh i do remember this
which is um more than the max integer
that
or more than the min integer that java
can parse
so it broke everything everything
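A small sketch of why negative 3 billion is a problem for a Java int, and of the kind of safety a flag-editing tool could apply before accepting a value. `FlagValueParser`, `parseBounded`, and the 0 to 100 range are assumptions for illustration. Only `Integer.parseInt` and `Integer.MIN_VALUE` are standard Java.

```java
import java.util.OptionalInt;

final class FlagValueParser {
    // Returns the value only if it parses as an int and sits inside [min, max].
    static OptionalInt parseBounded(String raw, int min, int max) {
        try {
            int value = Integer.parseInt(raw.trim());
            return (value >= min && value <= max)
                    ? OptionalInt.of(value)
                    : OptionalInt.empty();
        } catch (NumberFormatException e) {
            // Integer.parseInt rejects text outside Integer.MIN_VALUE..Integer.MAX_VALUE.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.MIN_VALUE);  // -2147483648
        try {
            Integer.parseInt("-3000000000");    // below the smallest int
        } catch (NumberFormatException e) {
            System.out.println("unguarded parse threw NumberFormatException");
        }
        // The guarded version turns bad input into an empty result instead.
        System.out.println(parseBounded("-3000000000", 0, 100).isPresent());
        System.out.println(parseBounded("0", 0, 100).isPresent());
    }
}
```

If every consumer of the flag calls the unguarded parse, one bad value fails everywhere at once. Rejecting it at the point of entry keeps the failure in one place.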
just bad things well it wasn't this
developer's fault like i
felt i felt so bad for them like and i
when i went and talked to them they were
like
i can't believe i did this like i should
have known they hadn't been practicing
this in a while and i was like
can you show me where you um where you
entered this value and they were like
you you know you've used this before you
know where i was like can you just humor
me for a sec like can you just pull it
up
and they pull up this ui that i've never
seen before oh
oh wait now this is where you change
feature values
so stuff that's in prod and you don't
normally like this is not like part of
the push cycle
this is like you push your code
everything's deployed you want to adopt
a
change of value and this is the
interface you're talking about but you
who work regularly with this feature
stuff and chaos and all these other
teams have never seen
this interface before i had never seen
that interface before and i was like
this is a secret
shadow way of changing features that
is unknown to the chaos team and
demanding
i had been there two years at that
point and i was like that's kind of
i was like they were like what do you
use to do this and i pull up a
completely different ui and they were
like
i've never seen that ui which is very
pretty and modern
by the way it's got like a nice sorry
sorry and so i'm seeing like
but the one that i was using actually
had
um safeties right yeah yeah and so
it and so what we found out was a lot of
developers that have been at the company
like this developer had been at the
company a certain number um much longer
did not know about this new ui that was
a little bit safer
and the old ui i don't think a little
bit is a fair description i think i
think it was like worlds
safer worlds safer and the old ui was
running on one node
i went and talked to the team that owned
it they were like you know
oh wow uh people are using that
right and so it was like that was just
one of the most interesting incidents
i've i've ever experienced and
everything was
everything was fine like it it it all
was an okay
like situation it wasn't like customers
which means only a few million users had
incidents no no it was it was really
like it was like a very few
situations okay um it was just so many
nuances to that that
that particular developer if i hadn't
gone and talked to them they would have
just taken the blame for the entire
thing like oh i was being silly and they
really weren't
like they were right they hadn't
exercised that muscle in a while
they didn't know about the new ui so it
was like a communication breakdown you
know
and so it it spawned a lot of really
necessary conversations in the
organization and that
if i remember correctly i think that was
around the same time that the
hawaii guy got in trouble for the
missile command
and and for those of you who don't know
that i i don't remember all the details
of that but it was basically
he had like the password on his computer
and
i don't remember the specifics but it was
like this this crazy ui that if you
press the wrong button it like alerted
all of hawaii
that there was like an attack or something
i felt so bad for that person because
he he didn't get fired but he got moved
on to another team and it was like
we've had problems with this guy before
kind of mentality but like when you
looked at the ui he was looking at it
was like
missile and test missile
but like when you know hawaii is
reporting this they're like oh we really
gotta blame
yeah and i think i remember that mainly
because when you talked about this
incident the ui that that developer was
using
looked like that like it was one of
those like it went out of its way
i mean it was again it was an early
developer tool from the early days of
netflix going to
the cloud but it was one of those cases
where like
doing the right thing was hard exactly
and this is why back-end teams and
reliability teams
need more than back-end engineers they
need like product managers they need
designers right because
these like ultimately these can lead to
really big incidents but
a lot of times companies don't invest in
hiring like
differing skill sets for for them sure
they just throw a bunch of back end
engineers on the team
right and and now chat by the way is
sharing please don't share pictures of
all of your screens with sticky notes
and passwords
posted on them so we're gonna we're not
gonna get into that
um you can keep that but if you wanna
give me your mother's maiden name and
your social security okay
anyway um so moving on we're gonna that
that led to you said that there's
there's business in this
i mean it wasn't so that that led to a
lot of like business realizations for
netflix right and so
um after that i um i had spent like
the last six or seven months at netflix
really analyzing incidents finding
patterns behind them helping teams learn
from them
helping use that information to direct
attention in organizations
uh and it was a lot of powerful work but
it was a lot of like manual work there
wasn't a lot of tooling around it
um slack had been contacting me because
they were about to
ipo and they wanted to you know bring
some of this netflix mentality over to
slack in terms of chaos engineering in
terms of incident analysis
and so i went there for a little bit and
i was doing some incident analysis
around the incidents they were yeah i
was at slack for a little bit
yeah so when slack kept breaking over
and over again you're like you're just
feeding my degree here like this is
basically i mean it was like around the
time they ipo'd too so it was a very
interesting time to be there
um but at that point i was like there
are
there is such a need for better incident
analysis tooling to help businesses
a little bit more to like help them
understand their incidents
uh can be used like for opportunities to
help them direct attention in their
business help them understand their
trends rather than firing
individuals and you know yeah exactly
rather than firing people but actually
understand like
how some of this stuff is happening and
so i stayed there for about six months
and at that point i was kind of itching
to go
um start my my own company around some
of this like i saw
i saw a big need for for this and so
would you do
i did yeah so we're still in build mode
right now i've been working on it for
about a year
are you at the secret campus right now
is that is that why we don't know your
location
the secret campus right okay um
yeah we're still in build mode right now
but uh we're we're basically building
tooling to help companies
understand some of these opportunities a
little bit better to help them
direct their attention so that that is
fantastic
are we allowed to share that web page
or yeah go for it okay so
i wasn't i wasn't really sure what um so
they're still
a little bit early there's not too much
description on this yet but
this is the the story this is if you
want to tell your
your full story uncover your full
potential then this is the jeli.io
page and i'm just going to drop that
link in the chat this is
where nora is currently
getting things set up they're they're in
build mode right now and
i don't know i don't know looking around
doing
venture rounds i mean that that's like
an insane amount of work to start up
this type of enterprise and then try to
try to explain that
chaos or you know injecting all this
tooling
around human factors is important for a
company to adopt and not
if they don't want the pieces all over
the place of their product
but they think oh don't don't break my
thing right i mean so so is that a
difficult like position to be in talking
to
the business clientele i mean or are
they more receptive
we're trying to meet people where they
are today like people know they have
incidents today
they want to do better at them we're not
trying to change things for them we're
trying to give them a boost
uh and so um also all if you if you sign
up on the site like you can stay updated
i'll share a little bit more once once
we are out of this build mode and like i
can talk about the product and stuff but
um yeah i'm not trying to change
people's mindsets i'm just trying to
make them
uh give them different insights into
their their incidents than
than they're seeing right now sure oh
that makes sense
um and insights are a very important
part of
understanding when things are wrong
right i mean that's that's all about the
visibility of what's
what's actually going on in your system
not what you think it's going to do
and by the way thank you for the follow
risky banana i appreciate that
um so in closing there's one
question that i usually want to ask uh
most of the people that are on the power
hacker series
and that is if you could
give a little bit of advice like one one
tool or
technique or book or and i know they
want to buy 100
signed copies of your chaos engineering
book that
you could give to them knowledge wise
what would that be that would help them
on their journey to
you know the bright side of hacking the
bright side of understanding complex
systems um
insights and i think i think your second
point is
so important for developers i wish you
know i studied computer engineering in
school and so i
like i had a little bit of background in
understanding complex systems but
there was there's not a lot of
dedication to to understanding system
failure
even in engineering degrees like we
don't get taught a lot about that and so
you know there's a really great book
that was the first book i read in my
program it's called the field guide to
understanding human error by sidney
dekker
and he was a pilot and he talks a lot
about system failures
and how we think about them and all of
it is directly relevant to software you
could literally just find and replace
all the words aviation and change it to
software and it would read
it would read exactly the same and so
that book supplemented you know with
like designing data intensive
applications
and um the pragmatic programmer i think
are like
great books yeah i think are great books
for for the programmer's tool chest but
like
a lot of folks don't don't know about
the field guide to understanding human
error but i think
as developers like you get a leg up if
you start understanding
the human piece of the equation rather
than just the code
interesting and i do want to apologize
to risky banana that was not a follow
that was a sub i'm sorry i missed that
so
thank you even more you have some of the
best programming emotes including
the no regex emote which how many
incidents did we see i mean just just
ballpark that might be related
to regular expressions is that am i
it's too many it's too many incidents
people in this channel are always asking
me why i'm so anti-regex and i try to
explain to them that
it's not the regular expression itself
it's the fact that it's a super dense
piece of code and it's very unreadable
and no one takes the time person that
wrote it
understands it and even they don't
understand it like a day later
you go git blame and it's like someone
wrote it like 12 years ago that's not
even at the company anymore and you're
like what am i supposed to do with this
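One way to take the edge off the density complaint, sketched here as an illustration and not as anything the speakers proposed: Java's `Pattern.COMMENTS` flag lets whitespace and `#` comments live inside the pattern, and named groups say what each piece captures. The version-string pattern and the `ReadableRegexDemo` class are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReadableRegexDemo {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from # to the end of the line is a comment.
    static final Pattern VERSION = Pattern.compile(
            "^(?<major>\\d+)    # major version\n"
          + "\\.(?<minor>\\d+)  # minor version\n"
          + "\\.(?<patch>\\d+)  # patch version\n"
          + "$",
            Pattern.COMMENTS);

    // Returns the major version, or throws if the text is not x.y.z.
    static int majorOf(String version) {
        Matcher m = VERSION.matcher(version);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version: " + version);
        }
        return Integer.parseInt(m.group("major"));
    }

    public static void main(String[] args) {
        System.out.println(majorOf("12.4.1"));
    }
}
```

This does nothing about performance problems such as runaway backtracking. It only helps the next person who runs git blame on the line understand what the pattern was meant to match.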
and this is real this took out netflix a
number of times that i recall
but i have seen outage reports from
cloudflare
from google from everyone has had a
major
everything went down because of a
regular expression anyway i know
that might not be in your book i mean i
want i'll write that chapter next next
book and actually you have another book
you just you just came out with another
book right yeah that was
you you talked about that earlier oh
okay i'm sorry that was that was the
second okay that's that's my bad um i
had that i had the signed copy right
right that was my
my early one okay that's that's super
great um
any other parting words you have for uh
chat here it sounds like after they're
done posting their passwords from their
stickies on their monitors
um anything else you want to tell them
this is by the way
nora jones first appearance on twitch i
believe
it is yeah so any anything you want to
tell them before i
tell them that we're sort of done with
what we're doing uh
no thank you for having me and thanks
for the question oh it's been a real
pleasure
to have you here and and last miles is
very adamant to talk about a signed copy
for a hundred bucks
um i think that that last miles that
might cost a hundred bucks just to ship
it to canada
as far as i can tell i don't know what's
happened to the border but
um for those of you who want to harass
me
about those books feel free to join the
discord
and please don't bother nora jones on
her email sign up page about that but
that
also is a hacking technique you can come
on the discord and chat
nora is not on the discord yet
but maybe we can talk her into joining
our discord community as well
after she's done a power hacker series
so oh i i did want to mention one other
thing
i'm sure we have a website called
learningfromincidents.io
where folks from all
over the software industry kind of open
source their
learnings from their their incidents and
how they're improving and learning from
them so
um it's a fun site to go check out so if
i understand this correctly you really
love the
indian ocean is that right what
i mean io right it's indian ocean right
oh yes
yeah so do i have this right learnings
from
incident learning from learning from
incidents from
incidents
okay that does not look like the right
thing
uh learning there you go
oh that is the right thing that's not
that's not it either uh i will post it
in the chat
okay well yep drop that in the chat uh
for all of them
and i do very much want to thank you for
coming on the show it's been a pleasure
to have you
and thank you everybody for giving nora
such a warm reception
and uh generally being a great audience
so uh
that is uh yeah this is episode three of
the power hacker series
come join discord these happen on
wednesdays in general
um maybe we'll get nora back someday uh
she was originally going to host
uh my my charity run
which i which we haven't talked about
rut yet but
um that that may be happening again
ms jones but i know you're super busy
with startup mode so i won't i won't
bother you about that this year
but and it was super fun when we got to
work
like a cube or two apart in uh in los
gatos so i do miss
this that was a lot of fun i miss it too
um but anyway
thank you so much for coming on the show
and thank you everybody
for being here and being a great
audience so we'll see you later and
next up is going to be saturday is the
launch party i actually
nora you can inject some chaos into this
but i'm actually about to launch
an original nintendo game on
saturday cool and it will people that
are going to be coming to the launch
party
can cash their channel points in to be
able to get their own personalized copy
burned into the nes rom so that will be
happening on
saturday so everybody drop by on
saturday i'll talk more about the
details of that in the discord
but nora thank you so much it's been
great
thank you guys thank you everyone take care