Topic 11 - The Missing Parts 1: Hashing (Dictionaries)
Outline:
- hashing map keys (strings, ids, etc) to indices
- computing the index should be fast
- use the index to lookup a value
- not for cryptographic use (reversible hashes are ok)
- minimize collisions, overflow is ok
- add a modulo to reduce the space
- dealing with collisions
- make every entry a bucket of keys (linked list)
- search next hash cell for exact match (or reverse)
- simple hash functions for strings
- length of the word - easy to compute, prone to collide
- sum of the letters - less collision prone
- modern types djb2 hash without shifts
- a tour of hash functions
- future - linked lists can win for small sizes
Transcript:
hello everybody how are you all doing
we are here back learning c we've had
enough of this
bingo nonsense that's been going on for
the last week
and i've got uh jesse on the line jesse
are you there
hi everyone all right she's there um so
tonight
i have a slight um change uh
i was originally saying we're gonna do
hashes and sorts um we're just gonna
focus mostly on hashes so
uh sorry to mislead everybody
we're gonna we're gonna do a bit mostly
with like hashes so
where we are now is this is the end of
sort of the c
like learning c as an additional
language course this is
uh i call this the missing parts so this
is the stuff
these are four sub-topics that all
fit into kind of like just doesn't fit
into anywhere else
they're they're important things to
really doing anything with c
but they're not they're not they don't
conveniently slide into syntax patterns
or anything else all the
all the basic syntax stuff jesse she's
mastered that she's like expert c
programmer now so we're just not
worrying about that
i will see nathaniel i i don't feel like
we missed national scotch day i feel
like we're just going to have to have
a national scotch later
yeah so that's what we're going to do um
harris is this stand-up comedy no
god no okay so without further ado
let's um i'm just going to cut real fast
to
let's see what we had was this thing
and this is our
oh bingo because yeah we're just doing
bingo so for those of you uh
i know 715 209 has been watching all of
it but
we've been building world scale bingo
we're building this i i actually just
got to the part which i thought
jesse would be interested in last night
and i called her out on the stream where
i said i
i hit a memory leak because that happens
i i rewrote this thing to be in c
because you want to scale up to the
whole world write it in c so i wrote it
in c
and i hit a memory leak and i sort of
ended the show i was like you know
maybe this should be like jesse's final
exam like
figure out this memory so don't worry
about debugging
like weird freak programmers it actually
prefers debugging to actual coding i
i have a feeling that after uh getting
into apache debugging
you're gonna take that back but it's
like
exercise where you're miserable and you
hate yourself while you're doing it and
you question everything you thought you
knew about yourself and what yeah
and then when you're done you're like
right that's true that's why absolutely
true that is absolutely true so
um this yeah we got this up to handling
about a billion people so that's fun if
you want to catch that i'm gonna be
finishing that stream up uh later this
week
so so that's really cool actually so so
yeah we started with a python app
actually i was kind of thinking of you
because
you know we wrote it first in python the
first day and you know we managed to get
it up to like
715 what did we hit like around 500 rps
with the uw
sgi version which was which was pretty
solid so 500 requests a second
pure python we were done in like one
session it was beautiful
and then you know somebody threw a
challenge at me like why don't you deal
with like 21 million users or
a couple billion i was like you know why
don't we do it for the whole world
forget 21 million so
been rewriting in c and then you know
memory leaks and everything else that's
fun about c
but it's it's so funny that you say that
because like one of the hardest things
about
well not the hardest but one of the
difficult things for me about this
pandemic and everything has been
i can't volunteer with the fire
department and the lesser fun thing with
the fire department is we get most of
our money to fund like our new equipment
and stuff through bingo for the elderly
population of our community
and we've had to shut that down and
that's like our main source of income
so you say bingo and i kind of just go
all of these memories of dealing you
know if you want a tough crowd
75 and up wow they throw water bottles
i think thrown like stuff at the bingo
callers like they're
intense i'm making a note because i
think we we're gonna need to talk about
that
throwing stuff at the bingo hall all
right so we'll we're gonna get to that
as one of our toppings so
um okay um so yeah anyway that
i'll finish that up this week so where
we are we're on c this was actually what
i meant to pull up
and we have we have jesse here and
um so this i originally had we're in
topic 11 tonight
i was gonna i'm gonna break this up so
this will be tonight we'll just focus
sort of on the first half here which is
gonna be
hash tables and dictionaries to somebody
coming from python um
and uh sorting i don't know we'll see
it'll be kind of the reach goal for
today we'll see if we get there
um otherwise uh this other stuff in the
last session
i think randomness and map pretty much
deserve their own
topic so we might break those into
separate sessions but anyway
good to see everybody and let's get
coding
so enough enough this bingo nonsense
yeah we
see my screen all right that helps all
right so um
i'm gonna go get the the glove of
drawing
and we are going
to talk about hashing so i'm gonna just
cut over to the whiteboard
or if nightshade dude was here the
blackboard because it's not actually
white
all right so what are we doing hashing
in sorts
um this is going to be like sort of a
dotted line around that
so we'll we'll worry about that later um
and so hashes
what is a hash let's throw a big
question mark on that
um this is a topic that
uh now again i'm not trying to get too
far into like the pure cs realm so
there there's there's a whole art to
hashing it's
almost more art than science but there
there is some solid science behind it
um this is about
this is one of the ways i'm and i'm
going to go and make
a very argumentative statement here so
prepare your tomatoes prepare your
eggs just get ready to start throwing
stuff
i'm prepared one of the primary
functions of a computer is to
find stuff and once you've found stuff
sort of like do something with it so
and by that i'm really specifically
referring to some kind of transformation
so most of what the computer is doing
is either finding stuff or doing
something to that stuff
and that pretty much sums up everything
that a computer does
this this is more on the computational
side
you know so you're you're doing some
sort of you know math to it
and this is more on the ordering
and finding side
and having just watched hidden figures
and remembering that uh which was very
fun if you haven't seen it i really no i
really want to see that that's on my two
watch list that
i i recommend it um i think it's very
interesting um
because back in the day you know a
computer was a person a person
yeah it was it was usually well this is
not actually going back it was
it could be anybody i mean it was just
somebody who was
amazingly fast with arithmetic and the
period of hidden figures yes usually a
woman
but that's interesting because that
draws attention to this second function
because this is where we picked up the
machine
and we use the word computer but i i
want to get into this because you're
a linguist and i think it's funny that
in spanish they chose not to do that
they went with the other primary
function of
a of a computer which or what we would
call a computer
and they basically call it an orderer so
the word in spanish is ordinary
yeah which is which is kind of fun so
that's that's someone who basically
orders
these are your two primary functions of
computers now
we've been doing a lot on the something
stuff so we've been doing
transformations we've been ripping
through lists we've been doing all the
stuff that you do in c
parsing whatnot but we have done very
little on the finding stuff
so dealing with large sets of data
dealing with
things that we need to go looking up
retrieval you know
sort of putting order to things um so
one thing really quick
yeah i find this very interesting from
from a linguistics point of view is that
the etymology of computer actually comes
from
latin putare which is to think and so
when you
when you kind of hearken back to sort of
like cocodo ergo soon
it's a very similar idea of like i am
it's also
used for the word it's depending on
context can be used for the verb to
count as well yep
which i find really interesting because
i mean i'm i'm i'm
doing some hand wavy stuff here because
oh i am too
individual thinking where is like con
pitare and latin is more about like like
uh
collect like thinking together so cow
calculating literally which is what
you're getting at but i still kind of
like
even if it's a bit hand wavy the
connection between descartes kind of
saying you know i think therefore i am
and potare being to think and being the
root of your computer because i find
that you know
compute is i compete therefore i am
which is like a very cool
like it gets into some some some very
futuristic sci-fi uh leaning
uh conclusions but i find it to be a
very interesting linguistic connection
yeah actually cantonese uses electric
brain
to your thinking point so okay yeah
so all right i'll i'll put that on the
list let's let's yeah we'll we'll get
back to our business but let's let's
let's throw that on the list
we'll say uh uh electric brain
and computer etymology um so
but importantly i do want to get off the
doing something transformation stuff
and i want to talk about finding and
that's mostly where we're going to be
dealing with hashes
so these two topics i had them put
together hashes and sorts
because they're really all about finding
stuff and finding it quickly
so you know why are we usually sorting
things we usually are just ordering it
so that we can you know pull something
out relatively fast
sometimes just for neatness or you know
but typically it's for retrieval we're
organizing it so that can be retrieved
quickly
hashing is related to that um now
they're related in the sense that
they're both in the general space of
finding stuff
but they're not they don't work the same
way and in the case of a sort
i'm sort of taking everything and
putting it into order and then
a certain given order right yeah yeah
some sort of correlation order kind of
previously decided upon order yeah yeah
that's
that's formally called collation um
so not worried about the terminology but
you know there's some sort of co
like sorting strategy you know like so
it could be alphabetical it could be
numerical it could be biggest the
smallest
smallest the biggest whatever the
collation is
hashing is about trying to make the
retrieval
instant quickly based on sort of a
numerical recipe um so this is kind of
like
shifting more onto the function side so
whereas this is sorting is kind of more
on the uh
you know ordering side so let's let's
put everything to order
but on the function side we want to find
a way to kind of
quickly pull that information up okay
so nathaniel to your question we're not
going to get
into data structures too much in the c
course i was thinking about doing it as
maybe the next course but so
um yes please we can we can go deep onto
data
structure so for for this segment on c
i'm really just going to go very light
on hash and sort and then basically just
hand wave that there's a massive field
behind both of these
and we can we can do that in detail if
we want
but for now let's talk about hashes so
the most common case that i want to kind
of pick at for a hash is you have a
large list of stuff
and it's typically organized as some
sort of key value relation
so and these don't need to be the same
data type so
the key could be like some sort of
number like this is
i mean they could be ascending one two
three four it could be a name
so here's jesse here's guy here's
nathaniel
you know it could be we don't know what
space the key is going to be in but we
do know that we want to be able to
retrieve a particular record like say we
want to look up guy
and actually actually even better is
that chunk
let's let's have the chunk entry right
you know so
let's say that we want to go pull like
chunk out of the library like
and then we want to get sort of the
corresponding value so let's assume that
these are all organized nicely
so if i find chunk i want the stuff
that's in here
and i'm being very loose about what the
key and what the value are
because it really depends on whatever
your application is so um
in the case of like a planet it could be
something like you know
mercury and you know the orbital
period you know so we want to have some
sort of fast retrieval for
the orbital period so it's i don't know
88 something
i can't really remember yeah i don't
remember exactly um but anyway so
you know we have some sort of
relationship between a key and a value
um and how do we actually pull
mercury up well one way to do it is we
could actually just
look through every single entry
and this is kind of the way we did it i
think in one of the earlier courses we
just said
all right start at the top of the list
is this mercury no is this mercury no is
this mercury no is this mercury no
no no no no yes okay then it's here
how long does that take well it depends
how big your list is right
fair enough yes yeah so if i'm writing
down what you're saying
yeah i mean if my list is really big
this i mean so if we're dealing with
like
a list the size of like the number of
planets no big deal i mean we can we can
rip through nine entries very fast on a
computer so we don't need to be clever
about it
but if we're doing something like um
i don't know social security numbers for
example then
this might be a very large list right
this this might be like on the order of
300 million
and looking through them not all of them
are valid right you know because there's
people who've gone
past on there's there's numbers that
have been reassigned for identity theft
whatever it might be like this is going
to
take quite a lot of time if i have to go
through 300 million
entries just to find the one that i'm
looking for we don't want to do that
that makes sense
okay so efficient searching so yeah one
way we can deal with that is we can put
all of these into sorted order and then
we can
sort of use a dictionary type approach
like you know how you learn to open the
dictionary you say oh i'm going to open
it to the a
you know that's going to be towards the
beginning you know or i'm going to open
it to like
f because that that'll be like i don't
know a little bit in but not it's still
sort of the first half or like something
like z which i know is going to be
towards the end
so that so sorting this could cut down
the search space but
it would be even nicer if i could jump
directly to the entry that i want
because i know that i'm going to be
looking it up by key
so i want to organize the information in
such a way that i can retrieve it very
quickly by key
so you want the thing that you're
searching for to be
in some sort of subject subjective
first number of entries given how you've
sorted it yes
not necessarily at the top but just to
like improve your
global searching to have it be in like
you know the the first 10 seconds or
something of searching
probably something much faster but just
arbitrarily
yeah okay yeah and and love bone sorry
about that is that
is that fixed i i i turned it down
because i can speak up
no no it's i i had i was blasting some
music over the
weekend so anything good
no no no no nothing oh i'm just saying
because i've been blasting like lots of
broadway lately
oh yeah actually it was hamilton there
you go see
yeah i mean on stream i was blasting
some
stuff that was whatever so um
anyway uh cool luffbone says it sounds
better all right so
um great well how do we do that well
what we want to do when we're
retrieving data normally with something
like this it's stored in an array
so i've got an array and say i've got
like
five entries zero one two three four
okay so if i want to pull if my key
is this number then retrieval is very
very quick because i know
for example if these are all fixed size
then i can do some very fast math i can
say okay if i want the second one
all i need to know is how big this is
you know one of these things you know so
i could say like if this is
size like say this is four bytes then i
can just multiply four this is four
bytes
zero four eight okay and i get there by
just multiplying this by however
many that is so the computer's very fast
at doing something which we studied
earlier
and we pop out the number very quickly
and we're able to just read
the entry okay it's not so easy when
this data
is not numerically ordered in this
fashion it's like zero one two three
this is nice there's no there's no holes
in here there's no space between one and
two
there's no missing you know like i i
don't need to worry about like where's
guy
on this list because how do we put that
into a number right you know this is not
you're interested in the index
not necessarily in the value yeah and
retrieval then
is super quick right because when i go
and i look in this array if it's called
arr
i can just type arr you know two
and it's going to pop that thing this is
very very fast access
so the problem with typing in something
like arr
guy which i might want if i'm trying to
look up
something about me like say for example
my stream key
so um so i'm writing twitch then
i might have this organized with labels
like this
but i can't index into a location in
memory based on this
value unless i use a hash
and that's exactly what hashes are
generally for so
hashes so they're labeling like indexing
it's it's
indexing it's a way of taking this label
whatever it is and turning it into an
index so
putting that um slightly
slightly more formally and in this
type of orange color this would be
i want to take some sort of function
over
some data you know some sort of key
and i want to end up with some sort of
index
that chord excuse me that corresponds to
something that i can use in a list
so this is in in essence this is what a
hash function
does so it it's responsible for
tran you know moving from here to
here and then once i have this i can go
and
index it into an array and say like okay
arr
sub index so one way of doing this would
be something like you know for example
if we had um
you know jesse and we had guy
and we had chonk and we had nathaniel
nathaniel i'm sorry you're after chunk
one way of dealing with and don't get
into trees nathaniel come on
um one way of doing this those i have
done unfortunately would be doing like
some sort of look up table so i could
just say
all right jesse is one guy is two chunk
is three nathaniel is four
so i could just find it in the lookup
table and maybe this is less i'd still
have to go looking through each entry
until i find it so this wouldn't be
terribly efficient
to figure out right because i'd have to
go look in this table to get this index
to be able to look in this array not
super cool right
right sorry i got distracted my chat i'm
sorry
humor yes i'm following you okay
um so what we want to do is we want to
take
um this this key
in some sense and we want to transform
it into a number
now we don't really care what the number
is we do
care that they're not the same so right
we need
unique caches yeah we want our our hash
to try to be unique
so that because when both of these are
the same they sort of collide on the
same index and
i could build my array so that it's
really an array list
you know so here's the index and then
entry zero and then
you know there's also entry one and you
know i can make like a sub array
and then i could look through each of
these and say okay is this one jesse
yeah okay cool i was looking for jesse
so that's that one then i can go
look whatever but that's more
work because now i have to go look this
thing up and then i have to go
look through this sub list now
unfortunately
there's not a lot of perfect hash
functions
but quick question yeah so you've been
describing you know this this
transformation where we have kind of
like the input is a key and the output
is this index
but the the assumption right if i'm
following you is that
that index will reflect like our
priority as the searcher
so like it will in some way no
no so because then what what is really
the difference
between the key and the the index in
this case if it doesn't
change it into something that is useful
for us in terms of
this is a location in memory it's an
index
this is a key in our key space so this
might be a string
you know in this case and this index
is something in a memory location we
don't care what order this is in
what we care is we have a way to turn
one of these
into one of those we have some sort of
function that gets us from here to there
don't we already have that with its
memory location yeah but we don't if
if if i say array of
chunk what memory location is
that well we have a don't we have ways
of just printing that because that's
what we've done before we have very easy
ways of pulling out what that memory
address is right
with like pointers and stuff this is
first of all this is not valid notation
i mean i i well yes yes this requires a
number i mean it has to be
like zero or something you know it has
to be one that has to be two
you can't see so you're looking for if
you know the value
hashes help you find an index in a way
yeah not value if i know the key
so the key here is chunk and i want to
look something up with that chunk
so in in the more in the in the more c
sort of way of doing it
your key your your index is like your
zero one to your your
let's let me change how i'm describing
this that's that's
your placement in the array yes and so
if you don't know the placement in the
array but you know the value
you know the really value what do you
mean so
think about like uh the value being
sorry
when i'm used to talking about arrays
and math like you have your
your position sure but then the the
thing that occupies the spot at that
position it has a value and that's the
value at that position
so like if if i have this array an array
z
like say array five array position
actually here let me give you let me
give you a simpler example value is
chopped let's say that there's something
that maps
uh a shape to the number of sides okay
so
triangle has three okay
square has four
rectangles has four
pentagon has five okay so there's an
association between like
this some kind of key and some kind of
yeah just to clear up
like terminology this is the key
this is the values okay
but you're using like the python
definite like uh dictionary use of key
and value well
python's using the much oldest i mean
for my reference that's what you
use yeah this is key this is value
where's index in this
i see okay it's nowhere right i mean so
that's kind of my problem is what
location of memory is triangle in what
location memory square in
rec pentagon what if they're not in this
order you know what if it was like
square triangle
because there's no notion of order to
this right
it's whatever order they're in this had
just better be
right you know so this had better be
three this had better be four
but what order they're in is kind of
irrelevant
right yes i mean all we care about is
that when we look up triangle we get
three
and when we look up square we get four
okay so if i'm understanding this right
hmm okay yes i think i'm following you
okay this is talking about key and value
it's not talking about index i guess the
reason why i'm struggling is i didn't
from what we've learned in c so far you
can't do an array of like multiple data
types like that
and in fact you can't do an array okay
we've never done that yes we have we've
used structs
we use those arrays technically no we
can make an array of structs that's what
we did with our planet example
so we had we had a planet name like
mercury i guess i thought that was more
of an array of like pointers
two strucks yes that's an implementation
detail but
yes i mean we effectively had a data
type that captured a name
right so so for the planet we had a name
which was you know some sort of you know
character thing so
it's like care 20. and we had
an orbital period which was some sort of
like unsigned
or whatever integer and that was like
o period so we had exactly what we're
talking about before
right i see what you're saying okay we
had a relation between these the problem
was
we had no way to jump straight to one of
these names
right so if we put all of these into an
array like we defined it as
struct planet
planets and we said there were nine of
them
then we don't have a way to jump
directly to
say mercury or a way to jump into unless
we knew
the quote-unquote index yeah we would
have to know the index and so the only
way we could do it back then
was we would basically one at a time
have to go through this
list and we would have to say you know
we need like some sort of for loop
you know four basis to iterate through
them all here i'll just borrow math
notation since
yeah more compact so for each planet
basically
um find the one that matches uh
name you know name is
you know the thing we're searching for
and
when we find it return the associated
orbital period
so we we have dealt with this sort of
like
you know concatenated data structure um
but anyway this wasn't too bad when we
were going through all of the planets
because there were only nine of them
right so we just to your point when
you're dealing with like three million
or whatever
it suddenly becomes prohibitively large
yes yeah it's not okay
this is our second problem of computer
science basically like you know
computers
not in the computational sense we have
them in the ordering sense
right because now we need to go now it's
finding the data
is significant before it wasn't like in
this list finding is like nothing
it's basically free so
but in the case of like 300 million
entries or whatever sort of big data you
want to work with
finding is not free
so we have to be very careful about how
we do it and that's basically the space
that
um hashes and sorts occupy all right so
that's that's our rough intuition
are we safe on that so far we have our
terminology straight
keys and values different from indexes
where these things land in memory
is totally irrelevant we don't care if
jupiter is stored at this location or
that location we just care that
when we find jupiter it has the value
that jupiter needs to have for orbital
period
right so it's not so much its index is
almost irrelevant
but nonetheless we need to know it so we
can find it yes it's important
once i put the stuff into a particular
memory order
but it's not important in the sense that
we don't really care which exact
location of memory it occupies we just
care that we can find jupiter when we go
searching for it
or we can find triangle when we go
searching for it okay
so that i just wanted to clear up the
terminology i think yeah yeah i'm
following you
and uh uh marcelo von
thank you for the follow oh um
so yeah let's let's just jump back here
so let's switch gears
to this sort of strange green thing um
so great
oh this is actually kind of fun this is
a little bit radioactive so
um at least to me it matches the color
that my name is in chat so there you go
oh
hey that's funny because you're showing
up as purple in my chat
i'm like this sort of like pea soup
green
okay lime green all right that's neat um
so hash this is a way of dealing with
that so
um the idea being that if the data
you know data does not need to be or
so i shouldn't say that i should say
keys in particular
keys not organized
so they could be in any order right
we don't really care what order they're
in we just care that when we
do the look up it's fast right yes
we don't care the order we just want to
find it quickly yes we just want to be
able to find it quickly so the keys are
not organized
um there's usually associated values
with the keys
and this is typically the point of our
hash function is
we we typically want to get the value
now hashing itself
refers only directly to the key it's an
operation to find
where basically it's it it wants to find
the index
find the index of
a given key
so hold on and this is i this is me
being like a little bit of a
like a little pedantic with with respect
to math is that
like if f of see if f of key
is hash technically what you're talking
about is the inverse
of f
fed the hash results in key yes
so like no okay so which is it
that no no it's it's no because they're
not it doesn't have to be one to one
so that your inverse doesn't hold true
what do you mean it doesn't have to be
one to one
um we can have hash collisions
so we can we can say like what
yeah yeah basically the whole point of
it that it doesn't do something like
well that that's what we
want the problem is in the real world
it doesn't always act perfectly well
so um so one of the most common patterns
that you have to deal with is
how to deal with this notion of hash
collisions um
and that's the reason why your your
inverse is not correct because in order
for it to have an inverse it has to
be onto and have you know one to one
right
mm-hmm yeah for biject for bijection for
like the definition of a function
okay so um that we don't necessarily
have that for hashing
that's that's the notion of a perfect
hash by the way
oh that sounds nice so if we have a
perfect hash which sometimes we do
then it is invertible in in the sense
that you just mentioned
um but hashes by by default
usually aren't perfect
so okay and so i'm sorry if this is
super ignorant
but then what the heck is the point like
the entire at least from from my sort of
perspective of you
in introducing sorry stroke yeah
it seems to me quite
confusing and and a bit frustrating
that like you invent a thing to make
sure that you have this like a unique
identifier for keys so you can look
stuff
up quickly and yet somehow the way that
it's been commonly implemented
breaks the very thing that you were
looking to build with it isn't that kind
of
like not quite like you know what i'm
sorry is there something i'm missing
you do it better you're you're not
missing anything what i'm going to tell
you is that you're the programmer
so so you're gonna be doing this
so if you if you want me to add the
restriction that you're only allowed to
implement perfect hashes this might be a
very long class today
you're fucking me i'm talking somebody
who's way smarter no no
this is the problem basically there's a
lot of spaces
some of them just don't have perfect
hashes it's it's
not necessarily mathematically
guaranteed that a perfect hash
exists but i i can tell you this like to
your point
yes software engineers try to do what
you're saying
like i mean because i understand the
cryptographic hash and like
also not perfect by the way also not
perfect but i understand why
that is is difficult and like
it's because you are intentionally
trying to obfuscate the inverse of the
cryptographic function but in this case
an inverse shouldn't be a problem like
in fact that's desirable
and so it kind of seems like on the one
it's like the whole problem with the
cryptographic hash right is that
you don't want it to be super easily
reversible
well all right what if i told you
shouldn't you want it to be really
unless
it depends on your domain but i can tell
you that that
it it like all things in computer
science you're usually trading stuff
so the typically this isn't 100 true but
typically the better the hash is
the more expensive it is so and we're
going to get into that in a moment
but um expensive from a financial or
computational no it's it's
expensive is always for our purposes
always going to mean either something
computational
or something memory wise
it's either speed or memory that we're
typically trading um
so you sort of want to get by with like
the quickest hash that works pretty well
i see what you're saying so so it's
almost like you have this this this
balancing act yeah
the memory that you use and the speed of
the search yes and so perhaps that you
could make this perfect hashtag then
like how do you make it not slow which
is
feeding the entire purpose or maybe one
programmer makes
you know a program that requires five
gigs of memory and another programmer
does it in one gig of memory you know
like
this is one of those places where you're
trading that off
so like do the gigs matter you know like
is it
is there only one machine or is it going
to be running on people's machines that
only have four gigs
yeah are you microsoft and you have
super computers and all you care about
is speed yeah
or you know are you like me where
microsoft cares about speed
when did that happen okay all right
anyway that's that's
good okay sorry i don't know it was the
first like
that that giant i do not think of
microsoft there you go
video cares a lot about speech yeah
actually i would say they they go
far on the other side anyway all right
so yes
that we can get into that in the second
segment so um
anyway let's talk about hashes so what
are the ways that we can
sorry chat oh
chat today thank you chad you're
wonderful okay so how do i design a
hashing function that takes something
like
triangle and outputs something like
seven right because i can use this as an
index
and then i can take something like
square and i can output something like
three again i don't care what these
numbers are because they're indices
so i just care that they're hopefully
not the same
right so um that was a really bad equals
sign but that was you know this is
here i'll i'll do it oh it's getting
worse
all right so um so how do i how do i
actually do the mechanics of this how do
i take something like this
and turn it into something like that
what what sort of ideas like what would
jump to your head
oh god like not
not very advanced things like i mean i'm
reminded of elementary school when we
did all of those like shift ciphers and
stuff
yeah those words right actually that
we're going to be using a variant of
that so
yeah maybe maybe you're too smart for
this class
so i should warn you that my my dad uh
he
he is a linguist as well and his focus
in his career largely until
like the last few years was um
cryptography
so i grew up largely around like these
ideas of like how do we take something
in terms of like how do we take a
language
a given language word so a string and
how do we convert it
into something that is
truncated from a memory perspective but
then also like
in the case of numbers like sortable
things like that so
yeah it's a very fascinating idea but
also
cryptography is really cool that's
actually
hashing theory that's not true we'll see
i've never
heard the cryptographic now i'm not
concerned today we're not really
exploring cryptographic hashes because
crypto
has some additional constraints that
it's going to throw on the hash which i
don't want to talk about
right now but we we'll get to that in a
little bit let's i think though at least
for like
given that that's what i'm more familiar
with the understanding of like
converting having this unique hopefully
unique identifier
and converting sort of a given
data packet if you will into something
that is a an identifier of sorts is kind
of what we're doing here but instead of
like
a file and an identifying hash we're
talking about
you know a key an index or a key and a
hash
yep well we want to get a key and then
we know that once we find the thing
we're going to have some sort of value
that we're going to pull out
and the value so here i'm just using
sides you know so this was like 3
4 4. and by the way don't read what he's
saying because we're going to get to
that a little bit later
um it's like a bad idea so
what well one way we could do this
by the way i say this only because like
we've been dealing with such small
pieces of data in the c-class that like
there are so many things that could add
up to like 16. oh
don't worry i've got some big data for
you tonight oh boy
that's like that's not to be worried
identifier data that we're drinking no
no no don't worry
about that so okay all right how about
this for a hash function
how about the number of letters is the
hash function
okay all right so square you would get a
lot of
hash collisions depends on the key space
so in this key space this would be a
five
right what do you mean by key space uh
this
key space these keys okay so in this set
of keys yeah in this set of keys
um so square would be five rect would be
four this is a really fast hash to
compute right
yes pen tag would be six that's why i
didn't write the whole word out
the triangle would i want this to be a
perfect hash for now so i'm rigging the
problem a little bit but
um this would be eight okay
i can use these as indices right
for now i mean so if i have an array
somewhere else right and i define this
is going to be
you know some sort of struct array or i
could do it as
two different arrays i mean we could
have a key array and a value array
it's that'll be up to you how you want
to do that um
but let's let's just assume for the
moment that it's a struct thing and and
you know this will be
like polygons so we're gonna just call
this like poly
and in there we have a name
which is you know i don't know
we'll give ourselves 10 characters and
you know a number of sides which we
could we could stop we'll just use an
integer we won't be efficient so
um and this would be like how many sides
and if i go and use this
struct for my array
i could say you know we could call this
our database
of all of our data
and let's say that it has to be some
size the size that we're going to pick
is going to correspond to with our our
hash function
so we want to use as much space as our
hash function turns values out
so let's say that this thing we'll just
make it big for now we'll just say that
it's a hundred
um that's we're not gonna be dealing
with any objects that large but you know
i'm i'm padding the space a little bit
so and then what i can do is i can say
like you know assign
poly let's just assume we have this
mystical function which would say
db um you know
at index 8 is going to be
triangle
and it has value three
okay and then this assigned poly is
basically going to conceal our silly
little hash function
which is just going to say that this
thing ends up
well actually sorry i already calculated
the hash which is right here
so right because triangle was eight in
this
number of letters right
okay um so i can put all of these things
into this database and if i use the same
hash function i can look them up so i
could say like
lookup triangle
and all this would have of course being
functions that we ourselves have defined
oh yeah we're gonna have to we have to
write these yeah i was gonna say i was
like
and and so lookup is basically just
gonna compute the hash right it's gonna
it's gonna say all right well
triangle i i have to write this function
it's one two three four
five six seven eight that's an eight so
pull db sub eight and that's gonna have
you know our record of our poly all
right does that make sense
i think so yes okay now this is not a
particularly good hashing function it
was good over this set of keys
but the minute i get another polygon
that has the same number of sides
i mean then you're gonna have your
collision yeah i'm gonna have a
collision so
i can write my hash function to be sort
of smart about this like by
treating these as buckets which is
something strophia mentioned i'm not
gonna get into that tonight that's sort
of a
it's a little bit more advanced let's
just assume that our hash functions are
all perfect and they never collide
okay as our initial simplifying
assumption
okay um so this is
a naive one another relatively naive
hashing function that we could use is we
could actually treat all of these
letters as numbers because
we know that they're stored as ascii
values right
yeah so that's kind of the the shift or
like you know you're yeah
there's a really as a kid yeah
so what i can do is i can say add them
all up
right actually i'll get really crazy and
i'll use a sum so um
so i could use a simple add function and
say that the value of all of these
things together
is the value of the hash
so that would be
that would be interesting in a language
that has
more i would say tends towards more
unique
combinations of letters where as opposed
to english i feel like
we are like there's a reason that
anagrams
often work in english but there are
definitely languages where anagrams
don't work as effectively and so
um it's sorry that was just like no no
no no that's really interesting like
this is a much better uh no you
mentioned hash function than previous
it's definitely better than the previous
one it's gonna it's gonna work it's
gonna be
less collisiony um yeah
collision prone yeah it's gonna be less
collision prone but um
it and and that's basically kind of what
we're concerned with is like
when we're evaluating the different hash
functions that we pick
how often do they collide right that's
that's that's really what we want to
know is
you know for a given key space which is
you know set of keys
like how much does this hash function
collide in that key space so in this
case we're using
sort of words you know in this example
it's words from you know geometry
um would be our key space so it's not
that big a set of words so the odds of a
collision with a reasonable hash
functions
you know pretty small um but
something like uh you know names i mean
some of these things you might not even
be able to get around like you know if
you have two people named john
then you know it's gonna collide
[Laughter]
it doesn't matter what hash function you
use like you're gonna end up with two
john so you might need some other way to
disambiguate your entries so that
that's that's just an important little
side note like collisions are sort of a
fact of life and potentially so this is
just like
again please don't make me do this
myself
but i could also see that if you are
dealing with
you know data structure structs
specifically
that you could potentially use the
combination of various struct
attributes in your function to kind of
create a multi-variable hash function
and then you probably would have a
lesser likelihood of some kind of hash
collision because you'd be looking at
unique combinations
of multiple variables yeah
that sounds reasonable but don't make me
do that
that sounds sweet very simple one please
can i count them
we're gonna start with the simple one so
let's let's do let's start i'm just kind
of you know daydreaming a little bit
yeah i mean
let's start evaluating these so for now
i'm not going to worry about looking up
the related value
suffice it to say when we have a good
hash function
looking up the related value is easy
right the whole point of this
is to make it easy to kind of index into
an array so for for now
we're just going to calculate sort of an
index
and let's write that as a function so
we're going to pass it
a string and say to it like hey i want
you to return
whatever the number is that's going to
be this hash and for the number
let's let's have it always be numeric
because that's nice and easy to index
into an array
actually this is not the right way this
is the right array
but um as long as we have enough
memory space for this so um
let's assume for now that our hash
function
just needs to return uh a character word
so like you know
a character is how big
are you talking memory yeah um
a byte eight bits yeah it's eight bits
so if i
if i assume that my hash function has to
return a care
um really i'm just getting at the fact
that i want an 8-bit number which you
know
using the new modern notation would be a
u int 8
underscore t um so we can actually write
it this way and make last miles happy
then let's write that hash function and
let's just write
our our first one which is really just
going to be um
number of letters and we already know
that we can
probably do better if we do some of
letters
so we're going to write that too um and
we're going to need a small set of
um i'll give you a couple tests to kind
of work with we can we can use the
planets actually that's probably even
easier
all right sound good yes all right and
then
once we get through that i'll show you a
better hash function
okay all right cool so quick sort of
conceptual question before we get
started um
and again not not sort of thinking about
the scope of how i'm trying to learn it
today but sort of in that wonderful
world where
we are trying to make our ideal hash
function
doesn't in some way the type of hash
function that we create
depend on the sort of searching we're
doing
that depends very much on the key space
which is why i keep referring to it
because
if your key space is something like zip
codes
what does zip codes look like in the u.s
uh the easy way is five numbers right
the
usps one is 5-4 i think okay yeah i mean
the full one would be 5-4 but i mean
like your basic zip code is you know
five um so if we use
that as our hashing function like say we
were looking up like the name of a town
based on its zip code we could actually
just that is its own hashing function
that by the way
that's called the identity hashing
function so depending on your key space
that's allowed right whereas using like
the sum
for that might be a really bad idea
right
yeah that would be less effective yeah i
mean because you know a lot of them are
probably going to sum to the same value
so you're going to get a ton of
collisions in that key space but
versus like if we were just doing it
like we could actually just use the
trivial the idea sorry the identity hash
which is use the thing as
the hash itself that works a lot in
numerical spaces but
for string kind of problems which is
sort of what we're worried about like
you know the equivalent of like a python
dictionary is kind of what we want to
implement
then um we're going to need we're going
to be taking text data so
let's assume like words so let's use
planets as our sort of starting point so
go ahead code this up give me the
function
for um this is our hash
number of letters okay
number of letters all right let me find
shoot i closed here's our here's our
multi-pad okay
ooh things have frozen for me so i'm
going to refresh
hold on okay i'll refresh too just to
make sure
yeah i think we're good okay so
okay now i don't normally start you off
with nothing here so i'll give you the
i'll give you the invocation to get you
going
thank you so you can have that
and you can know that this returns zero
and i'll
zoom in because i'm such a nice guy i'm
going to throw in stdio.h so
you can thank you you can have sda oh
and because we want to make class miles
happy and so he doesn't yell at me
let's uh i'll give you standard into and
that gives you the types like
you and a tea
so okay all right you're good to go
okay so ideally
we are making a a new function
that returns an integer
um let's return a character well an
eight bit value let's use the uint eight
business do i need to do it like that
yeah underscore two yep okay so it'll be
a small
we don't want our hash function to
return too large a hash
like the size of the index space the
size of the hash function
codomain is relatively small there's
only 256 possible values you could get
out of this because there's eight bits
so sorry really quick what is the naming
convention
in c4 functions uh that's
that's those are fighting words so i'm
gonna just
say you can just have fun and
yeah you can you can camel case we've
been we've been camera casing mostly
so let's see what is it gonna we are
expecting it in this case right so we're
counting the number of characters so
you might want to mention what hash it
is because you're going to implement a
couple hashes so this is the
okay the make that
num character makes some
oh well we're not and then we have a pun
let's do let's do the length
all right we want to do the sum first
that's fine no we could do length first
it's okay
make one
what is screaming snake case screaming
how do i do it because that sounds
awesome all right so that
sounds oh no awesome
that would be good
the pep 8 or whatever it's called for
for python on like naming conventions
well that's right there's questions
yeah and i'm sorry but no i don't care
how much people want me to do that
that's awful looking all right
just know let's no that makes me sad in
my heart
so whatever the guide says i'll i'll
probably do for
production code but it'll still make me
sad all right so
let's think here so we are expecting it
to
take in
a string right we want to feed a string
but technically that's going to be an
array of characters can i just say
character
array yep i can say array well i mean
you know you know how to define a
character
right i mean we can just kind of say
um we're going to call it word
because it's going to be a word um but
i don't have to okay no we're fine i
think i'm overthinking it
yeah we're just starting a nice easy
hash function let's not go crazy
up front i know um
rust calls for a screaming snake case
for constance i think python does too
yeah
i'm sorry i don't no i don't hold to
that
to me just no um
okay especially sorry especially in like
physical sciences where you have a lot
of constants
that it just makes it makes my code sad
okay so
let's think about this so we're given a
word
and can i use the size of the operator
what type is word here it's a character
no i've got to do the pointer thing
because it's an appointer to the thing
in the array all right yeah
it's not a single character i know
okay so
can i use sizeof or is that not okay
um no sizeof is not a
no why because you don't know this why
did you show me this magical thing if
i'm never allowed to use it
we can use it for basic types we can't
use it when we don't know like for
example you don't know if you read this
line out of a file because you know
that's what it was
screaming camel case is not a thing no
evan chad is chad is pulling the wool
over your eyes
okay screaming camel okay
yeah i'm sorry so
i think it's a thing it's a thing now
what is the name is there a proper name
for that like
emo alternating like capital and
lowercase letters that we all used in
you know the 90s what is that
please tell me that's that's the thing
i i would refer to that you got to have
the little like lower case x capital x
like brackets oh okay all right you have
a functioning
um i am focused all right so if i can't
use
miles last miles just for you we're
using you ain't eight
here tennis all right so then i will
probably
make use in this case of that
null terminating fight yep so i won't i
won't do my cheaty way that i was hoping
to do it
i like cheating okay so
let me think this through
no that won't work i'm gonna do it this
way no need
i'm gonna do it my way first i'll do the
like slow whatever way and then we can
make it pretty later
what's the first value of i i'm working
on it
hold on i'm don't look not looking
you're not allowed to look
not looking okay thank you um
doing my own thing for a second i do
things in weird order
so when this is we are going to
we're going to count there and then
but we need to find a way
of iterating
this is going to be our way of iterating
so
we're going to start with zero that
makes sense we are going to need a
count which is going to be an integer
which will also start at zero
why is it not doing a little like and
why is i broke so
let's um let's save i think the editor
got
okay i'm just probably sorry i should
have
it should have done its thing okay there
we go
okay all right that makes me feel a
little better um
so
okay so while it's not the null fight
at the very end we are going to
and i know there's some fancy way to do
it but we're not going to do that right
now i'm
gonna do it my way and then we can make
it all fancy all right
okay so while it's not the number um
that might work actually
it could be that actually
actually it will check at the beginning
this might work ah poop what did i just
do
um that actually might work
okay what are we looking at no no
no i'm not looking okay no you are not
looking i was trying not to look
but you said it would work so i had to i
know i know i know no i would like to
test it please
before i well let's get rid of that
screaming camel case
i feel that i should mention this this
exit success is kind of the modern
c approach to um returning the right
thing
uh oh okay in this case it means zero uh
i'm referring to uh
there's an old unix joke about um you
know why did the roman empire fall
uh okay lacking the number zero they had
no way to return success
so um
i like to return zero for that reason
ah here you're allowed to cheat because
you're defining it so that's okay
ha okay um hey you're not supposed to be
watching
i'm not watching i just uh i detected it
off by one error there's sort of like a
spidey sense that that
kicks in and that oh yeah
i'm trying to ignore chat too in case
they like yeah no ignore chat they're
going to tell you
i'm gonna
[Music]
um
oh oops i clicked the wrong button okay
i expected it let's see
asian declared find an identifier oh i
spelled it wrong that would also
look at this modern compiler telling you
you spelled it wrong
okay beautiful i'm ready for you to look
now okay i'm not gonna look i'm gonna
test it
okay
oh
all right half so there
looks looks pretty good something like
you know a high schooler could do but
that's fine
hey that's you got to start somewhere
now again our
our core problem is you know if we had
something
it's easy to get a collision with this
function i mean there's a lot of words
that are the same length so
this this would be an easily collidable
function but
like i told you we're not worrying about
collisions for the moment
okay i'll i'll rig the key space so that
we don't get collisions
okay for now for now we're gonna we'll
get messier later okay so
um that's your make len hash now let's
do our make some hash
piece your little pun there
ha ha yeah
make some hash all right like i i'm not
too picky about your casing but you do
have to be consistent
oh sorry i would yeah i was just typing
quickly okay
um so i think we're going to start in a
very similar
way
[Applause]
uh okay there's one thing i'm going to
knit pick in last program before you go
on
um we're not going to calculate this
with an in because we're returning a you
in eight so this
should i have it just match yeah this
should be a u n
eight so that's fine
your program will still work fine
okay
we i fixed it last miles i got it before
you did
oh oh oh i didn't get all of it but
that's okay red vowel
this no no this should be you went eight
two
okay okay um all right so
oopsies
so let's see here
all right
so i think what i'm gonna have to do is
i'm gonna need a couple of things we're
gonna need the
is that a function am i good just yeah
okay that should be fun and
oops no i need to be all fancy
and then oopsies
you ate nine new numbers
maybe i can be a little fancy
let's see if i can be a little fantastic
in trouble
oh no all right maybe
no ah all right i like doing things on
okay we'll see i'll probably get in
trouble for this all right
um
well i've always got that final exam
apache module if you wanted
let's see if i can remember how to do
this um
you don't need to cast i want to
okay well then cast the right type if
you're gonna cast
do i have to okay all right all right
let's see
um
oh lordy why is it all the way over
there stop that
no thank you no
enough of that business he needs to be
here
oh crazy
so you said i don't need to cast no no
it's fine
casting is fun
i mean it was a care so it's it's good
to cast the the
the main reason here you don't need to
cast is because care actually is a you
in date so
okay i but you're being totally correct
i mean like this is okay
very clear what you're doing okay
okay so i broke it so let's see that's
fine
broke it i don't see anything i didn't
get an output
oh that's because i just cleared it oh
stop clearing it sorry sorry sorry
um hold on let me save and refresh
because i think it's it's oh okay yeah
i'll do that too so did you hit save
i did hit save so we should be good all
right
are you having a value pop out because
for me nothing no i haven't run anything
uh here oh
that's what i meant as i broke it oh you
might have really broke it
i mean i think i really broke it all
right yes okay
you figure this out how did i really
break it okay
so that's a good question let me go
through this all right
here i'm gonna actually i'm gonna freeze
your output for a second
let me don't touch the editor for one
second okay
my hands are often okay i'm gonna reset
that
and restart and that
wonder if part of the problem here if i
can just make make a possibility
is that let me try something else real
quick um
let me try
hold on no that didn't even work okay
so the only reason what i was thinking
is that
yeah what was that uh the
so i was thinking that the sum
of all of the uh oh crap the words is
ascii numbers or whatever sorry the
words escaping my head um
might be greater than 55. overflow is
totally fine
it'll just wrap around and that that's
not a problem
that wouldn't be causing this kind of
break no no this is this is something
different
um i'm gonna you know what i'm gonna
just store your code for a second i'm
going to reset
the multi-user pad i wonder if like some
loop ran
off okay
it doesn't have infinite loop detection
of any sort so i'm just going to
interrupt this
and that should throw that thing and
then we're going to start that back up
and we'll select this backup
and i'm going to delete all that
i'm going to hit save and let's put that
back
okay so i'm going gonna comment this
thing out and we're just gonna run it
okay so this is back to working so if
this thing freezes it's because you have
an infinite loop in here
okay um so
this so let's just take the only loop we
really have here is the while loop
what are you doing to i oh i forgot to
so yeah you really did break multi-user
pad
i'm here to help you test it oh no we
that's a known broken thing i got it
fixed no that was oops
there we go all right so so let's clear
this
um it should work now well actually i
commented out make some hash so let's
oh i uncommented it and write it but
okay maybe it shows as being uncommented
on my version
okay you run it let's see what it looks
like i'll refresh
let's save and refresh this good yep
uh-oh
okay i'm gonna stop touching it for a
second yeah okay yeah it's running a
month yes
yep it's running on mine or let me try
it on mine actually real quick so let me
try
execute yes all right we're good
okay so um overflowing in hash functions
especially with like unsigned this is
totally fine um you want to wrap around
your key space
uh but you know ultimately you're
probably going to use a bigger key space
to avoid collisions but again the larger
you make the key space
the less chance of collision but you
sacrifice now you're throwing away a lot
of memory so
that's like i told you you're always
going to be trading especially with
something like a hash function
okay okay so that worked pretty well
those that you just
knocked those two hash functions out
minus one little infinite loop that took
down
which that's easy to do so
um okay so
let's do slightly better okay
um so by
slightly better uh i'm just gonna show
you one
last hash uh do you want us would you
rather see
i think we should just stay on hashing
tonight right rather than just jump
ahead into like sort of um
may have a short bathroom break yeah
yeah go right in okay i'll be right back
i'll start i'll start writing
sounds good all right
so this stuff
yeah let's just yeah for those of you in
chat i
haven't um multi-user pad is still
running
in a container that's like relatively
unwatched i mean just sort of like os
systems the thing out i think is
what i implemented it's it's glaring
security risk i mentioned that in the
docs
in the front page of the of the github
which i'm sure no one looks at
but that's um
huh that's not what i want it's we're
not doing nintendo programming tonight
so let's say uh yeah coding with some
guy
and um i got bingo up there so you can
check that out too uh this has
i think it's just an os system
so obviously we're gonna need some job
control and you know
some other stuff i was thinking about
going with um
a micro kernel architecture sort of like
um
the the micro vm style that they use for
lambda
that amazon's doing it looked pretty
heavy
so i'm trying to keep this to be
relatively simple but it's nice that
they run the whole thing in like a kvm
execution environment so then you can
really like police it and you know it's
got sort of proper runtime control i
mean because like all your other stuff
like jails you can break out of jails
you can i mean
like where's a good place to run this
stuff i mean ultimately this is supposed
to run in the browser so i guess i could
go back to that too
um i this was in the
not the it moved over here but now we're
in um
i want to say it's in maine uh but yeah
it basically just
os systems like the call and you know
things happen
uh so that
anyway um yeah okay subprocess a little
better than
that's right i got i got the error
messages now too all right so anyway
um you're back yes i'm telling them the
woes of why
uh multi-user pad uh runs away there
that's that's bug for me to figure you
know you'll you'll be able to fix it by
the time we get there
all right so um
i'm going to show you one more hash uh
like hashing
is like a pretty deep field
in general um but you you've kind of got
the general
idea of actually most of the popular
hashes which is
you know we're gonna take like each
character
we're gonna go here i'll just zoom in on
the whiteboard um
so i know nate shade dudes here so i'm
definitely using the word whiteboard
um so we're gonna take each character
one at a time we're gonna do some sort
of math
up on sort of you know the
the total hash value
and this is just going to kind of keep
turning over
as we sort of rip through our character
array and then when we're done we're
going to return
the total hash value most hash functions
look like this
not all of them but you know this i
should draw a real you there
this this is sort of like the general
pattern you're going to see for most
hashes and
the whole part that's really kind of
magical
is in here um so this is the part that
really gets better
or worse on different sorts of hashing
functions and
um again i'm i'm still ignoring the
space of like cryptographic hashes
that's that's sort of a different
it's related but it's sort of a
different there's totally different
requirements
that you're kind of looking for um so
yeah sorry nightshade dude uh i
gotta fix that so um
anyway uh the last hash that i'll show
you uh because there's
usually the the correct answer for
almost all of this is
you i'm gonna i'm gonna write this big
use
a library
okay okay this is the right hash
function
okay don't generally try
writing and implementing your own hash
function there's all sorts of like
things like avalanche and
and you know perfect hashing collisions
and there's
like you know you want to be efficient
on space and you know you want
like like there's lots of different
trades you want to run it across lots of
different data unless you
really know that like some off-the-shelf
hash functions aren't working for you
don't write a hash function like you
know okay
like like the right finding the right
hash function is going to usually be
messy
don't write one so i'm going to give
that a big
x that's really important now that said
if you're really living somewhere and
you don't have a hash function handy and
by the way this happens a lot in
microcontrollers you know
you might not have like enough library
space and you just need like a
simple little hash to get going and you
know whatever one of
like the great ones that's around is
called the djb
hash and i feel like we couldn't go
through a class and see without talking
about the djb
hash um so this is named after a dude
uh bernstein daniel i i forget his name
um it's been djb to me for so long that
i i just forget
but um and the
the djb hash is a um pretty
let's see i can not use your length hash
it's very similar in form to what you've
got
it it returns it's it's a 32-bit hash so
we're going to return
i've modernized it for our use tonight
so it's going to return a 32-bit
number which is sort of what djb had in
mind
and it's implemented in almost every
language by the way so
you you were pretty safe at
you know finding a version of this that
has been error checked and everything
else so
you're going to have some sort of string
up here and
the way this hash works is pretty much
the same as yours except
he juices a little bit he starts off
your sum
started off at zero
yes his sum doesn't he likes to use
prime numbers so
he he dropped this by the way on usenet
like years ago so it just
it kind of famously became the djb hash
so he sets it equal to 53.81 he doesn't
refer to it as some he calls it
hash because he's not actually just
going to be summing the digits
um and he's going to use a
a character that while he's running
along and you know so he's going to use
a signed 32-bit um okay character which
he's going to refer to as c
and on there he's basically just going
to say you know your same kind of
while every
character
in s um and i'm pseudo coding a little
bit
uh then hash is gonna be
the old hash you know which you you
well i'm just gonna write it out just to
be more clear you're gonna take whatever
the old hash value is
you're gonna multiply it by 33 and then
you're going to add the character
okay
and then you're going to return that so
return the hash
this is super good if you're ever in
coding interview and the interviewer by
the way when you're when
if you're ever in a coding interview try
to use the trivial or the identity hash
first okay just be like all right how
about i just hash by using the name
you can always use something like sum
they could say all right could you do
better
dj b is better so
this is this is one um it's it's less
important now like i would say you know
worth knowing like if you're going into
like a c interview but um
reasonably you can always look it up um
i will drop a link
in the chat just because like hashing is
a kind of important topic and you know
and see you don't get them
there's a nice summary of all the hashes
that i
over the years i always end up referring
to um
which is which is here and so you can
just see uh
you know they'll walk you through
different hashes and why some of them
are better or worse
djb is one of those like it's just
simple and it just works most of the
time for most key spaces
so it's a little bit better than your
sum because it's got these prime numbers
and you know this
the relationship between this 5381 and
the 33 is just
screwy enough that the hash sort of
spreads out a bit
makes sense yeah i think so so so this
is this list is very interesting so i'm
sorry
i'm pinning this to like probably look
at in more detail later but that's
yeah that's cool okay i modernized it um
for tonight so i'll just
include it here just just so um and by
the way uh
theronin i believe you're right i think
that uh
java does default to djb um
it's it's just a pretty solid hash so
here's here's djb
in more modern c um i
left actually sorry this is the not
modern version that's the version you'll
find on that page
uh the more modern version that i
modernized for your
session tonight is this which i'm just
using
these types and i'm so glad that last
mile showed up because i did this pretty
much just for him
so i'm using the the standard in type so
this is the nice stuff but
um and again like you saw i i threw a
little bit of the performance away
because these days
compilers optimize so i actually
multiply by 33.
he does this as a bit shift which we did
last week so he just bit shifts this
over
five which is the same as two to the
fifth and then he adds hash here
so i i just said all right well let's
just be clears of it
you can see how despite the fact that
djb is like such a good hash it's really
really simple
like it's just not a complicated hash it
just
works really well because this 33 and
these 5381 and the ascii character set
they all just hate each other enough
to sort of spread out and not collide
very much on words
whatever reason it's it's an interesting
kind of like mathematical problem yeah
hashing like you can get very deep like
i said
we're just touching the surface very
cursory
of um hashing in general uh there's
different hash functions that typically
apply to different sorts of domains
but anyway that's not really a topic of
c programming it's i
included here mainly because you're used
to having
dictionary type stuff in python right
you know so like if you have a database
and
you know it it has like you know
mercury and uh say that's five
and jupiter i don't think so just i
don't think our um
the multi-user pad is up for stream just
oh oh
sorry yeah in case you wanted to share
your databases with friends
i'm writing that out say this thing is
like seven
and you know i mean so when you're doing
this kind of thing
it's it's converting this into a dict uh
this becomes a
you know dictionary type this
implicitly has a hash function built
into it i'm not sure exactly which hash
function i think they use different hash
functions depending on
some of the data that comes in because
you know it's python and
whatever so like they don't they're
they're sort of building to the case
like they don't know what type of data
you're going to throw at it so
there's no perfect hash function for
everything you know there's kind of like
good
hash functions for certain domains like
so here's to be such a like
so that must be such a
flexible hash function because you can
nest dictionaries and nest so many
different data types in python then it
must be a hash function that can really
handle just quite a bit of different
data types um
that's impressive the thing is even just
my my baby hash function i built it i'm
trying to think that through
yeah i mean nathaniel just mentioned um
it's using sipash so you can actually
look that up
okay what they're doing um so anyway
uh that's that's about as far as we
should go with hashing now
the only other part of hashing that we'd
want to deal with is
collisions and how do you deal with them
right so what happens when we actually
use this to
index into an array um and
that's how i mean
just intuition wise like how would you
operate with collisions like assuming
that
some of these values might hit the same
thing
you know so for example in our number of
letters
example it's very easy for us to get a
collision right
yes so
how would we deal with the fact that
like two things hashed the same number
like if we wanted to use this in an
index of
say like a database right because when
i'm building this thing it's gonna be
probably some sort of struct and you
know i'm gonna say like
i'll use my planets example so when i
had my planet thing
it was um we had
uh a care for the name just like
20 and then we had like you know the
afloat for the orbital period
um so if we build a list of planets
and say we wanted to build it for a hash
because we knew we were going to add
exoplanets and all sorts of other stuff
to it and it was going to get big
even though exoplanets would totally
change your hashing function because
they're usually named as like
really short letter number designations
so so
the type of hash function you're going
to use for that is very different than
the kind of hash function you would use
for like
celestial well i shouldn't say celestia
like solar
system type bodies named bodies yeah
named bodies um but let's just assume
for the moment we'll just
simplify it so we have like uh a hash of
these
um which is going to be a bunch of
pointers
to our database which are actually going
to go to the concrete individual records
and we'll just
allocate nine let's allocate a thousand
um
then and say we had this stuff all set
up
let's assume that two of these planets
like collide
okay hash wise i guess it'd be very bad
if they
collided otherwise because you know if
we were using
the number of letters we had venus and
we had earth
all right we got a collision right there
um so
if they both end up mapping to 5
then i can't just do a simple assignment
right i can't say like db
of 5 which is really like
hash of earth
uh or passion earth
yeah what would you do to handle
collisions like something
and db of hash of venus
is something i'm going to end up
overwriting
the same value using this hash function
we just described
so if you have one that so so here's my
thought but i think it's wrong
is that my first thought is that you can
have some kind of contingency for
if the value if the hash value already
exists right but that kind of when
you're talking about your three million
things
well now you're talking about another
kind of search
already in there yeah that's good
intuition that doesn't seem right
because
my first thought was okay well just say
that if it already exists
then you can like set
the hash of venus either to something
else or use a different hash function or
but all of those have serious problems
with them yeah
because it's predicated on another
like a nested sort in a way so you are
going to have to adopt a convention
you're you're right in your intuition
like you're going to need
some way to deal with collisions okay so
um
if i keep the name in the hash i can
actually
look at the entry and say was this i
don't have to trust the hash
implicitly like you like
venus equals like venus one or something
so that like you're equating the hashes
of like
you're slightly changing the name in
like a predictable fashion
that would be one option but now i'm
messing around with the key space but
yeah now you're messing with the key
okay so that would be
one way to go um okay a kind of simpler
way to go
is override these entries and allow them
all to be arrays
so instead of defining our data
structure the way we just did
we could say that each one of these
database entries is actually
another array let's say however many
collisions we want to allow
so we could say that like once you
get to that particular place make sure
that
you've actually got the right planet you
know so so then we might use
you know because before we were avoiding
going through the entire list because
the list might be
large now we've sort of cut it down like
we know that this is a five letter
planet so just look in the five-letter
planets bucket
you know and say this is the bucketing
okay this is the bucketing approach
right so this is basically
and this could be implemented as a
linked list um if you want an infinite
number
you you could you could implement this
as a fixed size i'm trying to give a
simpler example here so
i'm going to say that there's at most
five entries so what you would do the
convention we could adopt is
look in db at the hash value and then go
through
each of those until you find the right
one that'll still cut down our search
time a lot even though our hash function
has collisions now the pathological case
would be
all things hash to one value we have
such an amazingly bad
hashing function that every and this by
the way does happen in computer science
so like
it wait how i don't want to get into how
it's just
like we could have a terrible terrible
hashing function that everything hashes
to the same thing
and even if we made it a linked list now
we're just gonna have to traverse the
entire list anyway so our hashing
okay save does nothing but um
this is like kind of where you get into
like the how good is your hash function
like how much does it spread the data
out and that's by the way that
like if two if two labels are really
close to each other but you want their
hash functions to be
their hash values to be far apart that's
called avalanche like so you're
you you could say that your your hashing
function has the avalanche property
because it it
spreads things that are like like so say
it's like venus and v
news i don't know then like
okay like if if those compute to almost
the same number
that's not avalanche so if they compute
to like
really different numbers like there's
wildly distributed in the index space
that would be avalanche so you can it
reminds me a little bit and i know i
know it's
different but i'm just saying like some
of those concepts remind me a little bit
of
like decision trees and entropy and
how you are looking for different ways
to when you split your data
looking at the balance of data with
respect to different characteristics
so it's interesting hash equals seven
always okay so oh god you could have
this would be a pathological
bad case of a hash function because oh
no dennis this is what's gonna go down
in history associated with your name
no no this one dennis would never
there's actually a funny uh
he's kind of he may or may not
consciously be doing it but he's sort of
playing off an xkcd that
like you know says like okay you know
get random number
and it just always returns four
like how could you prove that that's not
random you know
it might return something else you know
cosmic ray could hit the computer
anyway so um this is not really the last
miles
so i'm not not not to pick up those um
so
anyway bucketing is one approach that
you can use
um another approach is uh
called the next like you know kind of
like like go forward
that just pushed the oh i just wanted to
copy
no that's okay i didn't realize it did
that actually um
my bad no no you're fine uh so the the
next approach is basically
if there is a collision like initialize
everything with sort of null
and if you hash and you try to put
something in there
and there's already something there just
add one to it
okay and just keep adding one until you
get an empty spot
okay so those are the two most
common strategies for dealing with hash
collisions
i don't think there's any point in
implementing them i think it's probably
fine to pseudo code
how do you feel i mean i'm following you
yeah no we're good
okay well i think that pretty much wraps
up hashing
okay