G KWTB UNNH NUEWSO BNZB TE BOPB BON VNBBNMT PMN HGTBMGAWBNH GU P IPUUNM BJFGYPV ED
BON NUSVGTO VPUSWPSN TE BOPB GB YPU AN AMEQNU WTGUS TBPBGTBGYPV INBOEHT.
UNNHVNTT BE TPJ, FPMPSMPFOT BOPB PREGH P FPMBGYWVPM VNBBNM XGVV AN IEMN
YEIFVGYPBNH.
BON VEUSNM JEWM BNZB, BON IEMN EARGEWT BONTN FPBBNMUT XGVV ANYEIN. DEM
NZPIFVN, GD JEW XNMN BE PUPVJTN PU NUBGMN AEEQ, JEW XGVV TNN P RNMJ
BJFGYPV HGTBMGAWBGEU.
Look at the text above. You are told that it is in English, and that it was
encrypted with a simple substitution cipher. The text is also in the
description below. Here, the focus is on breaking the cipher. Do take a look at
the video above for an explanation.
Breaking the cipher means we're not starting with the key, i.e., we don't know
which cipher letter maps to which plain letter.
There are several ways to start with this. One of the easiest is to start with
statistics: just count the number of times each letter appears in the text.
To use those counts, you need to know how often each letter normally appears in
English. You can find this information on Wikipedia: go to the article on
letter frequency. It shows you the relative frequency of letters in English.
There are two sets of statistics, for texts and for dictionaries. We need to
look at texts, because we are trying to decipher a paragraph.
What we see right away is that the letter with the highest frequency is E,
which accounts for about 13% of texts. The next highest are T and A, which
account for 9.1% and 8.2% of texts respectively.
In order to
apply this information, let’s put our text into a frequency counter. You can
find several of these online, or you can make your own. Or you can count it
yourself, I won’t judge.
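If you would rather make your own counter, here is a minimal sketch in Python. The function name letter_frequencies and the short sample string are only illustrative; paste in the full cipher text to get the complete table.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Count how often each letter A-Z appears, ignoring case, spaces and punctuation."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    # Include letters that never appear, so the table always has 26 rows.
    return {letter: counts.get(letter, 0) for letter in string.ascii_uppercase}

# A short fragment of the cipher text, used here only as sample input.
sample = "BON VNBBNMT PMN"
freqs = letter_frequencies(sample)

# Print the three most frequent letters in the sample, highest first.
for letter, count in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(letter, count)
```

Returning a row for every letter, including the ones with a count of zero, makes it easy to see which letters are missing from the text altogether.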
The frequencies I get are shown on the screen:
Letter | Count | Letter | Count
A | 8 | N | 44
B | 38 | O | 12
C | 0 | P | 28
D | 3 | Q | 2
E | 23 | R | 3
F | 8 | S | 7
G | 23 | T | 20
H | 8 | U | 17
I | 7 | V | 19
J | 8 | W | 11
K | 1 | X | 4
L | 0 | Y | 8
M | 20 | Z | 3
As you can see, N has the highest frequency. So a safe bet would be to assume
that it stands for E. Replacing every N with a lower case e gives:
G KWTB UeeH eUEWSO BeZB TE BOPB BOe VeBBeMT PMe HGTBMGAWBeH GU P IPUUeM BJFGYPV
ED BOe eUSVGTO VPUSWPSe TE BOPB GB YPU Ae AMEQeU WTGUS TBPBGTBGYPV IeBOEHT.
UeeHVeTT BE TPJ, FPMPSMPFOT BOPB PREGH P FPMBGYWVPM VeBBeM XGVV Ae IEMe
YEIFVGYPBeH.
BOe VEUSeM JEWM BeZB, BOe IEMe EARGEWT BOeTe FPBBeMUT XGVV AeYEIe. DEM
eZPIFVe, GD JEW XeMe BE PUPVJTe PU eUBGMe AEEQ, JEW XGVV Tee P ReMJ
BJFGYPV HGTBMGAWBGEU.
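Yes, I did a case-sensitive search and replace, which you should be able to do
with any good word processor. Upper case letters are still cipher text; lower
case letters are the ones we have already replaced.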
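If you prefer a script to a word processor, the same case-sensitive replacement can be sketched in Python. The helper name apply_guesses is made up for this example. It assumes you always start from the original upper case cipher text and pass in every guess made so far.

```python
def apply_guesses(ciphertext, guesses):
    """Replace each guessed cipher letter (upper case) with its plain letter in lower case.

    Letters without a guess are left untouched, so upper case always means
    'still encrypted' and lower case means 'already replaced'.
    """
    out = []
    for ch in ciphertext:
        if ch in guesses:
            out.append(guesses[ch].lower())
        else:
            out.append(ch)
    return "".join(out)

# A fragment of the cipher text with the first guess, N = E.
print(apply_guesses("BON VNBBNMT PMN", {"N": "E"}))  # BOe VeBBeMT PMe
```

Working from the original text each time avoids a classic mistake: if you replace letters in place without changing case, a later replacement can overwrite a letter you have already decoded.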
One thing you might notice immediately is the triplet BOe. It appears four
times in the short text. It could be 'the'. In order to confirm that, we can
look at our letter frequencies. T is the second most common letter in English,
and as expected, B has a frequency of 38 and is the second most numerous
letter in our text. It would be safe to assume B = T, and therefore that
BOe = the. Let's replace B with T and O with H.
G KWTt UeeH eUEWSh teZt TE thPt the VetteMT PMe HGTtMGAWteH GU P IPUUeM tJFGYPV
ED the eUSVGTh VPUSWPSe TE thPt Gt YPU Ae AMEQeU WTGUS TtPtGTtGYPV IethEHT.
UeeHVeTT tE TPJ, FPMPSMPFhT thPt PREGH P FPMtGYWVPM VetteM XGVV Ae IEMe
YEIFVGYPteH.
the VEUSeM JEWM teZt, the IEMe EARGEWT theTe FPtteMUT XGVV AeYEIe. DEM
eZPIFVe, GD JEW XeMe tE PUPVJTe PU eUtGMe AEEQ, JEW XGVV Tee P ReMJ
tJFGYPV HGTtMGAWtGEU.
Here, 'thPt' looks a lot like ‘that’. You might also
notice that P appears by itself quite a lot. It has a frequency of 28, which is
quite high. So, P = A looks like a good assumption.
G KWTt UeeH eUEWSh teZt TE that the VetteMT aMe HGTtMGAWteH GU a IaUUeM tJFGYaV
ED the eUSVGTh VaUSWaSe TE that Gt YaU Ae AMEQeU WTGUS TtatGTtGYaV IethEHT.
UeeHVeTT tE TaJ, FaMaSMaFhT that aREGH a FaMtGYWVaM VetteM XGVV Ae IEMe
YEIFVGYateH.
the VEUSeM JEWM teZt, the IEMe EARGEWT theTe FatteMUT XGVV AeYEIe. DEM
eZaIFVe, GD JEW XeMe tE aUaVJTe aU eUtGMe AEEQ, JEW XGVV Tee a ReMJ
tJFGYaV HGTtMGAWtGEU.
The only
other standalone letter that appears frequently in English is I, and the only
other standalone letter in the cipher is G. The frequency is 23, which is good
enough to proceed. Another interesting observation is ‘tE’, and that can only
be ‘to’. Frequency for ‘E’ is 23, which is good enough. Let’s replace G with I and
E with O.
i KWTt UeeH eUoWSh teZt To that the VetteMT aMe HiTtMiAWteH iU a IaUUeM tJFiYaV
oD the eUSViTh VaUSWaSe To that it YaU Ae AMoQeU WTiUS TtatiTtiYaV IethoHT.
UeeHVeTT to TaJ, FaMaSMaFhT that aRoiH a FaMtiYWVaM VetteM XiVV Ae IoMe
YoIFViYateH.
the VoUSeM JoWM teZt, the IoMe oARioWT theTe FatteMUT XiVV AeYoIe. DoM
eZaIFVe, iD JoW XeMe to aUaVJTe aU eUtiMe AooQ, JoW XiVV Tee a ReMJ
tJFiYaV HiTtMiAWtioU.
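Spotting repeated short words, such as the triplet that turned out to be 'the' or the standalone letters P and G, can also be done with a small counter. This is a minimal sketch; the function name common_words is made up for this example, and the sample is just the first sentence of the cipher text.

```python
from collections import Counter
import re

def common_words(text, length, top=3):
    """Return the most common words of the given length, with their counts."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(w for w in words if len(w) == length).most_common(top)

# First sentence of the original cipher text.
sample = (
    "G KWTB UNNH NUEWSO BNZB TE BOPB BON VNBBNMT PMN HGTBMGAWBNH GU P IPUUNM "
    "BJFGYPV ED BON NUSVGTO VPUSWPSN TE BOPB GB YPU AN AMEQNU WTGUS "
    "TBPBGTBGYPV INBOEHT."
)

print(common_words(sample, 3, top=1))  # the most repeated three-letter word
print(common_words(sample, 1))         # standalone letters, candidates for A and I
```

Single-letter words are especially useful in English, because they are almost always A or I.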
The cipher letter T looks pretty consistent with S: you have 'To' and 'Tee'
(so, see), a frequency of 20, and quite a few words that begin with it. Let's
go ahead and replace it. At this point, this is mostly an art. The result is:
i KWst UeeH eUoWSh teZt so that the VetteMs aMe HistMiAWteH iU a IaUUeM tJFiYaV
oD the eUSVish VaUSWaSe so that it YaU Ae AMoQeU WsiUS statistiYaV IethoHs.
UeeHVess to saJ, FaMaSMaFhs that aRoiH a FaMtiYWVaM VetteM XiVV Ae IoMe
YoIFViYateH.
the VoUSeM JoWM teZt, the IoMe oARioWs these FatteMUs XiVV AeYoIe. DoM
eZaIFVe, iD JoW XeMe to aUaVJse aU eUtiMe AooQ, JoW XiVV see a ReMJ
tJFiYaV HistMiAWtioU.
Let's take another look at the frequency table. In English, N and R are two
high-frequency letters that we haven't accounted for yet. The cipher letter M
has a frequency of 20, and 'aMe' looks like a tell for 'are'. Let's assume M is
R. If you are wrong, you can always go back.
i KWst UeeH eUoWSh teZt so that the Vetters are HistriAWteH iU a IaUUer tJFiYaV
oD the eUSVish VaUSWaSe so that it YaU Ae AroQeU WsiUS statistiYaV IethoHs.
UeeHVess to saJ, FaraSraFhs that aRoiH a FartiYWVar Vetter XiVV Ae Iore
YoIFViYateH.
the VoUSer JoWr teZt, the Iore oARioWs these FatterUs XiVV AeYoIe. Dor
eZaIFVe, iD JoW Xere to aUaVJse aU eUtire AooQ, JoW XiVV see a RerJ
tJFiYaV HistriAWtioU.
From here on, you need to see if you can spot any probable words, and start
replacing letters. 'Vetters' could be 'letters', so V could be L. D could be F
('oD' and 'Dor'), F looks like it could be P, and U could be N. Let's replace
these, and see if it makes sense.
i KWst neeH enoWSh teZt so that the letters are HistriAWteH in a Ianner tJpiYal
of the enSlish lanSWaSe so that it Yan Ae AroQen WsinS statistiYal IethoHs.
neeHless to saJ, paraSraphs that aRoiH a partiYWlar letter Xill Ae Iore
YoIpliYateH.
the lonSer JoWr teZt, the Iore oARioWs these patterns Xill AeYoIe. for
eZaIple, if JoW Xere to analJse an entire AooQ, JoW Xill see a RerJ
tJpiYal HistriAWtion.
Now, S can be replaced with G, W with U, Y with C, H with D, and X with W.
These are based on common words you can identify from the text, such as
'enoWSh' (enough), 'neeH' (need), 'Yan' (can) and 'Xill' (will).
i Kust need enough teZt so that the letters are distriAuted in a Ianner tJpical
of the english language so that it can Ae AroQen using statistical Iethods.
needless to saJ, paragraphs that aRoid a particular letter will Ae Iore
coIplicated.
the longer Jour teZt, the Iore oARious these patterns will AecoIe. for
eZaIple, if Jou were to analJse an entire AooQ, Jou will see a RerJ
tJpical distriAution.
From observation, we can now get Z = X, A = B, J = Y, K = J, I = M, R = V. Do
the replacements.
i just need enough text so that the letters are distributed in a manner typical
of the english language so that it can be broQen using statistical methods.
needless to say, paragraphs that avoid a particular letter will be more
complicated.
the longer your text, the more obvious these patterns will become. for
example, if you were to analyse an entire booQ, you will see a very
typical distribution.
By now,
it’s obvious that ‘Q’ = K. Just replace it, and you have your deciphered
message.
i just need enough text so that the letters are distributed in a manner typical
of the english language so that it can be broken using statistical methods.
needless to say, paragraphs that avoid a particular letter will be more
complicated.
the longer your text, the more obvious these patterns will become. for
example, if you were to analyse an entire book, you will see a very
typical distribution.
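Putting all the replacements from this walkthrough together gives the key. The sketch below collects them in a Python dictionary and decodes with str.translate. The names KEY and decode are made up for this example. The cipher letters C and L never appear in the text, so their plain letters are unknown and they are left out.

```python
# Cipher letter -> plain letter, in the order they were worked out above.
KEY = {
    "N": "e", "B": "t", "O": "h", "P": "a", "G": "i", "E": "o",
    "T": "s", "M": "r", "V": "l", "D": "f", "F": "p", "U": "n",
    "S": "g", "W": "u", "Y": "c", "H": "d", "X": "w", "Z": "x",
    "A": "b", "J": "y", "K": "j", "I": "m", "R": "v", "Q": "k",
}

def decode(ciphertext, key=KEY):
    """Decode upper case cipher text; characters not in the key pass through unchanged."""
    return ciphertext.translate(str.maketrans(key))

print(decode("G KWTB UNNH NUEWSO BNZB"))  # i just need enough text
```

A quick sanity check on any key you build this way: no two cipher letters should map to the same plain letter.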
This method
works provided your message is long enough. If your message is not in English,
you need the frequency tables for the language you’re working with.
If you know
you’re working with a Caesar cipher, the first letter you definitively break
will give you the key right away. Use that to find your shift, and just decode
the whole message normally. If you can’t find anything definitive, you can just
break the whole message normally anyway.
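For the Caesar case, one confirmed letter pair is enough to recover the shift. Here is a minimal sketch; the function names are made up, and "KHOOR ZRUOG" is a made-up example, not the text from this post (that text is not a Caesar cipher, since N = E and B = T would need different shifts).

```python
def caesar_shift(cipher_letter, plain_letter):
    """Work out the shift from one known cipher/plain letter pair."""
    return (ord(cipher_letter.upper()) - ord(plain_letter.upper())) % 26

def caesar_decode(ciphertext, shift):
    """Shift every ASCII letter back by `shift`, keeping case and punctuation."""
    out = []
    for ch in ciphertext:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Suppose frequency analysis tells us the cipher letter H stands for E.
shift = caesar_shift("H", "E")
print(shift, caesar_decode("KHOOR ZRUOG", shift))  # 3 HELLO WORLD
```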
Go ahead
and try this, and let me know how it goes.
See you next time!