Friday, 8 January 2021

How to break a simple substitution cipher

 



G KWTB UNNH NUEWSO BNZB TE BOPB BON VNBBNMT PMN HGTBMGAWBNH GU P IPUUNM BJFGYPV ED

BON NUSVGTO VPUSWPSN TE BOPB GB YPU AN AMEQNU WTGUS TBPBGTBGYPV INBOEHT.

UNNHVNTT BE TPJ, FPMPSMPFOT BOPB PREGH P FPMBGYWVPM VNBBNM XGVV AN IEMN

YEIFVGYPBNH.

BON VEUSNM JEWM BNZB, BON IEMN EARGEWT BONTN FPBBNMUT XGVV ANYEIN. DEM

NZPIFVN, GD JEW XNMN BE PUPVJTN PU NUBGMN AEEQ, JEW XGVV TNN P RNMJ

BJFGYPV HGTBMGAWBGEU.


Look at the text above. You are given that it is in English, and that it is encrypted with a simple substitution cipher.

Here, the focus is on breaking the cipher. Do take a look at the video above for an explanation. 

Breaking the cipher means we’re not starting with the key; that is, we don’t know which letter maps to which.

There are several ways to start. One of the easiest is to use statistics – just count the number of times each letter appears in the text.

You can find this information on Wikipedia – go to the article on letter frequency. It shows you the relative frequency of letters in English. There are two sets of statistics, for texts and for dictionaries. We need to look at texts, because we are trying to decipher a paragraph.

What we see right away is that the letter with the highest frequency is E, which accounts for about 13% of letters in typical texts. The next highest are T and A, which account for about 9.1% and 8.2% respectively.

In order to apply this information, let’s put our text into a frequency counter. You can find several of these online, or you can make your own. Or you can count it yourself, I won’t judge.
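If you would rather make your own counter, a minimal sketch in Python could look like the following. The function name `letter_counts` is my own choice for illustration, and the sample string is only a short fragment of the ciphertext, so its counts will not match the full table.

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def letter_counts(text):
    """Count how often each letter A-Z appears, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.upper() if ch in ALPHABET)
    # Include letters that never appear so the result always has 26 entries.
    return {letter: counts.get(letter, 0) for letter in ALPHABET}


if __name__ == "__main__":
    sample = "BON NUSVGTO VPUSWPSN"  # a short fragment of the ciphertext
    counts = letter_counts(sample)
    # Print letters from most to least frequent, skipping unused ones.
    for letter, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        if n:
            print(letter, n)
```

Paste the whole ciphertext into `sample` to count the full message.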

The frequencies I get are shown below:

Letter  Count    Letter  Count
A       8        N       44
B       38       O       12
C       0        P       28
D       3        Q       2
E       23       R       3
F       8        S       7
G       23       T       20
H       8        U       17
I       7        V       19
J       8        W       11
K       1        X       4
L       0        Y       8
M       20       Z       3

 

As you can see, N has the highest frequency. So a safe bet would be to assume that it is E.

G KWTB UeeH eUEWSO BeZB TE BOPB BOe VeBBeMT PMe HGTBMGAWBeH GU P IPUUeM BJFGYPV

ED BOe eUSVGTO VPUSWPSe TE BOPB GB YPU Ae AMEQeU WTGUS TBPBGTBGYPV IeBOEHT.

UeeHVeTT BE TPJ, FPMPSMPFOT BOPB PREGH P FPMBGYWVPM VeBBeM XGVV Ae IEMe

YEIFVGYPBeH.

BOe VEUSeM JEWM BeZB, BOe IEMe EARGEWT BOeTe FPBBeMUT XGVV AeYEIe. DEM

eZPIFVe, GD JEW XeMe BE PUPVJTe PU eUBGMe AEEQ, JEW XGVV Tee P ReMJ

BJFGYPV HGTBMGAWBGEU.

Yes, I did a case-sensitive search and replace, which you should be able to do in any good word processor. Uppercase letters are still ciphertext; lowercase letters are my plaintext guesses.
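The same case-sensitive replacement can be sketched in a few lines of Python. This is only an illustration, assuming ciphertext is kept in uppercase and guesses in lowercase; `apply_partial_key` is a name I made up for it.

```python
def apply_partial_key(text, key):
    """Replace each uppercase cipher letter in `key` with its lowercase guess.

    Letters that are not in `key` are left alone, so unsolved letters stay
    uppercase and earlier (lowercase) guesses are never touched again.
    """
    table = str.maketrans({c.upper(): p.lower() for c, p in key.items()})
    return text.translate(table)


if __name__ == "__main__":
    step1 = apply_partial_key("BON NUSVGTO VPUSWPSN", {"N": "e"})
    print(step1)
    # Later guesses can be applied on top of the earlier result.
    print(apply_partial_key(step1, {"B": "t", "O": "h"}))
```

Because guesses are lowercase and ciphertext is uppercase, a cipher letter such as E can never be confused with a plaintext e.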

One thing you might notice immediately is the triplet BOe. It appears four times as a word in this short text, so it could be ‘the’. To confirm that, we can look at our letter frequencies. As expected, B has a count of 38, making it the second most frequent letter, which matches T in English. It would be safe to assume B = T, and therefore that BOe = the. Let’s replace B with T and O with H.
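Hunting for repeated short words like BOe can also be done with a small script. The sketch below is one possible way, not the method used in the video; `common_words` is a hypothetical helper name.

```python
import re
from collections import Counter


def common_words(text, length, top=5):
    """Return the most frequent words of a given length as (word, count) pairs."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(w for w in words if len(w) == length).most_common(top)


if __name__ == "__main__":
    # First sentence of the ciphertext after replacing N with e.
    text = (
        "G KWTB UeeH eUEWSO BeZB TE BOPB BOe VeBBeMT PMe HGTBMGAWBeH GU P "
        "IPUUeM BJFGYPV ED BOe eUSVGTO VPUSWPSe TE BOPB GB YPU Ae AMEQeU "
        "WTGUS TBPBGTBGYPV IeBOEHT."
    )
    print(common_words(text, 3))  # three-letter words
    print(common_words(text, 1))  # one-letter words
```

Setting `length` to 1 lists the standalone letters, which are usually A or I in English.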

G KWTt UeeH eUEWSh teZt TE thPt the VetteMT PMe HGTtMGAWteH GU P IPUUeM tJFGYPV

ED the eUSVGTh VPUSWPSe TE thPt Gt YPU Ae AMEQeU WTGUS TtPtGTtGYPV IethEHT.

UeeHVeTT tE TPJ, FPMPSMPFhT thPt PREGH P FPMtGYWVPM VetteM XGVV Ae IEMe

YEIFVGYPteH.

the VEUSeM JEWM teZt, the IEMe EARGEWT theTe FPtteMUT XGVV AeYEIe. DEM

eZPIFVe, GD JEW XeMe tE PUPVJTe PU eUtGMe AEEQ, JEW XGVV Tee P ReMJ

tJFGYPV HGTtMGAWtGEU.

 

Here, 'thPt' looks a lot like ‘that’. You might also notice that P appears by itself quite a lot. It has a frequency of 28, which is quite high. So, P = A looks like a good assumption.

G KWTt UeeH eUEWSh teZt TE that the VetteMT aMe HGTtMGAWteH GU a IaUUeM tJFGYaV

ED the eUSVGTh VaUSWaSe TE that Gt YaU Ae AMEQeU WTGUS TtatGTtGYaV IethEHT.

UeeHVeTT tE TaJ, FaMaSMaFhT that aREGH a FaMtGYWVaM VetteM XGVV Ae IEMe

YEIFVGYateH.

the VEUSeM JEWM teZt, the IEMe EARGEWT theTe FatteMUT XGVV AeYEIe. DEM

eZaIFVe, GD JEW XeMe tE aUaVJTe aU eUtGMe AEEQ, JEW XGVV Tee a ReMJ

tJFGYaV HGTtMGAWtGEU.

The only other one-letter word that is common in English is I, and the only other standalone letter in the cipher is G. Its count is 23, which is high enough to proceed. Another interesting observation is ‘tE’, which is almost certainly ‘to’. The count for E is also 23, which fits. Let’s replace G with I and E with O.

i KWTt UeeH eUoWSh teZt To that the VetteMT aMe HiTtMiAWteH iU a IaUUeM tJFiYaV

oD the eUSViTh VaUSWaSe To that it YaU Ae AMoQeU WTiUS TtatiTtiYaV IethoHT.

UeeHVeTT to TaJ, FaMaSMaFhT that aRoiH a FaMtiYWVaM VetteM XiVV Ae IoMe

YoIFViYateH.

the VoUSeM JoWM teZt, the IoMe oARioWT theTe FatteMUT XiVV AeYoIe. DoM

eZaIFVe, iD JoW XeMe to aUaVJTe aU eUtiMe AooQ, JoW XiVV Tee a ReMJ

tJFiYaV HiTtMiAWtioU.

‘T’ looks pretty consistent with ‘S’, as you have ‘To’, ‘Tee’, a frequency of 20, and quite a few words that begin with it. Let’s go ahead and replace it. At this point, this is mostly an art. The result is:

i KWst UeeH eUoWSh teZt so that the VetteMs aMe HistMiAWteH iU a IaUUeM tJFiYaV

oD the eUSVish VaUSWaSe so that it YaU Ae AMoQeU WsiUS statistiYaV IethoHs.

UeeHVess to saJ, FaMaSMaFhs that aRoiH a FaMtiYWVaM VetteM XiVV Ae IoMe

YoIFViYateH.

the VoUSeM JoWM teZt, the IoMe oARioWs these FatteMUs XiVV AeYoIe. DoM

eZaIFVe, iD JoW XeMe to aUaVJse aU eUtiMe AooQ, JoW XiVV see a ReMJ

tJFiYaV HistMiAWtioU.

 

Let’s take another look at the frequency table. In English, N and R are two high-frequency letters that we haven’t accounted for yet. The cipher letter M has a count of 20, and ‘aMe’ looks like a tell for ‘are’. Let’s assume M = R. If that turns out to be wrong, you can always go back.

i KWst UeeH eUoWSh teZt so that the Vetters are HistriAWteH iU a IaUUer tJFiYaV

oD the eUSVish VaUSWaSe so that it YaU Ae AroQeU WsiUS statistiYaV IethoHs.

UeeHVess to saJ, FaraSraFhs that aRoiH a FartiYWVar Vetter XiVV Ae Iore

YoIFViYateH.

the VoUSer JoWr teZt, the Iore oARioWs these FatterUs XiVV AeYoIe. Dor

eZaIFVe, iD JoW Xere to aUaVJse aU eUtire AooQ, JoW XiVV see a RerJ

tJFiYaV HistriAWtioU. 

From here on, you need to spot probable words and start replacing letters. ‘Vetters’ could be ‘letters’, so V = L. ‘D’ could be F (from ‘oD’ and ‘Dor’), ‘F’ looks like it could be P (‘FaraSraFhs’), and ‘U’ could be N (‘iU’, ‘aU’). Let’s replace these, and see if it makes sense.

i KWst neeH enoWSh teZt so that the letters are HistriAWteH in a Ianner tJpiYal

of the enSlish lanSWaSe so that it Yan Ae AroQen WsinS statistiYal IethoHs.

neeHless to saJ, paraSraphs that aRoiH a partiYWlar letter Xill Ae Iore

YoIpliYateH.

the lonSer JoWr teZt, the Iore oARioWs these patterns Xill AeYoIe. for

eZaIple, if JoW Xere to analJse an entire AooQ, JoW Xill see a RerJ

tJpiYal HistriAWtion.
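Spotting probable words, as with ‘Vetters’ earlier, can be partly automated by checking partly solved words against a word list. The sketch below is my own illustration: `fits` is a hypothetical helper, and the tiny word list is hand-typed, so in practice you would load a real dictionary file.

```python
def fits(partial, candidate):
    """Check whether `candidate` could be the plaintext of a partly solved word.

    Lowercase letters in `partial` are solved and must match exactly.
    Uppercase letters are unsolved cipher letters: each one must stand for a
    single plaintext letter, no two may share a plaintext letter, and none
    may stand for a letter that is already solved in this word.
    """
    candidate = candidate.lower()
    if len(partial) != len(candidate):
        return False
    solved = {p for p in partial if p.islower()}
    mapping = {}
    for p, c in zip(partial, candidate):
        if p.islower():
            if p != c:
                return False
        elif c in solved or mapping.setdefault(p, c) != c:
            return False
    # Two different cipher letters cannot decode to the same plaintext letter.
    return len(set(mapping.values())) == len(mapping)


if __name__ == "__main__":
    words = ["letters", "betters", "matters", "tetters", "enough", "will", "wild"]
    for partial in ["Vetters", "eUoWSh", "XiVV"]:
        print(partial, [w for w in words if fits(partial, w)])
```

A pattern can match more than one word (both ‘letters’ and ‘betters’ fit ‘Vetters’), so context and letter frequencies still have to settle the choice.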

Now, ‘S’ can be replaced with G, ‘W’ with U, ‘Y’ with C, ‘H’ with D, and ‘X’ with W. These are based on common words you can identify in the text, such as ‘enoWSh’ (enough), ‘Yan’ (can), ‘neeH’ (need) and ‘Xill’ (will).

i Kust need enough teZt so that the letters are distriAuted in a Ianner tJpical

of the english language so that it can Ae AroQen using statistical Iethods.

needless to saJ, paragraphs that aRoid a particular letter will Ae Iore

coIplicated.

the longer Jour teZt, the Iore oARious these patterns will AecoIe. for

eZaIple, if Jou were to analJse an entire AooQ, Jou will see a RerJ

tJpical distriAution. 

From observation, we can now get ‘Z’ = X, ‘A’ = B, ‘J’ = Y, ‘K’ = J, ‘I’ = M, and ‘R’ = V. Do the replacements.

i just need enough text so that the letters are distributed in a manner typical

of the english language so that it can be broQen using statistical methods.

needless to say, paragraphs that avoid a particular letter will be more

complicated.

the longer your text, the more obvious these patterns will become. for

example, if you were to analyse an entire booQ, you will see a very

typical distribution. 

By now, it’s obvious that ‘Q’ = K. Just replace it, and you have your deciphered message.

i just need enough text so that the letters are distributed in a manner typical

of the english language so that it can be broken using statistical methods.

needless to say, paragraphs that avoid a particular letter will be more

complicated.

the longer your text, the more obvious these patterns will become. for

example, if you were to analyse an entire book, you will see a very

typical distribution.
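To double-check the result, the substitutions from each step can be collected into one key and applied to the original ciphertext. This is a small sketch; the names `KEY` and `decode` are mine, and the mapping is simply the list of replacements made above.

```python
# Cipher letter -> plaintext letter, collected from the steps above.
# C and L never appear in the ciphertext, so they remain unknown.
KEY = {
    "N": "e", "B": "t", "O": "h", "P": "a", "G": "i", "E": "o",
    "T": "s", "M": "r", "V": "l", "D": "f", "F": "p", "U": "n",
    "S": "g", "W": "u", "Y": "c", "H": "d", "X": "w", "Z": "x",
    "A": "b", "J": "y", "K": "j", "I": "m", "R": "v", "Q": "k",
}


def decode(ciphertext, key=KEY):
    """Decode uppercase ciphertext; spaces and punctuation pass through unchanged."""
    return ciphertext.translate(str.maketrans(key))


if __name__ == "__main__":
    print(decode("G KWTB UNNH NUEWSO BNZB"))
    print(decode("BON NUSVGTO VPUSWPSN"))
```

Since each plaintext letter appears only once in the key, the same table could be inverted to encrypt a new message.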

 

This method works provided your message is long enough. If your message is not in English, you need the frequency tables for the language you’re working with.

If you know you’re working with a Caesar cipher, the first letter you definitively break gives you the key right away. Use it to find the shift, then decode the whole message with that shift. If you can’t find anything definitive, you can still break the message letter by letter using the frequency method above.
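The Caesar shortcut can be sketched as follows. The function names and the sample message are my own and are not taken from the text above; the sketch assumes a plain A to Z alphabet.

```python
def caesar_shift(cipher_letter, plain_letter):
    """Work out the shift from one known cipher/plaintext letter pair."""
    return (ord(cipher_letter.upper()) - ord(plain_letter.upper())) % 26


def caesar_decode(ciphertext, shift):
    """Shift every letter back by `shift`, leaving other characters alone."""
    out = []
    for ch in ciphertext:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Suppose frequency analysis told us that cipher H stands for plaintext E.
    shift = caesar_shift("H", "E")
    print(shift, caesar_decode("KHOOR, ZRUOG!", shift))
```

One confirmed letter pair is enough here because a Caesar cipher uses the same shift for every letter.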

Go ahead and try this, and let me know how it goes.

You can follow me on Facebook here or on YouTube here.

 See you next time!



