Wednesday 4 October 2017

Simple Substitution Ciphers

To start with, I will not be discussing the history of ciphers in much depth here. The main focus of this article is the use of simple substitution in a fictional setting. I dabble in writing myself, and for me, personally, this is one of the most entertaining areas in detective fiction.

How does simple substitution work?

It's exactly what it says on the tin - you simply substitute a character in the plaintext with a different character. For example, if you use H=K, E=H, L=O, O=R, HELLO will be coded as KHOOR. To decode, you simply reverse the substitution (i.e., K=H, H=E, O=L, R=O).
The cipher used here is the Caesar cipher, which involves shifting the alphabet by a fixed number of characters, in this case 3. You can read more about it in the Wikipedia article.
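The HELLO example above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name caesar is my own, and it assumes the text is uppercase A-Z, with everything else (spaces, punctuation) passed through unchanged.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar(text: str, shift: int) -> str:
    """Shift every uppercase letter by `shift` places, wrapping around at Z."""
    k = shift % 26
    shifted = ALPHABET[k:] + ALPHABET[:k]
    # Characters outside A-Z are not in the table and stay as they are.
    return text.translate(str.maketrans(ALPHABET, shifted))


encoded = caesar("HELLO", 3)
decoded = caesar(encoded, -3)  # decoding is just the opposite shift
print(encoded, decoded)  # KHOOR HELLO
```

Decoding with a negative shift works because shifting by -3 is the same as shifting by 23.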

The possibilities

There are many ways you can play with this simple cipher. Rather than use a fixed shift like in the Caesar cipher, you can invert the alphabet, or randomly distribute the letters. It doesn't have to be limited to letters - you can use symbols just as easily (like the dancing men from 'The Adventure of the Dancing Men'). The resulting alphabets are given below.

Some possible cipher alphabets
To demonstrate, the text 'This is a sample text' is encoded using all 5 alphabets below. The spaces are left intact, but the text is entirely in upper case.

Caesar 3:                  WKLV LV D VDPSOH WHAW
Reverse alphabet:      GSRH RH Z HZNKOV GVCG
Random:                   LHPM PM C MADMWA LAXL
Symbols:                   &)<{ <{ - {-@_>; &;}&
Symbols 2:               [not recoverable here; originally an image]

Further, you can be creative with the alphabets. You can make an uppercase-lowercase distinction and increase the number of characters that must be decoded. If you're feeling particularly evil (or masochistic, depending on your point of view) you can choose to include numbers and punctuation marks in your code.
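All of these alphabets work the same way: line up the plain alphabet against a cipher alphabet and translate character by character. Here is a rough Python sketch of that idea. The helper names and the seed are my own choices for illustration, and the random alphabet it produces is not the one used in the table above.

```python
import random
import string

PLAIN = string.ascii_uppercase


def make_tables(cipher_alphabet: str):
    """Return (encode, decode) translation tables for a 26-character cipher alphabet."""
    if len(cipher_alphabet) != 26 or len(set(cipher_alphabet)) != 26:
        raise ValueError("need 26 distinct cipher characters")
    return (str.maketrans(PLAIN, cipher_alphabet),
            str.maketrans(cipher_alphabet, PLAIN))


# Reverse alphabet: A=Z, B=Y, C=X, ...
reverse_alphabet = PLAIN[::-1]

# Random alphabet: shuffle the letters (the seed is arbitrary, only for repeatability).
rng = random.Random(2017)
random_alphabet = "".join(rng.sample(PLAIN, 26))

message = "THIS IS A SAMPLE TEXT"
rev_enc, rev_dec = make_tables(reverse_alphabet)
rnd_enc, rnd_dec = make_tables(random_alphabet)

print(message.translate(rev_enc))  # GSRH RH Z HZNKOV GVCG
print(message.translate(rnd_enc))  # depends on the seed
```

A symbol alphabet works the same way: pass any string of 26 distinct characters to make_tables.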

Then there is the issue of handling spaces. Of course you can break the ciphertext into words, but that will make things a lot easier for anyone trying to break the code. A common tactic is to replace the space with a relatively uncommon letter ('Z', for example, if you're using English). You can omit spaces altogether, though that can cause confusion. You can include the space as one of the characters used in your code (resulting in random spaces throughout the text - which is good, but it can also result in double spaces, which, depending on the font, is bad). You can also use a different method to mark the end of a word, like the flag in the dancing men code, or a change of font for the last character of the word.
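Treating the space as just another character is easy to try out. The sketch below assumes a 27-character alphabet (A-Z plus the space) shuffled with an arbitrary seed; the variable names are mine, purely for illustration.

```python
import random
import string

SYMBOLS = string.ascii_uppercase + " "  # 27 characters: A-Z plus the space

rng = random.Random(15)  # arbitrary seed so the example is repeatable
shuffled = "".join(rng.sample(SYMBOLS, len(SYMBOLS)))

encode_table = str.maketrans(SYMBOLS, shuffled)
decode_table = str.maketrans(shuffled, SYMBOLS)

message = "THIS IS A SAMPLE TEXT"
ciphertext = message.translate(encode_table)
print(ciphertext)                          # spaces now land wherever the key puts them
print(ciphertext.translate(decode_table))  # THIS IS A SAMPLE TEXT

# The simpler alternative: drop the spaces before encoding.
print(message.replace(" ", ""))            # THISISASAMPLETEXT
```

Because the space is now part of the key, the word boundaries in the ciphertext no longer tell the codebreaker anything.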

A further layer of confusion can be added by forgoing a simple one-to-one mapping - for example by mapping a relatively common character ('E' or the space) to two or more relatively uncommon characters ('Z' or the comma). It should be noted that this can confuse the decoder as well if these characters are not chosen carefully.
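To make the one-to-many idea concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the base key is the reversed alphabet, and 'E' and the space each get two stand-ins taken from digits and punctuation so that they cannot collide with a letter the base key already produces.

```python
import random
import string

PLAIN = string.ascii_uppercase
# Base key: the reversed alphabet (A=Z, B=Y, ...).
BASE_KEY = dict(zip(PLAIN, PLAIN[::-1]))
# Hypothetical stand-ins: 'E' and the space each map to two symbols.
HOMOPHONES = {"E": "37", " ": "_/"}


def encode(text, rng):
    out = []
    for ch in text:
        if ch in HOMOPHONES:
            out.append(rng.choice(HOMOPHONES[ch]))  # pick one stand-in at random
        else:
            out.append(BASE_KEY.get(ch, ch))
    return "".join(out)


def decode(ciphertext):
    reverse = {cipher: plain for plain, cipher in BASE_KEY.items()}
    for plain, stand_ins in HOMOPHONES.items():
        for symbol in stand_ins:
            reverse[symbol] = plain  # every stand-in decodes to the same character
    return "".join(reverse.get(ch, ch) for ch in ciphertext)


rng = random.Random(7)  # arbitrary seed so the example is repeatable
secret = encode("MEET ME HERE", rng)
print(secret)
print(decode(secret))  # MEET ME HERE
```

The stand-ins split the count of 'E' between two symbols, so neither one stands out as much in a frequency count.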

Decoding

The Achilles heel of substitution ciphers is that they leave the letter frequency intact. For example, in English, the most common characters will likely be the space and 'E'. After that, 'T', 'A', 'O', 'I', etc. are quite common. This can change from text to text, so frequency alone isn't completely reliable if the amount of ciphertext you have is small.

In addition to this, the structure of the words can give you a lot of clues. For example, standalone letters are likely to be 'A', 'I', or even 'U'. If the same three letters keep reappearing, it's quite likely 'THE' (this is a bit of a stretch, but is a good working hypothesis).

If you're using a type of cipher which follows a regular pattern (inverting the alphabet or the aforementioned Caesar cipher), decoding a few letters could lead to the rest of the code unraveling like the fictional sweater (or not so fictional ribbon) once the person breaking the code realizes what's up.
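Spotting a regular pattern from a handful of decoded letters can be checked mechanically. The sketch below is only an illustration with made-up helper names: it takes a few (cipher letter, plain letter) pairs and tests whether they all fit one fixed shift, or the reversed alphabet.

```python
def common_shift(pairs):
    """Return the shift if every (cipher, plain) pair agrees on one, else None."""
    shifts = {(ord(c) - ord(p)) % 26 for c, p in pairs}
    return shifts.pop() if len(shifts) == 1 else None


def is_reversed(pairs):
    """True if every pair fits the reversed alphabet (A=Z, B=Y, ...)."""
    return all((ord(c) - ord("A")) + (ord(p) - ord("A")) == 25 for c, p in pairs)


# Pairs from the HELLO -> KHOOR example: K=H, H=E, O=L, R=O.
caesar_pairs = [("K", "H"), ("H", "E"), ("O", "L"), ("R", "O")]
# Pairs from the reverse-alphabet sample: G=T, S=H, R=I.
reverse_pairs = [("G", "T"), ("S", "H"), ("R", "I")]

print(common_shift(caesar_pairs))   # 3
print(common_shift(reverse_pairs))  # None
print(is_reversed(reverse_pairs))   # True
```

Three or four agreeing pairs are usually enough to make the guess worth trying on the whole text.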

An example

An example will be given here - I will decode it below, but first I will give you only the ciphertext to give you an opportunity to solve it by yourself. The text is in English. This is a simple example - I haven't replaced the spaces or removed all the punctuation, no numbers are used, and only uppercase is used.

Cipher text:
STY NSANXNGQJ GZY ZSSTYNHJI BFYXTS. DTZ INI STY PSTB BMJWJ YT QTTP XT DTZ RNXXJI FQQ YMFY BFX NRUTWYFSY. N HFS SJAJW GWNSL DTZ YT WJFQNEJ YMJ NRUTWYFSHJ TK XQJJAJX, YMJ XZLLJXYNAJSJXX TK YMZRG SFNQX, TW YMJ LWJFY NXXZJX YMFY RFD MFSL KWTR F GTTY QFHJ. STB BMFY IT DTZ LFYMJW KWTR YMFY BTRFSX FUUJFWFSHJ? IJXHWNGJ NY.



Decoding the example:

Even without a frequency analysis, it is immediately obvious that N and F stand for 'I' and 'A' in some order, since these are the only single letters that appear in the text.
Counting the number of times each character appears in the text gives J as the most frequent letter, appearing 27 times, followed by Y and T, which each appear 26 times, and F, which appears 22 times. J=E is a reasonable assumption. Substituting J=E, N=I and F=A (A tends to be more frequent than I, and F is more frequent than N here; if it doesn't work, you can always try switching them) gives the following result. The decoded letters are boldface:

STY ISAIXIGQE GZY ZSSTYIHEI BAYXTS. DTZ III STY PSTB BMEWE YT QTTP XT DTZ RIXXEI AQQ YMAY BAX IRUTWYASY. I HAS SEAEW GWISL DTZ YT WEAQIEE YME IRUTWYASHE TK XQEEAEX, YME XZLLEXYIAESEXX TK YMZRG SAIQX, TW YME LWEAY IXXZEX YMAY RAD MASL KWTR A GTTY QAHE. STB BMAY IT DTZ LAYMEW KWTR YMAY BTRASX AUUEAWASHE? IEXHWIGE IY.
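If you'd rather not count by hand, the letter counts and single-letter words used above can be reproduced with a short Python sketch. It uses only the standard library; the variable names are mine.

```python
import re
from collections import Counter

CIPHERTEXT = (
    "STY NSANXNGQJ GZY ZSSTYNHJI BFYXTS. DTZ INI STY PSTB BMJWJ YT QTTP XT "
    "DTZ RNXXJI FQQ YMFY BFX NRUTWYFSY. N HFS SJAJW GWNSL DTZ YT WJFQNEJ YMJ "
    "NRUTWYFSHJ TK XQJJAJX, YMJ XZLLJXYNAJSJXX TK YMZRG SFNQX, TW YMJ LWJFY "
    "NXXZJX YMFY RFD MFSL KWTR F GTTY QFHJ. STB BMFY IT DTZ LFYMJW KWTR YMFY "
    "BTRFSX FUUJFWFSHJ? IJXHWNGJ NY."
)

# Count letters only; spaces and punctuation are ignored.
letters = Counter(ch for ch in CIPHERTEXT if ch.isalpha())
for letter, count in letters.most_common(4):
    print(letter, count)

# Words that consist of a single letter are candidates for 'A' and 'I'.
words = re.findall(r"[A-Z]+", CIPHERTEXT)
single_letter_words = sorted({w for w in words if len(w) == 1})
print(single_letter_words)  # ['F', 'N']
```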

Something that immediately appears is the three letters YMJ, where the last letter is an E. This is most likely 'THE', which gives Y=T (which agrees with our frequency analysis) and M=H. Plugging this in gives:

STT ISAIXIGQE GZT ZSSTTIHEI BATXTS. DTZ III STT PSTB BHEWE TT QTTP XT DTZ RIXXEI AQQ THAT BAX IRUTWTAST. I HAS SEAEW GWISL DTZ TT WEAQIEE THE IRUTWTASHE TK XQEEAEX, THE XZLLEXTIAESEXX TK THZRG SAIQX, TW THE LWEAT IXXZEX THAT RAD HASL KWTR A GTTT QAHE. STB BHAT IT DTZ LATHEW KWTR THAT BTRASX AUUEAWASHE? IEXHWIGE IT.

'YT' appears twice in the text, and since we know that Y=T, T=O is a reasonable conclusion. 'BMFY' is _HAT, and for this, B=W is a reasonable guess. At this point, if you happen to notice that it is a Caesar cipher, that is, each ciphertext letter is always 5 letters ahead of its plaintext letter, the rest of the message can be decoded immediately. However, if we ignore it, we get:

SOT ISAIXIGQE GZT ZSSOTIHEI WATXOS. DOZ III SOT PSOW WHEWE TO QOOP XO DOZ RIXXEI AQQ THAT WAX IRUOWTAST. I HAS SEAEW GWISL DOZ TO WEAQIEE THE IRUOWTASHE OK XQEEAEX, THE XZLLEXTIAESEXX OK THZRG SAIQX, OW THE LWEAT IXXZEX THAT RAD HASL KWOR A GOOT QAHE. SOW WHAT IO DOZ LATHEW KWOR THAT WORASX AUUEAWASHE? IEXHWIGE IT.
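Intermediate texts like the ones above are tedious to produce by hand. Here is a minimal Python sketch that applies a partial key. One assumption of mine: letters that are not solved yet are printed in lowercase, so that a solved 'A' cannot be confused with a ciphertext 'A'.

```python
def apply_partial_key(ciphertext: str, key: dict) -> str:
    """Replace solved letters with plaintext; show unsolved ones in lowercase."""
    return "".join(key[ch] if ch in key else ch.lower() for ch in ciphertext)


# The substitutions guessed so far (cipher letter -> plain letter).
key = {"J": "E", "N": "I", "F": "A", "Y": "T", "M": "H", "T": "O", "B": "W"}

print(apply_partial_key("YMFY BFX NRUTWYFSY", key))  # THAT WAx IruOwTAsT
```

Adding a new guess is then just one more entry in the key, and a wrong guess is just as easy to take out again.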

Now we can guess X=S (WAX = WAS), Q=L (AQQ = ALL), S=N (SOT, SOW), W=R (WHEWE), which also gives K=F (OK, 2 occurrences), I=D (III, IO), D=Y and Z=U (DOZ, 4 occurrences, possibly 'YOU').

NOT INAISIGLE GUT UNNOTIHED WATSON. YOU DID NOT PNOW WHERE TO LOOP SO YOU RISSED ALL THAT WAS IRUORTANT. I HAN NEAER GRINL YOU TO REALIEE THE IRUORTANHE OF SLEEAES, THE SULLESTIAENESS OF THURG NAILS, OR THE LREAT ISSUES THAT RAY HANL FROR A GOOT LAHE. NOW WHAT DO YOU LATHER FROR THAT WORANS AUUEARANHE? DESHRIGE IT.

Now the text is almost readable. Plugging in the obvious letters (A=V, G=B, H=C, P=K, R=M, U=P, L=G, E=Z) gives the decoded message:

NOT INVISIBLE BUT UNNOTICED WATSON. YOU DID NOT KNOW WHERE TO LOOK SO YOU MISSED ALL THAT WAS IMPORTANT. I CAN NEVER BRING YOU TO REALIZE THE IMPORTANCE OF SLEEVES, THE SUGGESTIVENESS OF THUMB NAILS, OR THE GREAT ISSUES THAT MAY HANG FROM A BOOT LACE. NOW WHAT DO YOU GATHER FROM THAT WOMANS APPEARANCE? DESCRIBE IT.

This is a random extract from 'A Case of Identity', from 'The Adventures of Sherlock Holmes' by Sir Arthur Conan Doyle. This particular excerpt is taken from Wikisource.com.

You can try adding numbers, as well as upper/lower case differentiation, and punctuation into the mix. These can make the code surprisingly complex. I will try to add a few such examples later.

To conclude

Simple substitution ciphers are a wonderfully entertaining type of cipher, especially in a fictional setting. Readers can solve them by themselves, without any specialized software. They can also be surprisingly secure if the amount of ciphertext available is small - which is a good excuse to get your heroes to look for more ciphertext.

Happy writing/reading/ciphering/deciphering!
Falcon-15-X-C
