Encoding vs Encryption vs Hashing

It’s easy to confuse these concepts, as they are often used in conjunction with each other and sound like they mean similar things. However, each has a distinct technical meaning and is appropriate in different contexts.

Each method discussed below uses an algorithm to convert data from one format to another. However, these algorithms differ in the details of their functionality and their overall use cases.

Encoding

In a general sense, encoding means to turn information into code. In computing, encoding is the process of converting digital data into a symbolic representation of that data. Often when people talk about encodings in CS they’re referring to binary-to-text (B2T) encoding, where data is converted to a sequence of printable text characters.

Encoding’s use case is well summarized by this Stack Overflow thread, where user Pooranachandran Muthusamy says,

“Encoding is for maintaining data usability and can be reversed by employing the same algorithm that encoded the content, i.e. no key is used.”

You’re likely familiar with the ASCII and UTF-8 character encoding schemes. These character encodings convert binary numeric values into human readable text characters. For example, under ASCII the binary value for 65 (01000001) is equivalent to the character A.
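The mapping between the character A, the number 65, and the bit pattern 01000001 can be checked with a few lines of Python. This is only a small sketch using built-in functions; the variable names are chosen for illustration.

```python
# Map between a character, its numeric code point, and its bit pattern.
code_point = ord("A")                # character -> number
bits = format(code_point, "08b")     # number -> 8-bit binary string
char = chr(int(bits, 2))             # binary string -> character

print(code_point, bits, char)        # 65 01000001 A

# UTF-8 matches ASCII for ASCII characters, but uses more than one byte
# for characters outside the ASCII range.
ascii_bytes = "A".encode("utf-8")
accented_bytes = "é".encode("utf-8")
print(list(ascii_bytes), list(accented_bytes))   # [65] [195, 169]
```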

Another common encoding that you may run into is base64. Base64 is commonly used to encode files such as images (png, jpeg) as text so that they can be sent over email or embedded within a URL.
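As a rough illustration, the sketch below base64-encodes a few raw bytes with Python's standard `base64` module and decodes them again. The 8-byte PNG file signature stands in for a real image file, and the `data:` URL at the end is just one assumed way such text might be embedded in a URL.

```python
import base64

# Stand-in for a binary file: the 8-byte signature that starts every PNG.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Encode the bytes as printable text; no key is involved.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)                       # iVBORw0KGgo=

# Anyone can reverse the encoding with the same algorithm.
decoded = base64.b64decode(encoded)
print(decoded == raw)                # True

# Illustrative only: embedding the text in a data URL.
data_url = "data:image/png;base64," + encoded
print(data_url)
```

Note that nothing here is secret: base64 makes binary data safe to transport as text, but it offers no confidentiality.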

Encryption

Encryption is the process whereby data is converted into an unreadable format so that it can be transmitted securely and later decrypted and read by the intended recipient. This is accomplished via an encryption algorithm and one or more keys.

Encryption’s use case is also well summarized in the same Stack Overflow thread referenced before.

“Encryption is for maintaining data confidentiality and requires the use of a key (kept secret) in order to return to plaintext.”

Caesar Cipher

One of the oldest and simplest forms of encryption can be found in the Caesar cipher. This simple method of encryption works by shifting the letters in the alphabet to the left or the right by some number of places.

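For example, below is a Caesar cipher with a left shift of three. It’s considered a left shift because each plaintext letter is replaced by the letter three places to its left in the alphabet (D becomes A); undoing the encryption requires a right shift of three.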

Plain:  A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
Cipher: X|Y|Z|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W

In the above case the direction and number of places shifted can be considered the key (L3). Using this ciphered alphabet we can encrypt a message.

Plaintext:  THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG 
Ciphertext: QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD
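The substitution above can be reproduced with a short Python sketch. The function name `caesar` and the convention that a negative shift means "left" are assumptions made for this example, not part of any standard library.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar(text: str, shift: int) -> str:
    """Shift each A-Z letter by `shift` places (negative means left).

    Characters outside A-Z, such as spaces, are left unchanged.
    """
    result = []
    for ch in text:
        if ch in ALPHABET:
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            result.append(ch)
    return "".join(result)

plaintext = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
ciphertext = caesar(plaintext, -3)   # key L3: left shift of three
print(ciphertext)                    # QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD

# Applying the opposite shift recovers the original message.
recovered = caesar(ciphertext, 3)
print(recovered == plaintext)        # True
```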

The Caesar cipher is named after Julius Caesar, who used the cipher to protect messages of military significance.

Symmetric vs Asymmetric Encryption

Nowadays, we have more advanced encryption techniques. Two of the most significant encryption algorithms in use today are AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman, named after its creators).

AES is a symmetric key algorithm, meaning the same key is used to encrypt and decrypt messages, whereas RSA is an asymmetric key algorithm, meaning one key (a public key) is used to encrypt data and another key (a private key) is needed to decrypt it.

There’s a fundamental problem with trying to send symmetrically encrypted messages (aka messages that only require one key to encrypt & decrypt) over an insecure channel. The problem is, how do you establish a shared secret key to begin with?

For example, say we want to send a secret message using the Caesar cipher (which is a symmetric key algorithm). We encrypt our message with a left shift of 17 and then send off the ciphertext to our friend. If our friend does not know the shift amount and direction, that is, if they do not know the shared secret key, then they cannot easily decrypt the message.

But how can we securely send our friend the secret key over an insecure channel? This is known as the key distribution problem, and getting around it led directly to the invention of asymmetric public key cryptography.

With asymmetric encryption, two people can communicate securely without first sharing a secret key, as long as each has their own private / public key pair.

The way it works is that I send you my public key which you then use to encrypt your message and send it back to me. I then get your message and decrypt it using my private key.
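To make the public / private key exchange a little more concrete, here is a toy sketch of textbook RSA using tiny primes. The specific numbers (61, 53, 17) and the function names are assumptions chosen for illustration; real RSA uses enormous primes plus padding schemes and should come from a vetted library, never from code like this.

```python
# Toy textbook RSA with tiny primes. Illustrative only, never secure.
p, q = 61, 53                  # secret primes (far too small for real use)
n = p * q                      # modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, chosen coprime with phi
d = pow(e, -1, phi)            # private exponent: inverse of e mod phi (Python 3.8+)

public_key = (e, n)            # the key I send to you
private_key = (d, n)           # the key I keep to myself

def encrypt(message: int, key: tuple) -> int:
    """Encrypt a number smaller than n with the recipient's public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)

def decrypt(cipher: int, key: tuple) -> int:
    """Decrypt with the matching private key."""
    exponent, modulus = key
    return pow(cipher, exponent, modulus)

message = 65                               # e.g. the code point of "A"
cipher = encrypt(message, public_key)      # you encrypt with my public key
recovered = decrypt(cipher, private_key)   # I decrypt with my private key
print(recovered == message)                # True
```

The point of the sketch is that `public_key` can travel over an insecure channel: knowing it lets someone encrypt, but decrypting requires `private_key`, which never leaves its owner.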

Oftentimes, asymmetric encryption is used initially to establish a secure connection over which a symmetric shared secret key can be sent. The shared secret key can then be used to carry on a symmetrically encrypted conversation, which usually requires less computational overhead.

Hashing

A hash is a fixed-length, effectively unique, and irreversible digest of a particular piece of data, something like a digital fingerprint. (Only effectively unique: two inputs can in principle share a hash, as discussed under MD5 below.) These fingerprints can be used to verify data integrity. For example, in file transfer a file can be hashed before and after the transfer to make sure all of the bits got to the destination in the right order.
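The before-and-after check can be sketched with Python's standard `hashlib` module. The helper name `sha256_hex` and the in-memory "transfer" are assumptions for the example; a real transfer would hash the file on each end.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()

sent = b"The quick brown fox jumps over the lazy dog"
digest_before = sha256_hex(sent)

# Case 1: the data arrives intact.
received_ok = bytes(sent)

# Case 2: a single bit is flipped in transit.
corrupted = bytearray(sent)
corrupted[0] ^= 0x01
received_bad = bytes(corrupted)

print(sha256_hex(received_ok) == digest_before)    # True
print(sha256_hex(received_bad) == digest_before)   # False
```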

Hashing’s use case is also well summarized in the same Stack Overflow thread as referenced before.

“Hashing is for validating the integrity of content by detecting all modification thereof via obvious changes to the hash output.”

Another application of hashing is password authentication. If you create a service that users log into, you never want to store users’ passwords (even if they’re stored in an encrypted format)! Instead, you store each user’s password hash. Then, when the user logs in, you hash the password they supplied and compare it to the hash you have on file.
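A minimal sketch of that store-the-hash, compare-at-login flow is shown below. It goes slightly beyond the text by adding a random per-user salt and using PBKDF2 (a deliberately slow, iterated hash from Python's `hashlib`), which is a common precaution; the function names and the iteration count are assumptions for illustration, not a recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch; tune for real use

def make_password_record(password: str):
    """Return (salt, digest) to store in place of the password itself."""
    salt = os.urandom(16)  # random per-user salt (an addition beyond the text)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def check_password(attempt: str, salt: bytes, stored_digest: bytes) -> bool:
    """Hash the login attempt the same way and compare with the stored digest."""
    attempt_digest = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(attempt_digest, stored_digest)

# At sign-up: store only the salt and digest.
salt, stored = make_password_record("correct horse battery staple")

# At login: re-hash what the user typed and compare.
print(check_password("correct horse battery staple", salt, stored))  # True
print(check_password("wrong guess", salt, stored))                   # False
```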

SHA-2 (aka sha256 & sha512)

One of the most common hashing functions around today is SHA-2 (Secure Hash Algorithm 2). Many things today are hashed using sha256 (64 hexadecimal characters long) or sha512 (128 hexadecimal characters long), and implementations of both are included in the OpenSSL TLS security library. You can find lots of SHA-2 generators online like the one linked below.

https://passwordsgenerator.net/sha256-hash-generator/

You can also use the command line tool `sha256sum` to get the checksum of a particular file.

sha256sum file.txt 
2ee40d7d819041f9acea0d3ba3fbea01837d3c6021552f3a46cf063ed574ec24  file.txt
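For comparison, roughly the same checksum can be computed in Python with `hashlib`. This is a sketch: the helper name `sha256_of_file`, the chunk size, and the temporary demo file are all assumptions for the example, and the file's contents are not those of the `file.txt` shown above.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    demo_path = tmp.name

checksum = sha256_of_file(demo_path)
os.remove(demo_path)

print(checksum)
print(len(checksum))                                 # 64 hex characters
print(len(hashlib.sha512(b"hello\n").hexdigest()))   # 128 hex characters
```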

MD5

An older hashing function that is no longer considered secure is MD5. MD5 produces a 128-bit hash and was once widely used. However, in 2008 the Software Engineering Institute at Carnegie Mellon University declared MD5 “cryptographically broken and unsuitable for further use.” This was largely due to MD5’s poor resistance to hash collisions.

A hash collision is when two different inputs hash to the same value. Hash collisions can be leveraged to gain access to accounts, fake file checksums, and fake the validity of SSL certificates, among other things.
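Finding a real MD5 collision takes specialized techniques, but the idea is easy to see on a deliberately weakened hash. The sketch below truncates SHA-256 to its first two bytes (an assumption made purely so a collision turns up quickly) and searches until two different inputs share the same short hash. It does not attack MD5 or full SHA-256.

```python
import hashlib

def short_hash(data: bytes, n_bytes: int = 2) -> str:
    """A deliberately weak hash: the first `n_bytes` of SHA-256, as hex."""
    return hashlib.sha256(data).hexdigest()[: n_bytes * 2]

def find_collision(n_bytes: int = 2):
    """Try inputs "0", "1", "2", ... until two of them share a short hash."""
    seen = {}
    counter = 0
    while True:
        candidate = str(counter).encode("ascii")
        h = short_hash(candidate, n_bytes)
        if h in seen:
            return seen[h], candidate, h
        seen[h] = candidate
        counter += 1

first, second, shared = find_collision()
print(first, second, shared)
print(first != second and short_hash(first) == short_hash(second))   # True
```

With only 65,536 possible two-byte outputs, a collision typically appears after a few hundred attempts; the longer and better designed the hash, the less feasible this kind of search becomes.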

Additionally, many MD5 hashes now exist in rainbow tables, or lists of pre-hashed passwords and their corresponding plain text values. The classic online MD5 rainbow table is crackstation.net, but there are many more.
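The "list of pre-hashed passwords" idea can be sketched as a plain dictionary from hash to plain text. The password list here is hypothetical and tiny, and a true rainbow table uses a more compact chain-based structure rather than a simple dictionary, so treat this as an illustration of the lookup only.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical list of common passwords, hashed ahead of time.
common_passwords = ["123456", "password", "qwerty", "letmein"]
lookup = {md5_hex(p): p for p in common_passwords}

# An attacker who obtains an unsalted MD5 hash simply looks it up.
leaked_hash = md5_hex("letmein")
cracked = lookup.get(leaked_hash)
print(cracked)          # letmein

# A password that is not in the table is not recovered this way.
not_found = lookup.get(md5_hex("correct horse battery staple"))
print(not_found)        # None
```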

Public Key Fingerprints

A public key fingerprint is a shortened form of a much longer public key. The public key fingerprint is generated by hashing the full public key value. This way a unique fixed length identifier for the long form public key can be created and used to ensure the full key you have is accurate.

We can use the command below to request a server's public SSH key.

ssh-keyscan -t rsa www2.g22.pair.com
www2.g22.pair.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC10E3fqB0KfiQJ8klwlOAEi8+79vchyc8xegmeOnGTcwsIfxBcD6NtAnC018g1U0Z0ewTL3iiu6WoaH36Zw20Vv1sU/CJXCiTnezOM47hCGd4Axpv6j2tUAh0I8Zmf4CJJI1Qd54y0rQn2qzRV5gCow8RdB1t1uH8raKzlJ59DSZqBzsCzOnbLltSP7c9J+vyRfuURkb0C7PdI90OGgLmZ4+eRFBcOaoRihnrKe0+XqNLu4ndscG50iQ+rO5O3NncxJTOLOb6aUMjLpW2WTTnflai9ALUQhcRuevKjYVJoWrt9p8vdRr5B2eLzxR1i1AVwCPWkeepqdt0HfBHzrWr3

As you can see, it’s very long. So instead we can use the command below to generate the sha256 hash (aka the public key fingerprint) of the full public key.

ssh-keygen -E sha256 -lf <(ssh-keyscan -t rsa www2.g22.pair.com) | awk '{print $2}' | cut -d: -f2
# www2.g22.pair.com:22 SSH-2.0-OpenSSH_8.8
23bAFQVFbO3ZAusDt2HphM0MAGu0BC+Em/WRzMEi8fU

We can compare this fingerprint against the sha256 fingerprint stored on the server to verify the server's identity. You can run the command below on the server to get the server's public key fingerprint.

ssh-keygen -E sha256 -lf /etc/ssh/ssh_host_rsa_key.pub | awk '{print $2}' | cut -d: -f2
23bAFQVFbO3ZAusDt2HphM0MAGu0BC+Em/WRzMEi8fU

As you can see the two fingerprints are identical, so we know the server is who it says it is.

Key fingerprints are often encoded in hexadecimal. We can use the pipeline below, taken from this Stack Exchange post, to output the hex value of our SSH key fingerprint.

ssh-keyscan -t rsa www2.g22.pair.com | awk '{print $3}'| base64 -d | sha256sum -b | awk '{print $1}' | sed 's/../&:/g' | sed 's/:$//' | fold -w 62
# www2.g22.pair.com:22 SSH-2.0-OpenSSH_8.8
db:76:c0:15:05:45:6c:ed:d9:02:eb:03:b7:61:e9:84:cd:0c:00:6b:b4
:04:2f:84:9b:f5:91:cc:c1:22:f1:f5
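The base64 and hexadecimal fingerprints above are two encodings of the same 32-byte SHA-256 digest, which the Python sketch below tries to make explicit. The function names are assumptions for the example, and the demo key line uses a made-up key blob rather than a real server key.

```python
import base64
import hashlib

def fingerprint_from_key_line(line: str) -> bytes:
    """Hash the key blob from a line like 'host ssh-rsa AAAA...'.

    The third whitespace-separated field is the base64-encoded public key.
    """
    blob_b64 = line.split()[2]
    return hashlib.sha256(base64.b64decode(blob_b64)).digest()

def as_base64(digest: bytes) -> str:
    """Unpadded base64, matching the fingerprints shown earlier."""
    return base64.b64encode(digest).decode("ascii").rstrip("=")

def as_colon_hex(digest: bytes) -> str:
    """Colon-separated lowercase hex bytes."""
    return ":".join(f"{b:02x}" for b in digest)

def base64_fingerprint_to_hex(fingerprint: str) -> str:
    """Convert an unpadded base64 fingerprint to colon-separated hex."""
    padded = fingerprint + "=" * (-len(fingerprint) % 4)
    return as_colon_hex(base64.b64decode(padded))

# The base64 fingerprint from earlier decodes to the hex form shown above.
print(base64_fingerprint_to_hex("23bAFQVFbO3ZAusDt2HphM0MAGu0BC+Em/WRzMEi8fU"))

# Demo with a made-up key blob (not a real SSH key).
demo_line = "example.test ssh-rsa " + base64.b64encode(b"not a real key").decode("ascii")
demo_digest = fingerprint_from_key_line(demo_line)
print(as_base64(demo_digest))
print(as_colon_hex(demo_digest))
```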

Sources

https://en.wikipedia.org/wiki/Base64

https://en.wikipedia.org/wiki/Binary-to-text_encoding

https://en.wikipedia.org/wiki/Character_encoding

https://en.wikipedia.org/wiki/Cipher

https://stackoverflow.com/questions/4657416/difference-between-encoding-and-encryption

https://en.wikipedia.org/wiki/Caesar_cipher

https://en.wikipedia.org/wiki/Advanced_Encryption_Standard

https://en.wikipedia.org/wiki/RSA_(cryptosystem)

https://en.wikipedia.org/wiki/Symmetric-key_algorithm

https://en.wikipedia.org/wiki/Public-key_cryptography

https://en.wikipedia.org/wiki/SHA-2

https://en.wikipedia.org/wiki/MD5

https://en.wikipedia.org/wiki/Hash_collision

https://en.wikipedia.org/wiki/Rainbow_table

https://en.wikipedia.org/wiki/Public_key_fingerprint