Guide: Video Encoding, Part 1 - The Basics
Almost every time you pull out your smartphone to record a vlog, sit down in front of a webcam to stream, or roll out a cinema cam to shoot a music video or commercial, the footage you’re capturing is getting encoded, sometimes more than once. But what exactly does it mean when we talk about videos being encoded?
Video encoding can get really complicated in a hurry, and the answer to “what’s best” will almost always be “it depends”. Before I dive into the numbers and metrics, this post sets up the basics: the terminology and the workflow of how the different pieces fit together.
Encoding
Abstractly, this refers to the entire process of writing out a video file (see the sketch after this list):
Using a specific scheme (codec)
In a specific way (container)
With specific settings (framerate, color depth, sample rate, etc.)
With or without audio (the audio will have its own specific settings as well: sample rate, bit-depth, channels, etc.)
With or without extra metadata (captions, keyword metadata, tags, embedded thumbnails, etc.)
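To make that concrete, here’s a minimal sketch of an encode job using ffmpeg driven from Python. The filenames and settings are hypothetical choices for illustration; the flags are standard ffmpeg options:

```python
import subprocess

# A minimal sketch of an encode job, assuming ffmpeg is installed and
# on your PATH. "input.mov" and "output.mp4" are hypothetical names.
subprocess.run([
    "ffmpeg",
    "-i", "input.mov",       # source file
    "-c:v", "libx264",       # scheme: the H.264 codec via the x264 encoder
    "-r", "30",              # setting: 30 fps framerate
    "-pix_fmt", "yuv420p",   # setting: pixel format / chroma subsampling
    "-c:a", "aac",           # audio: AAC ...
    "-ar", "48000",          # ... at a 48 kHz sample rate
    "-ac", "2",              # ... with 2 channels (stereo)
    "output.mp4",            # the .mp4 extension selects the container
], check=True)
```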
Transcoding
A portmanteau of ‘transform’ and ‘encoding’, transcoding describes an encode where one or more properties of the video are being changed. Two really common reasons to transcode videos are:
Changing the video’s resolution to be appropriate for the playback device. This can be downscaling (4K source footage to play back on a 1080p screen) or upscaling (taking 1080p content and sizing it up for 4K).
For compatibility: some phones or TVs won’t recognize certain file formats (containers) or encoding schemes (codecs), so by transcoding we can change these attributes (see the sketch after this list).
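Here’s what a typical compatibility transcode might look like, again as a hedged sketch using ffmpeg from Python with hypothetical filenames:

```python
import subprocess

# A sketch of a compatibility transcode, assuming ffmpeg is installed.
# Filenames are hypothetical: downscale 4K footage to 1080p and switch
# to a widely supported codec and container.
subprocess.run([
    "ffmpeg",
    "-i", "vacation_4k.mkv",
    "-vf", "scale=1920:1080",  # change the resolution (downscale)
    "-c:v", "libx264",         # re-encode the video as H.264
    "-c:a", "copy",            # leave the audio stream untouched
    "vacation_1080p.mp4",      # and change the container to MP4
], check=True)
```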
Generally speaking, encoding and transcoding can be used interchangeably. The term video compression is sometimes thrown into the mix as well, since reducing file size is a big motivating factor for video encoding.
It’s important to note that just because you encode/compress a video, the resulting file is not guaranteed to be smaller. It’s quite possible to get output files that are substantially larger than the original video.
Codec
A portmanteau of coder/decoder, this represents how the data in a file is stored. There are a few characteristics that set different codecs apart:
How compressive it is (how small we can make our output file)
What kind of image quality we can get at a given setting
How fast (and how computationally demanding) the encode process is
How easy it is to decode for playback
Picking the ‘best codec’ to use will be a balancing act, one I’ll go into more deeply in the next guide. Some examples of codecs are H264/AVC, VP9, AV1 and WVC1.
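As a rough illustration of the compression-versus-speed tradeoff, the sketch below (assuming an ffmpeg build that includes libx264 and libaom-av1; the clip name is hypothetical) encodes the same clip with two codecs. H.264 finishes quickly with moderate compression, while AV1 takes far longer but typically produces a smaller file at comparable quality:

```python
import subprocess

clip = "clip.mp4"  # a hypothetical source clip

# H.264: encodes quickly, plays back almost anywhere, moderate compression.
subprocess.run(["ffmpeg", "-i", clip, "-c:v", "libx264", "-crf", "23",
                "h264_version.mp4"], check=True)

# AV1: encodes far more slowly, but usually yields a smaller file at a
# comparable quality. (-b:v 0 enables libaom's constant-quality mode.)
subprocess.run(["ffmpeg", "-i", clip, "-c:v", "libaom-av1", "-crf", "30",
                "-b:v", "0", "av1_version.mp4"], check=True)
```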
Container
A video file is often more than ‘just the video content’. A modern video file might contain a whole bunch of resources:
The video stream data
Zero, one, or more streams of audio data
Zero, one, or more sets of subtitle or captioning data
Metadata like thumbnails or display tagging information
All of this data can be packaged up into a single file — the container. Containers are often described by their file extension: MKV, MOV, AVI and MP4 are some common examples.
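You can peek inside a container and list these streams yourself with ffprobe, which ships alongside ffmpeg. A minimal sketch, with a hypothetical filename:

```python
import subprocess

# List every stream packed inside the container: video, audio,
# subtitles, and so on. "movie.mkv" is a hypothetical file.
subprocess.run([
    "ffprobe", "-v", "error",
    "-show_entries", "stream=index,codec_type,codec_name",
    "-of", "csv=p=0",
    "movie.mkv",
], check=True)
```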
In general, the container type does not dictate the codec you have to use. More often than not, you’ll want to consider the device on which you intend to view the video: a basic media player box may only recognize certain containers and codecs, whereas a computer will generally handle anything you throw at it. A service like YouTube behaves like a computer: almost all pairings of containers and codecs are supported.
Encoders (and Decoders) & Acceleration
Fundamentally, a decoder is just something that allows a player to play a video, and an encoder is something that allows a video to be encoded (written) using a specific codec. Encoders and decoders can be either general-purpose (software) or dedicated (hardware) implementations.
A software implementation will generally work ‘everywhere’ (that the software can run), and how well it works (its performance) will vary depending on the amount of computing power you have at your disposal. Relatively speaking, a software solution will be much more taxing.
A hardware solution will generally be highly performant but requires that you have a specific, physical thing. Hardware implementations are binary: either you have the required hardware component (and the right one), or you don’t. A hardware solution will almost always be faster than a software/CPU solution, which is why the term hardware accelerated is used to describe it. Often, this hardware component comes bundled with your video card, so sometimes it’s described as using the GPU.
To make things a bit more confusing, people will often refer to tools like Handbrake, StaxRip, Adobe Media Encoder, or ffmpeg as encoders. Technically speaking, those are front ends that allow users to put together and run encode jobs. The actual encoders are usually hidden away: the user selects a codec and the application knows which encoder it will use.
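ffmpeg makes this distinction visible: you select the encoder, not just the codec. Both commands in the sketch below produce H.264 video; the first uses the libx264 software encoder, the second NVIDIA’s NVENC hardware encoder (assuming a supported NVIDIA GPU; filenames are hypothetical):

```python
import subprocess

src = "talk.mov"  # hypothetical source file

# Software (CPU) encoder for the H.264 codec.
subprocess.run(["ffmpeg", "-i", src, "-c:v", "libx264",
                "talk_cpu.mp4"], check=True)

# Hardware (GPU) encoder for the same codec; requires an NVIDIA card
# with NVENC support.
subprocess.run(["ffmpeg", "-i", src, "-c:v", "h264_nvenc",
                "talk_gpu.mp4"], check=True)
```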
Multiplexers and Demultiplexers
I mentioned above that video files are just containers; they hold the audio and video streams, subtitles, etc. Before we can begin to encode our videos, we need to be able to ‘open up the container’, and that’s what a demultiplexer/demuxer does. Conversely, at the end of an encode process, we need to ‘put everything back into a container’, and that’s what a multiplexer/muxer does.
Tip: If you just need to change the format of a file (i.e., so that an application or device can recognize it), all you need to do is demux the original file and mux it back into the file format you want. Because nothing is re-encoded, this is orders of magnitude faster than going through an encode process.
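In ffmpeg, that whole demux-and-remux round trip is a single command using stream copy. A minimal sketch with hypothetical filenames:

```python
import subprocess

# Remux: demux the MKV and mux the streams into an MP4, copying every
# stream bit-for-bit. No quality is lost and it runs about as fast as
# a file copy. Filenames are hypothetical.
subprocess.run([
    "ffmpeg",
    "-i", "recording.mkv",
    "-c", "copy",          # copy all streams as-is; no re-encoding
    "recording.mp4",
], check=True)
```

One caveat: not every codec is allowed in every container, so the occasional remux will fail and a real transcode is needed.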
Unless you’re encoding your videos using a lossless format (beyond the scope of this guide), every time you encode or transcode, you will lose quality. You will lose image quality even if:
You increase the quality setting of the codec or add more bitrate, or
You switch to a more advanced video codec, or
You switch to a more advanced encoder
The only time you won’t lose image quality is if you transcode to a lossless codec; even then, the best you’ll do is preserve the same image quality.
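For completeness, here’s what a lossless transcode might look like, as a sketch assuming an ffmpeg build with the FFV1 encoder (filenames hypothetical). The output is pixel-identical to the input, but expect a much larger file than any lossy encode:

```python
import subprocess

# A lossless transcode: the decoded output is pixel-identical to the
# input, but the file will typically be far larger than a lossy encode.
# Filenames are hypothetical.
subprocess.run([
    "ffmpeg",
    "-i", "master.mov",
    "-c:v", "ffv1",        # FFV1 is a lossless video codec
    "-c:a", "copy",
    "master_lossless.mkv",
], check=True)
```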
Future tech: there are some really interesting developments with respect to using AI to upscale video. When we encode an upscaled image, though, we’re no longer comparing apples to apples (and with every encode after that, you’ll still lose quality).
Maintaining maximum quality is not without drawbacks though, the biggest being sheer disk space: just look at some of the currently popular (and upcoming) recording modes and the ridiculous amount of data required to capture it all.
Without encoding, any time we press record, we would potentially be moments away from running out of storage! Thankfully, virtually every camera has a very efficient built-in encoder. In ideal conditions, compression ratios in the 1:1000 range are possible. It’s because of these built-in hardware encoders that we’re able to vlog at 4K using our smartphones!
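To put ratios like that in perspective, here’s some back-of-the-envelope math. The resolution, framerate, and bitrate figures below are illustrative assumptions, not specs from any particular camera:

```python
# Back-of-the-envelope math for uncompressed 4K, 60 fps, 10-bit 4:2:2
# video. These figures are illustrative, not from a specific camera.
width, height = 3840, 2160
fps = 60
bits_per_pixel = 20  # 10-bit 4:2:2 averages 20 bits per pixel

raw_bps = width * height * fps * bits_per_pixel
print(f"Uncompressed: {raw_bps / 1e9:.1f} Gbit/s")  # ~10 Gbit/s

encoded_bps = 10e6  # a streaming-grade bitrate of ~10 Mbit/s
print(f"Ratio: {raw_bps / encoded_bps:.0f}:1")      # ~1000:1
```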
Even with these astonishing compression ratios coming out of phones and cameras, if you capture a lot of video, the total storage footprint eventually becomes a problem. There are a bunch of obvious scenarios where you need to stay on top of your storage needs:
24x7 security camera footage that you might need to hold on to for a certain duration
Dash-cam footage from your fleet of vehicles that you might need to hold on to for a certain duration
You’re an active streamer or YouTuber and you want to keep a copy of your videos in case something happens
You film coursework for online learning
You want to get into filmmaking and have a ton of multi-angle/B-roll footage
You like to vlog regularly
This is where we start to look at transcoding our content: taking that one-time (hopefully small!) image quality hit to substantially reduce the storage footprint.
Every codec is a compromise of compression, quality, speed, and ease of playback; the impact of these compromises will vary depending on your specific needs. To start, ask yourself:
How much compression do you need?
How long will you keep this video for?
How will the video be consumed now?
How critical is image quality?
What resources do you have available to encode?
The first step is to determine whether or not GPU encoding is a good fit. There are some easy options:
Streamers: short of a dedicated recording machine, almost every other scenario will likely be best served by GPU encoding
Content creators: this will vary depending on how fast you have to get content delivered and whether or not you will ever go back to reference it later, perhaps as B-roll. In most cases, at higher settings, GPU encoding will offer enough compression and performance to justify any hit to quality
Security or dashcam footage: if the priority is getting the footage offsite, then GPU encoding is a no-brainer; the encoding time spent crushing down the file size will be more than paid back in savings when uploading the file
CPU vs GPU
CPU encoding will categorically give you better image quality and almost certainly a substantially more compressed file. You also have more codecs to choose from (ones that don’t need ‘special’ hardware), and as you upgrade your equipment you can keep your existing workflow - everything just runs faster.
GPU encoding, though, is stupid-fast. While you won’t get the same amount of compression as with CPU encoding, you can still get enough compression that it may not matter. And while the image quality will never match CPU encoding, as you work with higher-resolution videos there’s more leeway (quality-wise) in what you can get away with. This ultimately comes back to just how silly-fast the encode process is.
Depending on your specific scenario, the best suggestion is to run some tests that represent your typical content and your typical workflow to narrow down which codec and encoder make the most sense.
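As a starting point, here’s a minimal sketch of such a test in Python, assuming ffmpeg is installed with the listed encoders and that you have a short, representative sample clip (all names are hypothetical). It encodes the clip with a few codec/encoder combinations and reports wall-clock time and output size:

```python
import os
import subprocess
import time

sample = "sample_clip.mp4"  # a short, representative test clip

candidates = {
    "x264_cpu.mp4": ["-c:v", "libx264", "-crf", "20"],
    "x265_cpu.mp4": ["-c:v", "libx265", "-crf", "22"],
    "h264_gpu.mp4": ["-c:v", "h264_nvenc"],  # needs an NVIDIA GPU
}

for out_file, args in candidates.items():
    start = time.monotonic()
    subprocess.run(["ffmpeg", "-y", "-i", sample, *args, out_file],
                   check=True)
    elapsed = time.monotonic() - start
    size_mb = os.path.getsize(out_file) / 1e6
    print(f"{out_file}: {elapsed:.1f}s, {size_mb:.1f} MB")
```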
This is something I’ll cover in much more depth in a future post.