Basic Knowledge of Video Codec Technology From Video Formats to Frame Formats

Video encapsulation format

MP4, AVI, and MKV are all file extensions for local video files; Windows uses them to decide which application should open a file. In the streaming media world, they are known as "video encapsulation formats" (container formats) because, in addition to the audio and video streams, they contain ancillary information and define how the audio and video are organized. The user experience of videos in different formats on different platforms varies largely because of these differences in organization. To playback software, a video format is an identifier that tells it how to read and play the file. In short, the video format specifies the communication protocol with the player.

The "video encapsulation format" is a "packaging" of the encoded video and audio together with playback-related protocol data.
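Since an encapsulation format is defined by how the file is organized, a player can recognize it from the file contents rather than the extension. The sketch below is a minimal illustration, not a complete detector: the function name `sniff_container` is hypothetical, and it only checks the well-known leading bytes of MP4-family, AVI, and Matroska files.

```python
def sniff_container(header: bytes) -> str:
    """Guess the container from the first bytes of a file (illustrative only)."""
    # MP4-family (ISO base media) files normally begin with an "ftyp" box:
    # 4 bytes of box size followed by the ASCII tag "ftyp".
    if len(header) >= 8 and header[4:8] == b"ftyp":
        return "mp4"
    # AVI is a RIFF file whose form type (bytes 8..11) is "AVI ".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "avi"
    # Matroska (MKV) begins with the EBML magic number; WebM shares it.
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return "mkv"
    return "unknown"


# Synthetic headers, built by hand for the example.
print(sniff_container(b"\x00\x00\x00\x18ftypisom"))       # mp4
print(sniff_container(b"RIFF\x00\x00\x00\x00AVI LIST"))   # avi
print(sniff_container(b"\x1a\x45\xdf\xa3\x01\x00"))       # mkv
```

Real players go further and parse the container's internal structure to locate the audio and video streams.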

The "video protocol" applies to web streaming media, and some articles classify it as a kind of "video encapsulation format." Both carry audiovisual content and metadata, as well as other information required by the protocol or format. FFmpeg, for example, does not distinguish between video formats and video protocols; with GStreamer, however, you may need to specify the "video protocol," while the "video encapsulation format" does not need to be distinguished.

Video streams

In terms of video streams, you have probably heard terms like "H.264 stream," "YUV stream," "encoded stream," "decoded stream," "raw stream," "compressed stream," and "uncompressed stream." In general, "video streams" come in only two forms:

1. Compressed streams, that is, encoded streams (such as H.264);

2. Uncompressed streams, that is, decoded streams (such as YUV).

Stream data that has not been compressed, that is, decoded data, is usually referred to as a "raw stream." A video can be thought of as a series of "images" that are continuous in time, and because those images are stored internally in "YUV" format, it is often also called a "YUV stream."

Stream data that has been compressed by a compression algorithm is referred to as an "encoded stream," and because the H.264 compression/encoding algorithm is so prevalent, it is also commonly known as an "H.264 stream."

Summary: "H.264 stream," "encoded stream," and "compressed stream" refer to video streams that have been compressed/encoded, while "YUV stream," "decoded stream," and "uncompressed stream" refer to video streams that have not. "Raw stream" is ambiguous and depends on context: it can refer to either the former or the latter.

In daily life, the vast majority of video files we encounter are encoded/compressed, and the same is true of video sent over a network. During playback, the viewer sees the decoded video stream, converted frame by frame to RGB.

Encoding/compression is a very important technology in the field of streaming media: the process of turning an H.264 bitstream into a YUV stream is called decoding, and the reverse is called encoding.

Frame format

In the field of streaming media, the "stream" is important, and its basic element, the "frame," is equally important. The core of video encoding/compression is to encode a series of temporally continuous frames into as little space as possible, while video decoding tries to restore a series of encoded/compressed frames as closely as possible to their original state.

Encoding/compression algorithms whose output can be restored to exactly the original data are called lossless; those that cannot are called lossy.

A "frame" can be thought of as an "image" of the kind we normally see, except that the images we normally encounter are in RGB format, whereas video frames are usually in YUV format.

There are two reasons why video frames use YUV rather than RGB.

First, it allows a high compression rate while minimizing the distortion perceived by the human eye. Of the three "YUV" channels, "Y" represents brightness (luminance, or luma), that is, the grayscale value, while "U" and "V" represent chrominance (chroma). The human eye is far less sensitive to U and V than to Y, so the values of the two U/V channels can be compressed heavily. See "Video Codec Learning: YUV Format."

Second, it keeps the signal backward compatible with black-and-white TVs, which only need the Y channel. This is a historical reason; for the background, the author strongly recommends "Zero-Based Introduction to Audio and Video Development." When video frame formats were first being proposed there were suggestions to use RGB, and this compatibility requirement is the real reason YUV was ultimately chosen.
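To make the split between brightness and chrominance concrete, here is a minimal sketch that converts one RGB pixel to Y/Cb/Cr. The article does not say which conversion matrix it has in mind; the coefficients below are the full-range BT.601 ones used by JPEG/JFIF, chosen here as an assumption, and `rgb_to_ycbcr` is a hypothetical helper name.

```python
from typing import Tuple


def rgb_to_ycbcr(r: int, g: int, b: int) -> Tuple[int, int, int]:
    """Convert one 8-bit RGB pixel to 8-bit Y/Cb/Cr (full-range BT.601, assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b            # brightness (grayscale value)
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference chroma (U)
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference chroma (V)

    def clamp(v: float) -> int:
        return max(0, min(255, round(v)))

    return clamp(y), clamp(cb), clamp(cr)


# Gray pixels carry no color: both chroma channels sit at the neutral value 128.
print(rgb_to_ycbcr(255, 255, 255))  # (255, 128, 128)
print(rgb_to_ycbcr(0, 0, 0))        # (0, 128, 128)
print(rgb_to_ycbcr(255, 0, 0))      # pure red: low Y, high Cr
```

A black-and-white display can simply show the first value of each triple and ignore the other two.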

YUV sampling: "YUV444" "YUV422" "YUV420"

When you see "YUV" followed by a string of digits, "YUV" no longer refers to the color space itself but to a way of sampling the original "YUV stream."

444, 422, and 420 are three "YUV" (which in digital circuits means YCbCr) sampling schemes; the three digits describe the sampling ratios of the Y/U/V channels (Y/Cb/Cr in digital circuits, and likewise in the rest of this section). So 444 is full sampling, while 422 is full sampling of Y and 1/2 uniform sampling of U and V.

A frame is a rectangle of pixels, e.g. a 4x4 image is made up of 16 pixels. In the usual "RGB" images, each pixel is composed of at least three channels, R/G/B (and in some cases an alpha component), and the value of each component is usually in [0, 255], that is, [0, 2^8 - 1], so it is often said that a pixel occupies 3 bytes (with additional components, such as RGBA, the size differs). The same applies to a "YUV" image, in which each pixel consists of Y/U/V components.

Taking a 4x4 image as an example, the figure below corresponds to YUV444 sampling, i.e. full sampling: the Y/U/V components of every pixel are preserved. In general, YUV444 data is too large, so it is seldom used.

[Figure: YUV444 sampling of a 4x4 image]

The figure below corresponds to YUV422 sampling: for every two adjacent pixels on each scan line (row), only the U/V components of one pixel are kept. Each pair of pixels therefore stores 4 values instead of 6, so each pixel occupies 2/3 of its original size and YUV422 is 2/3 the size of YUV444. When the image is reconstructed, the Y components of the two adjacent pixels share the retained U/V components.

[Figure: YUV422 sampling of a 4x4 image]

The figure below shows YUV420 sampling: every other line is sampled as in YUV422, i.e. two neighbouring pixels keep the U/V components of only one pixel, and the line after it discards all of its U/V components. Each 2x2 block therefore stores 6 values instead of 12, so each pixel occupies 1/2 of its original size and YUV420 is 1/2 the size of YUV444. The U/V components are recovered much as in YUV422, except that here all four pixels of a 2x2 block share the retained U/V components.

[Figure: YUV420 sampling of a 4x4 image]
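The size ratios of the three sampling schemes can be checked with a little arithmetic. This is a minimal sketch assuming 8 bits per sample and even image dimensions; `frame_bytes` is a hypothetical helper name.

```python
def frame_bytes(width: int, height: int, sampling: str) -> int:
    """Bytes in one 8-bit YUV frame for "444", "422" or "420" sampling."""
    # How many pixels share one U sample and one V sample.
    pixels_per_chroma_sample = {"444": 1, "422": 2, "420": 4}[sampling]
    luma = width * height                            # one Y per pixel
    chroma = 2 * (luma // pixels_per_chroma_sample)  # U plane + V plane
    return luma + chroma


for scheme in ("444", "422", "420"):
    print(scheme, frame_bytes(4, 4, scheme))
# 444 48
# 422 32
# 420 24
```

For the 4x4 example image this gives 48, 32, and 24 bytes, matching the 2/3 and 1/2 ratios relative to YUV444.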

This design approach is clever! As mentioned earlier, the human eye is least sensitive to U/V, so the values of the U/V channels can be compressed heavily. In addition, the color and saturation of adjacent pixels in an image are generally very close, so it is reasonable to use a 2x2 block as the basic unit and retain only one set of U/V components for it.
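The 2x2 sharing described above can be sketched on a single chroma plane. This is an illustration only: it keeps the top-left sample of each block, as in the description, whereas real encoders may average or filter the samples instead. The function names are hypothetical.

```python
from typing import List

Plane = List[List[int]]


def subsample_420(plane: Plane) -> Plane:
    """Keep one chroma sample (the top-left one) from every 2x2 block."""
    return [row[0::2] for row in plane[0::2]]


def upsample_420(small: Plane) -> Plane:
    """Rebuild a full-size plane: all four pixels of a block share one sample."""
    full = []
    for row in small:
        wide = [value for value in row for _ in range(2)]  # repeat horizontally
        full.append(wide)
        full.append(list(wide))                            # repeat vertically
    return full


u_plane = [
    [10, 11, 20, 21],
    [12, 13, 22, 23],
    [30, 31, 40, 41],
    [32, 33, 42, 43],
]
kept = subsample_420(u_plane)
print(kept)                # [[10, 20], [30, 40]]
print(upsample_420(kept))  # each kept value fills its 2x2 block
```

Because neighbouring chroma values are usually close (10 to 13 in the first block here), the reconstructed plane differs only slightly from the original.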

Storage format of YUV

There are two storage formats for YUV:

The planar format stores the Y components of all pixels consecutively first, then the U components, and finally the V components.

The packed format stores the Y, U, and V components of each pixel consecutively, interleaved with one another.

Depending on the sampling method and storage format, there are several YUV formats. They are mainly based on YUV 4:2:2 and YUV 4:2:0 sampling.

Common formats based on YUV 4:2:2 sampling are shown in the following table:

[Table: common formats based on YUV 4:2:2 sampling]

Common formats based on YUV 4:2:0 sampling are shown in the following table:

[Table: common formats based on YUV 4:2:0 sampling]

There is also a third storage method, semi-planar mode, which the author estimates is used less often and which many articles therefore ignore. The packed (interleaved) format stores the Y/U/V components together, while the planar format strictly separates them. The semi-planar format sits in between: the Y components are stored separately, and the U/V components are stored interleaved.
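The three storage layouts can be compared by laying out the same samples as bytes. The sketch below uses three widely used format names as examples (I420 for planar 4:2:0, NV12 for semi-planar 4:2:0, YUYV for packed 4:2:2); these names are an assumption on my part and may not be the ones listed in the original tables. The helper names are hypothetical.

```python
from typing import Sequence


def pack_i420(y: Sequence[int], u: Sequence[int], v: Sequence[int]) -> bytes:
    """Planar: all Y samples, then all U samples, then all V samples."""
    return bytes(y) + bytes(u) + bytes(v)


def pack_nv12(y: Sequence[int], u: Sequence[int], v: Sequence[int]) -> bytes:
    """Semi-planar: all Y samples, then U and V interleaved as U0 V0 U1 V1 ..."""
    uv = bytearray()
    for cb, cr in zip(u, v):
        uv += bytes([cb, cr])
    return bytes(y) + bytes(uv)


def pack_yuyv(y: Sequence[int], u: Sequence[int], v: Sequence[int]) -> bytes:
    """Packed 4:2:2: Y0 U0 Y1 V0 for every pair of horizontally adjacent pixels."""
    out = bytearray()
    for i in range(0, len(y), 2):
        out += bytes([y[i], u[i // 2], y[i + 1], v[i // 2]])
    return bytes(out)


# A 4x2 image in 4:2:0: 8 Y samples, 2 U samples, 2 V samples.
y420, u420, v420 = [1, 2, 3, 4, 5, 6, 7, 8], [50, 51], [90, 91]
print(list(pack_i420(y420, u420, v420)))  # Y..., 50, 51, 90, 91
print(list(pack_nv12(y420, u420, v420)))  # Y..., 50, 90, 51, 91

# One row of 4 pixels in 4:2:2: 4 Y samples, 2 U samples, 2 V samples.
print(list(pack_yuyv([1, 2, 3, 4], [50, 51], [90, 91])))
```

All three functions receive the same sample values; only the byte order in memory changes.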

Common frame terms

Frame rate (FPS)

"Frame rate," or FPS, stands for Frames Per Second. It refers to the number of frames transmitted, or displayed, per second. Generally speaking, the frame rate determines how smooth the picture is: the higher the frame rate, the smoother the picture; the lower the frame rate, the choppier the picture. A more authoritative statement: when the video frame rate is not lower than 24 fps, the human eye perceives the video as continuous, a phenomenon known as "persistence of vision." Hence the saying that although a higher frame rate gives smoother motion, 24 fps is sufficient in many practical applications.

Resolution

"Resolution" is also commonly referred to as the "dimensions" or "size" of an image. It refers to the number of pixels contained in one frame, with common specifications such as 1280x720 (720P) and 1920x1080 (1080P). Resolution determines the size of the image: the higher the resolution, the larger the image, and the lower the resolution, the smaller the image.

Bit rate (BPS)

"Bit rate," or BPS, stands for Bits Per Second. It refers to the number of bits of data transmitted per second, commonly measured in kilobits per second (kbps) and megabits per second (Mbps). This concept really needs to be explained clearly.

There are different opinions on the Internet. Some people say that the bit rate is proportional to file size: the higher the bit rate, the larger the file, and the lower the bit rate, the smaller the file. Others say that a higher bit rate means more data is sampled per unit of time and the data stream is more precise, so the picture is clearer and of higher quality. Still others say that the bit rate is a measure of "distortion." The author did not previously understand why transferring more data per second necessarily means a sharper picture, or how that relates to file size. The statements are compatible: file size is roughly bit rate multiplied by duration, and for the same codec and content, a higher bit rate lets the encoder discard less information, which reduces distortion.

Here is the relationship between frame rate, resolution, and bit rate:

Ideally, the clearer and smoother the picture, the better. In practice, however, the choice also has to take into account the processing power of the hardware and the available bandwidth. A high frame rate and a high resolution mean a high bit rate, which requires high bandwidth and powerful hardware for encoding, decoding, and image processing. So the frame rate and resolution should be chosen according to the situation. Frame rate, resolution, and compression all affect the bit rate.
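The relationship can be put into numbers. This minimal sketch assumes 8-bit YUV420 frames (12 bits per pixel on average) for the uncompressed case, and the 5 Mbps target is just an example figure, not a recommendation; the function names are hypothetical.

```python
def raw_bit_rate(width: int, height: int, fps: int, bits_per_pixel: int = 12) -> int:
    """Bits per second of an uncompressed stream (12 bpp = 8-bit YUV420)."""
    return width * height * bits_per_pixel * fps


def file_size_bytes(bit_rate_bps: int, duration_s: int) -> int:
    """Approximate file size: bit rate x duration, converted from bits to bytes."""
    return bit_rate_bps * duration_s // 8


raw = raw_bit_rate(1920, 1080, 30)
print(raw)                               # 746496000 bits/s, about 746 Mbps
print(file_size_bytes(5_000_000, 60))    # 37500000 bytes for one minute at 5 Mbps
print(round(raw / 5_000_000, 1))         # compression ratio needed to hit 5 Mbps
```

Doubling the frame rate or the pixel count doubles the raw bit rate, which is why the encoder has to compress harder, or the bandwidth has to grow, to keep up.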

Software and hardware encoding

In essence there is no difference: both use chips to perform the codec calculations. The differences lie in the underlying interfaces, the instruction sets, and the hardware drivers.

1. Software and hardware encoding

Software encoding: encoding on the CPU.

Hardware encoding: encoding on something other than the CPU, e.g. a graphics card GPU or a dedicated DSP, FPGA, or ASIC chip.

2. Comparison of software and hardware encoding

Software encoding: direct and simple to implement, easy to tune, and easy to upgrade, but it puts a heavy load on the CPU and performs worse than hardware encoding. At low bit rates its quality is usually better than that of hardware encoding.

Hardware encoding: high performance, but at low bit rates its quality is usually lower than that of software encoders, although some products have ported software encoding algorithms to GPU hardware platforms.

3. Differences in the processing pipeline

Hardware decoding, software encoding: read(ffmpeg)->decoder(NVIDIA cuvid)->encoder(ffmpeg)

Software decoding, software encoding: read(ffmpeg)->decoder(ffmpeg)->encoder(ffmpeg)

Software decoding, hardware encoding: read(ffmpeg)->decoder(ffmpeg)->encoder(NVIDIA NVENC)
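The three pipelines above can be expressed as ffmpeg command lines. The sketch below only builds the argument lists and does not run anything. It assumes an H.264 input and an ffmpeg build that includes the `h264_cuvid` decoder, the `h264_nvenc` encoder (NVENC being NVIDIA's hardware encoder), and `libx264`; the function name `ffmpeg_args` and the file names are hypothetical.

```python
from typing import List


def ffmpeg_args(src: str, dst: str, hw_decode: bool = False,
                hw_encode: bool = False) -> List[str]:
    """Build an ffmpeg command for one decode/encode combination (not executed)."""
    args = ["ffmpeg"]
    if hw_decode:
        # A -c:v placed before -i selects the decoder; here NVIDIA's CUVID.
        args += ["-c:v", "h264_cuvid"]
    args += ["-i", src]
    # A -c:v placed after -i selects the encoder: NVENC (GPU) or libx264 (CPU).
    args += ["-c:v", "h264_nvenc" if hw_encode else "libx264"]
    args.append(dst)
    return args


print(" ".join(ffmpeg_args("in.mp4", "out.mp4", hw_decode=True)))   # hardware decode, software encode
print(" ".join(ffmpeg_args("in.mp4", "out.mp4")))                   # software decode, software encode
print(" ".join(ffmpeg_args("in.mp4", "out.mp4", hw_encode=True)))   # software decode, hardware encode
```

Whether the hardware paths actually work depends on the GPU, the driver, and how ffmpeg was compiled.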