FotoForensics turns 14 years old today. (Wow!) With over 8.5 million unique uploads, including nearly 900,000 just last year, this project is more successful than I ever expected.
While the public FotoForensics service is focused on pictures, my commercial offerings include support for other types of media, including video. As deepfake and AI detection is really hitting critical mass, I focused my attention last year on my video analyzers. And let me be clear: when it comes to forensics, video just sucks. I mean, compared to video, individual JPEG or PNG files are easy to evaluate. Video introduces a lot of additional complexities:
- Quality: Quality determines how well the artifacts are retained. A low quality or viral picture is harder to evaluate than a high quality camera-original photo. But video? A high quality video is usually lower quality than a low quality photo.
- Resaves: With pictures, an original or near-original (a high-quality resave) is sometimes available. Moreover, if you ask for a picture from a customer (e.g., banking or insurance), then the picture had better be high quality; otherwise, the investigator will ask: Where's the original?
With videos, re-encoding or transcoding (converting formats) is the norm, not the exception. I rarely ever see high-quality original videos. (Real cameras rarely record in "high quality", even if the user selects high quality for the output.) When I do see a high-quality "original", it's almost guaranteed to be straight from an AI generating system.
However, the biggest problem comes from the lack of repeatability and determinism.
Warning: This blog is a technical deep dive into FFmpeg and video encoding. For the 99% of people who just want to encode or play a video, these are non-issues. FFmpeg works as designed. These inconsistencies only become important when you're hyper-focused on tiny changes inside large files.
One Small Square
Most video software uses FFmpeg for any kind of video evaluation or forensic analysis. FFmpeg has the benefit of being widely available, and it supports almost any audio or video codec you might encounter.
I have an implementation of my Error Level Analysis (ELA) system for video. ELA depends on re-encoding. If the recompression changes from run to run, then it can impact the ELA results. In one test, a small ELA square lit up (just a little), but when I re-ran the test, the ELA square was gone. At first, I thought it was a fluke in my code. Then I tracked down the problem and found that the inconsistency was built into FFmpeg: FFmpeg is inconsistent by design. The core problem is that FFmpeg has been optimized for speed over consistency.
Non-Repeatable
Beginning with FFmpeg 5.x (January 2022), results from converting videos became non-deterministic. If you run "ffmpeg -i input_444p.mp4 -pix_fmt yuv420p output_420p.mp4" twice, and the conversion involves any transformations (like changing the pixel format from yuv444p to yuv420p, scaling the video size, or just handling a video with large dimensions), then the two output files could be different. The reason is that FFmpeg introduced parallelism for speed.
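You can see this for yourself with a quick two-run test. This is just a sketch (the file names are placeholders, and you need a multi-core system with FFmpeg 5.x or newer):

ffmpeg -y -i input_444p.mp4 -pix_fmt yuv420p run1_420p.mp4
ffmpeg -y -i input_444p.mp4 -pix_fmt yuv420p run2_420p.mp4
md5sum run1_420p.mp4 run2_420p.mp4
# Same input, same command, yet the two hashes may not match.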
With any kind of parallel processing, the workload is divided up and sent to different processing threads. If a single task takes 10 minutes, then dividing it between two worker threads means it can finish in about 5 minutes. The more parallel threads, the faster the work can be completed.
The apparent randomness comes from when the different threads finish. When a task is divided between multiple threads, the final result becomes shuffled and non-deterministic -- there is no repeatability. With video encoding, these small changes can impact the file structure and potentially the encoded pixel values.
Video Encoding
But let's back up a bit and consider how a video is encoded:
- Videos are encoded using one of many known algorithms (codecs). H.264 (MPEG-4 AVC) is usually what we think of when talking about ".mp4" files. However, a newer format, H.265 (HEVC), is gaining traction. (H.265 is more efficient at compressing video compared to H.264, but they share most of the same problems.)
- Regardless of the algorithm, videos consist of a series of frames. The most common frame types are I-frames (key frames that encode the entire frame), P-frames that partially update the frame based on the prior frame's content (P-frames usually require far fewer bytes than I-frames, so they compress really well), and B-frames that interpolate between two frames.
- P-frames include a list of smaller components ("macroblocks" in H.264 terminology, or coding tree units / CTU for H.265). The idea is that most frames don't need to render everything. If you have a video of a newscaster talking, then the background barely moves. Most of the movement is around their face, so the P-frames just encode the changes to the face.
- B-frames are a lot like P-frames, where they only encode changes. However, they interpolate between a previous anchor (previous I, P, or B frame) and a future anchor. (The "B" stands for bidirectional, although newer B-frames are "Bi-predictive" since they can use two past or two future frames.)
A video is often encoded as a series of these frames, such as "IPPPIPPPIPPP..." or "IBBPBBIBBPBBI...". The actual encoding depends on the video format.
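If you want to see a video's actual frame structure, ffprobe (bundled with FFmpeg) can list the frame types. A quick sketch (output formatting varies a little between versions):

ffprobe -v error -select_streams v:0 -show_entries frame=pict_type -of csv input.mp4
# Prints one line per frame, typically something like: frame,I then frame,P then frame,B ...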
Non-Deterministic Details
When encoding a video, the encoder takes a frame and decides what needs to be stored:
- For an I-frame, it's easy: store the entire frame. However, the frame doesn't have to be stored in raster order. Instead, the frame can be segmented and encoded in parallel. The first segment that finishes gets encoded first. (Without parallelism, encoding a 4K or 8K video stream can take seconds per frame.)
- For a P- or B-frame, the encoder needs to compare the current frame against other frames in order to determine which macroblocks to encode. This is where parallelism offers another performance benefit. Rather than using one process to compare the entire frame, the encoder can subdivide the frame between parallel processes. The first process to complete gets encoded first.
- If you have enough parallel threads, then you can even process different frames at the same time. However, this can cause B-frames to generate slightly different output depending on how and when the anchor frame data becomes available.
Any kind of conversion uses slicing to divide up the frames. Each slice gets sent to a different core for parallel processing. The decision for using B-frames or P-frames and the various macroblocks can depend on when each independent core's data becomes ready. This means that the structural layout can be different each time.
Really technical: Even if the overall frame types, like IBBPBBI, are consistent, the encoding of the macroblocks could vary. The biggest problem seems to be in libswscale. Depending on the CPU load and how the OS schedules those threads, a pixel on the border of a slice might round "up" in one run and "down" in another. This changes the bitstream of the raw video before it hits the encoder. A single bit change can have a butterfly effect, impacting the macroblocks used by subsequent P- and B-frames.
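One way to watch the butterfly effect is to hash every decoded frame from two runs of the same conversion. FFmpeg's framemd5 muxer does exactly that; a sketch using the placeholder run1/run2 files from the earlier example:

ffmpeg -i run1_420p.mp4 -f framemd5 run1.framemd5
ffmpeg -i run2_420p.mp4 -f framemd5 run2.framemd5
diff run1.framemd5 run2.framemd5
# When the runs differ, the per-frame hashes typically match up to some frame, then diverge until an I-frame refresh.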
Built-in Randomness
The inconsistent structural layout impacts how the data is encoded. While the visual content will be similar enough for a human, it will be different for a forensic analyzer.
But it gets worse: For processing, FFmpeg 6.x introduced a random number seed to the H.264 and H.265 video encoders. (It was introduced to improve rate-control stability in multi-threaded environments.) This effectively guarantees non-determinism. The FFmpeg developers addressed this in FFmpeg 7.x, where you can add a "deterministic=1" parameter. (But 7.x also added more multi-threaded non-determinism.)
Unfortunately, the inconsistencies don't stop there: If you're using hardware acceleration (e.g., the h264_nvenc driver for NVIDIA, h264_vaapi for Intel and AMD, h264_videotoolbox for Apple's VideoToolbox, h264_amf for AMD/ATI GPUs, etc.), then you will never get deterministic results, even if you use the "deterministic=1" option. This is because hardware encoders are built for speed and have non-deterministic scheduling baked into the silicon itself. Adding to this problem, there can be differences between hardware driver versions, hardware generations, and GPU clock/turbo states. Even changes in the non-GPU system load can cause subtle changes that impact the hardware between runs.
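If you have a hardware encoder available, the non-determinism is easy to check. A sketch, assuming an FFmpeg build with NVIDIA's h264_nvenc (any of the other hardware encoders can be substituted; the file names are placeholders):

ffmpeg -y -i input.mp4 -c:v h264_nvenc nv_run1.mp4
ffmpeg -y -i input.mp4 -c:v h264_nvenc nv_run2.mp4
md5sum nv_run1.mp4 nv_run2.mp4
# The hashes may differ between runs, and no encoder option guarantees that they will match.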
Improving Consistency
For forensic work, you want to minimize the variability and generate repeatable results. The first step toward achieving this is to always use the software encoders (libx264 or libx265). For deterministic, repeatable forensic use, try something like:
ffmpeg -threads 1 -i input.mp4 [parameters] -c:v libx264 -x264-params "deterministic=1" -flags +bitexact output.mp4
This says:
- Only use one thread, regardless of the number of CPU cores that are available.
- Read from the input video (-i input.mp4).
- Use the software H.264 encoder (-c:v libx264). The software encoders try to be deterministic by default.
- Turn off the random seed (deterministic=1). This is the default for most versions of FFmpeg's software encoders. (For testing, use the software encoder, omit -threads 1, and use deterministic=0 to see the non-deterministic output. Or use a hardware encoder.)
- Ensure that the container-level metadata (like the encoder version string and timestamps) doesn't create a mismatch (-flags +bitexact, or the older -bitexact).
- Store the output as "output.mp4".
The good news is, this makes the results deterministic and repeatable. (Perfect for forensic use.) The bad news is that all of the speed enhancements are gone; encoding a video becomes significantly slower.
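To confirm repeatability on your own system, run the deterministic settings twice and compare the hashes. A sketch (placeholder file names; same caveats as the command above):

ffmpeg -threads 1 -i input.mp4 -c:v libx264 -flags +bitexact det_run1.mp4
ffmpeg -threads 1 -i input.mp4 -c:v libx264 -flags +bitexact det_run2.mp4
md5sum det_run1.mp4 det_run2.mp4
# On the same system and FFmpeg build, the two hashes should now match.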
Getting determinism out of H.265 is more difficult. (Among other things, libx265 often ignores "-threads 1" and spawns its own "worker pools".) For consistency, you need to use the software encoder (libx265, not hardware encoding), make it single-threaded, disable wavefront parallel processing (WPP), and disable background worker pools:
ffmpeg -i input.mp4 -c:v libx265 -x265-params "threads=1:pools=none:wpp=0" output.mp4
Of course, H.265 is computationally expensive, so repeatability comes with an extreme speed penalty. You might only be able to encode 1-2 frames per second on a high-end CPU.
However, even this may not really be deterministic. For example, if you encode the same file on an Intel CPU using AVX2 and then on an ARM processor (like an Apple M2), the results may still differ. This is because the floating-point math or SIMD optimizations can vary slightly between CPU architectures. Reproducibility is usually only guaranteed when using the exact same hardware and library version.
To achieve bit-identical results across different hardware architectures (e.g., Intel vs. AMD vs. Apple Silicon), you must eliminate the variation introduced by CPU-specific SIMD optimizations (assembly). With libx265, that means disabling the hardware-specific assembly code:
ffmpeg -i input.mp4 -c:v libx265 -x265-params "threads=1:pools=none:wpp=0:asm=0" output.mp4
Using "asm=0" is the nuclear option. While this is consistent across platforms, you're now talking about serious speed issues.
| Encoding | Performance | Consistency |
|---|---|---|
| Normal hardware and multi-threaded processing | Fast. Often faster than real-time. A 1-minute video can usually be re-encoded in seconds. | Inconsistent. Encoding the same file twice can produce different trace artifacts. |
| Forensic consistency through single-threaded use and software-only libraries | Slow. Small videos may encode near real-time, but 4K/8K videos often encode at half-speed or slower. A 1-minute 4K video may take around 10 minutes. | Consistent on the same hardware platform, but results may differ across CPU architectures. |
| Architecture parity with asm=0 | Extremely slow. On a high-end system, a 4K video (3840×2160) may encode 1 frame every 10 seconds. A 1-minute video could take 4 hours. | Consistent across platforms. Bitstreams should match even across different CPU architectures. |
(Even with asm=0, the encoding will likely be different if you use different library versions.)
As an aside: Figuring this out took time. I spent a half-day identifying the non-determinism, a week to track down the extent of the problem, and another week to figure out all of the parameters needed to make the results deterministic.
Other Options
There are other options for reducing the amount of non-determinism. For example, you can limit the frame types. In particular, B-frames introduce lookahead dependency chains, reordering, and motion estimation that can look both forward and backward.
Excluding B-frames (-x264-params bframes=0) removes one of the biggest causes of nondeterminism, but it doesn't remove all of it.
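As a sketch, disabling B-frames with the software encoder looks like this (the generic FFmpeg option -bf 0 accomplishes the same thing):

ffmpeg -i input.mp4 -c:v libx264 -x264-params "bframes=0" output.mp4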
With P-frames only, the H.264 (software and hardware) encoder uses each thread to search different macroblock regions. The race timing can affect:
- Which reference candidates get tested first
- Which motion vector is accepted
- Early termination behavior in search loops
Small timing differences result in different bitstreams.
The final decoded visual output can differ slightly, even with no B-frames, when the encoder is nondeterministic (multithreaded or hardware-accelerated). The differences are usually very small -- often a handful of pixels or tiny motion-vector variations -- but they can occur. Hardware encoders always produce small run-to-run variations.
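To put numbers on "differ slightly," FFmpeg's psnr filter can compare the decoded frames from two encodes. A sketch using the placeholder run1/run2 files from earlier:

ffmpeg -i run1_420p.mp4 -i run2_420p.mp4 -lavfi "[0:v][1:v]psnr" -f null -
# Check the PSNR summary printed at the end of the log; very high averages (or "inf") mean the decoded frames are nearly or exactly identical.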
Et tu, decoder?
All of this non-determinism impacts the encoder. But what about the video decoder?
The H.264 and H.265 specifications (MPEG/ITU-T) are written so that any compliant decoder must produce the exact same YUV pixel values from a given bitstream. This is vital for the inter-frame dependency chain. If this didn't happen, then a small change in one decoded frame could propagate through the video, causing more and more corruption before the next I-frame refreshes the video output.
The problem comes from hardware decoders. Decoders should be deterministic by standard, but tolerances give hardware manufacturers some wiggle room:
- Hardware calculations: Some hardware decoders use chips with different internal precisions for calculations like an inverse discrete cosine transform (IDCT). While they must stay within a defined tolerance to be compliant, two different GPU architectures might produce a 1-bit difference in a single pixel value.
- Rounding errors: Some optimized decoders use SIMD (Single Instruction, Multiple Data) sets like AVX or NEON to speed up the math. If one implementation uses floating-point math and another uses fixed-point approximation for speed, the final pixel values might vary by a negligible amount (+/- 1 on an 8-bit scale).
- Multithreaded frame vs Slice decoding: FFmpeg's libavcodec can decode frames in parallel. While this is designed to be deterministic, a corrupted bitstream can cause different threading configurations to handle the error concealment differently. This can cause different-looking "glitches" on the screen.
When forensically evaluating a video, you need to first decode it. If the detector is sensitive enough to identify pixel-level differences, then different hardware decoders can cause two analysts to evaluate the file differently.
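One way to check whether the decode path matters is to hash the decoded frames from the software decoder and from a hardware decoder (if your FFmpeg build includes one). A sketch, assuming an NVIDIA build with the h264_cuvid decoder; the -pix_fmt option normalizes the pixel format so the comparison is fair:

ffmpeg -i input.mp4 -pix_fmt yuv420p -f framemd5 sw_decode.framemd5
ffmpeg -c:v h264_cuvid -i input.mp4 -pix_fmt yuv420p -f framemd5 hw_decode.framemd5
diff sw_decode.framemd5 hw_decode.framemd5
# Matching hashes mean the two decoders agree bit-for-bit; any mismatch shows decoder-level differences that pixel-sensitive analyzers may notice.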
The Impact
Typical users just want to view a video. Any nondeterministic variances are very likely to go unnoticed. (The video looks like the video.) Most users watching a video see the same content, the same motion, and the same colors. They will never notice tiny pixel-level variations introduced by any nondeterminism.
But when highlighting trace artifacts, these subtle differences can create big discrepancies. The small differences may appear in different ways:
- A motion vector differing by (1,0) instead of (0,1).
- A DCT coefficient being +1 instead of 0.
- Slight differences in quantization noise.
- Tiny changes in P-frame prediction residuals.
These small differences can dramatically alter forensic trace artifacts, such as:
- Inter-block correlation patterns,
- Quantization staircase signatures,
- Subpixel interpolation residue,
- Compression noise consistency,
- Error Level Analysis (ELA) heat maps,
- High-frequency energy distribution,
- Block boundary artifacts,
- Sensor pattern noise (PRNU) visibility,
- H.264 prediction inconsistencies.
I know it sounds like minutia (because it is!), but these are the exact signals used by deepfake/AI-generation detectors, re-encoding tamper detection, camera model fingerprinting, consistency analysis, scaling and upsampling detection, content-based steganography decoding, invisible watermark detection, and more. Changes to these trace artifacts could be the difference between detecting a video as being AI-generated, digitally altered, unaltered, watermarked, etc. It directly impacts forensics.
For forensic analysis, hardware-encoded H.264 and H.265 should never be used in a pipeline where trace artifacts matter.
As for the analyst, how can you tell what your forensic software is using? Easy: how fast is it? If it can process a 20-second video from an iPhone in under 20 seconds, then it's using hardware acceleration. If it takes 2x to 10x longer to do the analysis, then it's probably single-threaded, software-driven, and deterministic.
Back to Forensics
Every year, I try to add something new to FotoForensics. While I've worked out these consistency issues for video, the performance just isn't fast enough for use on the public site. (If a one-minute video takes 10 minutes to process, then it's not running in real-time. If a hundred users upload videos, then my servers can't realistically process them all within an hour.) I have a few fast analysis algorithms that support video formats (and that don't rely on FFmpeg or hardware acceleration), but all of the video heat-map algorithms, including ELA, are really too slow for public deployment. Right now, only a few high-end commercial customers have access to these video analyzers. If I can work out some of the speed issues, then I might be able to make them more widely available.
Despite the challenges, this research has been some of the most intellectually rewarding work I've done. And like every year, none of this would be possible without help from my mental support group (including One-armed Bill the Donut Assassin, Bob not Bill, Dave and his twelve pianos), my totally technical support group (Marc, G Mark, Richard, Nelle, Wendy, and everyone else), Joe, Joe, Joe, AXT, the Masters and their wandering slaves, Madcat, BeamMeUp, Lou, Loris, and The Boss. I don't know where I'd be without their advice, support, assistance, and feedback. And most importantly, I want to thank the literally millions of people who have used FotoForensics and helped make it what it is today.