Comparing Face Swap Models: BlendSwap, Ghost, InSwapper, SimSwap, and UniFace

FaceFusion and similar tools offer several AI face swap models, each with different architectures, resolutions, optimizations, and performance characteristics. In this deep-dive, we compare nine face-swap models commonly used in FaceFusion (Face Swapper | FaceFusion):

  • blendswap_256
  • ghost_1_256, ghost_2_256, ghost_3_256
  • inswapper_128 & inswapper_128_fp16
  • simswap_256 & simswap_unofficial_512
  • uniface_256

We’ll explore each model’s architecture and design, input/output resolution, GPU optimization (like FP16 support), memory footprint, inference speed, and the typical visual quality of results. We’ll also note performance trade-offs and scenarios where one model might outperform others – such as motion consistency in video, lighting consistency, facial detail preservation, identity fidelity, or overall image realism.

Overview of Face Swap Architectures

Most one-shot face swap models follow a similar two-step pipeline: first, extract the identity of the source face, and second, generate a new face image that transfers that identity onto the target face’s pose and setting. This typically involves a face identity encoder (often an existing face recognition model like ArcFace) and a generator network that blends the identity into the target face region. The swapped face is then composited back into the original image/video frame of the target.

General Face Swap Pipeline: The diagram below illustrates the common data flow for models like SimSwap, InSwapper, and Ghost, which use an identity encoder plus a generator. The source face provides identity features, the target face provides pose/background features, and the generator produces a swapped face that is then blended into the target frame.

```mermaid
flowchart TD
    SourceFace([Source Face]) -->|extract identity| IDEncoder("Identity Encoder<br/>(e.g. ArcFace)")
    TargetFace([Target Face]) -->|extract features| Generator(Face Swap Generator)
    IDEncoder -->|identity code| Generator
    Generator --> SwappedFace["Swapped Face (face region)"]
    SwappedFace -->|blend into original frame| OutputFace[Final Output Face]
```

Figure: General pipeline for one-shot face swapping. An identity encoder (e.g. ArcFace) provides a source identity embedding, which the generator network uses along with the target face to produce a swapped face region, finally blended into the target image. (Changelog | FaceFusion)
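The data flow above can be sketched in a few lines of Python. The encoder and generator below are toy stand-ins (not real networks), used only to make the tensor shapes concrete: a 512-d L2-normalized identity code, 256×256 face crops, and a final paste-back into the frame.

```python
import numpy as np

def identity_encoder(source_face):
    """Stand-in for an ArcFace-style encoder: maps an aligned face
    crop to a 512-d identity embedding (real models use a deep CNN)."""
    emb = source_face.mean(axis=(0, 1))          # (3,) toy feature
    emb = np.tile(emb, 171)[:512]                # pad out to 512 dims
    return emb / (np.linalg.norm(emb) + 1e-8)    # L2-normalized, as ArcFace does

def generator(target_face, identity_emb):
    """Stand-in for the swap generator: consumes the target crop plus
    the identity code and emits a face region of the same size."""
    assert identity_emb.shape == (512,)
    return np.clip(target_face * 0.9 + identity_emb[:3] * 0.1, 0.0, 1.0)

def paste_back(frame, swapped_face, x, y):
    """Composite the generated face region back into the full frame."""
    h, w = swapped_face.shape[:2]
    out = frame.copy()
    out[y:y + h, x:x + w] = swapped_face
    return out

source = np.random.rand(256, 256, 3)   # aligned source crop
target = np.random.rand(256, 256, 3)   # aligned target crop
frame  = np.random.rand(720, 1280, 3)  # original video frame

emb = identity_encoder(source)         # 512-d identity code
face = generator(target, emb)          # 256x256 swapped face region
result = paste_back(frame, face, 100, 50)
```

In the real pipeline the paste-back step also involves warping the crop back through the inverse of the alignment transform and blending with a face mask, rather than a hard rectangular copy.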

While the overall approach is similar, the implementation details of each model differ significantly. For example, some models use a standard encoder-decoder (U-Net) generator with feature injection, while others use more complex warping or StyleGAN-based architectures. Some rely on half-precision (FP16) to speed up inference or reduce VRAM. Below, we profile each model in detail.

Model Profiles

BlendSwap 256

Architecture: BlendSwap 256 is based on the BlendFace model from ICCV 2023 (GitHub - mapooon/BlendFace: [ICCV 2023] BlendFace: Re-designing Identity Encoders for Face-Swapping https://arxiv.org/abs/2307.10854). It combines an AEI-Net-style generator with a novel “BlendFace” identity encoder. The BlendFace encoder is a redesigned face recognition network that mitigates identity-embedding biases, enabling more faithful identity transfer (BlendFace GitHub). Essentially, BlendSwap replaces the standard ArcFace identity embedding with BlendFace’s embedding for improved identity similarity in swaps. The generator itself follows the approach of previous models like FaceShifter’s AEI-Net: an encoder-decoder network that integrates the source identity features into the target face representation.

Input/Output Resolution: As the name suggests, BlendSwap operates on 256×256-pixel face crops. Source and target faces are aligned (the source in ArcFace format, the target in FFHQ format) before being fed to the model (BlendFace GitHub). The output is a 256×256 swapped face region that must then be scaled and positioned back onto the original image.
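Alignment to a fixed template is typically done with a least-squares similarity transform (the Umeyama method) that maps detected five-point landmarks onto canonical template coordinates. The sketch below assumes the standard five-point ArcFace template values from the InsightFace codebase (treat the exact numbers as an assumption) and scales them up for a 256×256 crop; in a real pipeline the resulting 2×3 matrix would be handed to something like cv2.warpAffine.

```python
import numpy as np

# Canonical 5-point landmark template for ArcFace alignment at 112x112
# (eyes, nose tip, mouth corners), scaled here for a 256x256 crop.
ARCFACE_112 = np.array([[38.2946, 51.6963], [73.5318, 51.5014],
                        [56.0252, 71.7366], [41.5493, 92.3655],
                        [70.7299, 92.2041]], dtype=np.float64)
TEMPLATE_256 = ARCFACE_112 * (256.0 / 112.0)

def umeyama(src, dst):
    """Least-squares similarity transform (rotation + uniform scale +
    translation) mapping src points onto dst points. Returns a 2x3
    affine matrix in the layout expected by cv2.warpAffine."""
    src_mean, dst_mean = src.mean(0), dst.mean(0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))  # guard against reflection
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])

# Example: landmarks detected in an arbitrary frame (template scaled/shifted)
detected = TEMPLATE_256 * 1.5 + np.array([40.0, 20.0])
M = umeyama(detected, TEMPLATE_256)
mapped = (M[:, :2] @ detected.T).T + M[:, 2]   # should land back on the template
```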

Notable Features: By using the BlendFace identity encoder, BlendSwap is particularly strong at preserving source identity fidelity even in challenging scenarios. BlendFace was shown to better recognize a swapped face as the source person where ArcFace would falter (BlendFace GitHub). The model likely retains the target’s background and accessories via a face mask, blending the generated face with the original context (hence the “blend” in the name). This means lighting and background details from the target are preserved while the inner face is swapped.

Optimization & Performance: The BlendSwap ONNX model is relatively lightweight (~93 MB file), corresponding to on the order of 20–30 million parameters (much smaller than some others). This makes it memory-friendly and quite fast in inference. Even without explicit FP16 weights, it runs efficiently on modern GPUs. It’s a good default choice when both speed and identity accuracy are needed. BlendSwap is a newer model, so it may not have as much community feedback yet, but its design suggests a strong balance of quality and performance.

Visual Quality: BlendSwap 256 typically produces sharp and high-fidelity face swaps. Thanks to the improved identity encoder, the swapped face strongly resembles the source person (avoiding the “mask-like” effect where the result looks like the target wearing makeup of the source). At the same time, the generator architecture (inherited from FaceShifter) helps maintain the target’s facial pose, expression, and lighting naturally. Overall realism is high, with smooth blending at the face boundary. Use cases include high-quality image swaps or video swaps where identity preservation is critical. It may especially outperform others when the source and target have somewhat different appearances, since the identity embedding is robust.

Ghost 256 Models (v1, v2, v3)

Architecture: Ghost 1, 2, and 3 (256) are a series of custom face-swap models introduced in FaceFusion 3.x as the “GHOST” models (Changelog | FaceFusion). They are likely variants of an encoder-decoder U-Net architecture with progressively larger capacity from v1 to v3. The “Ghost” name hints that they may leverage GhostNet-style convolutional layers (i.e. ghost modules) for efficiency, though details are not formally published. Each Ghost model uses an ArcFace identity embedding (FaceFusion ships an arcface_converter_ghost.onnx to feed the Ghost generator (facefusion/models-3.0.0 at main)). Thus the pipeline is similar to SimSwap/InSwapper: ArcFace extracts a 512-d identity code, which is injected into the Ghost generator network.

The primary difference between ghost_1_256, ghost_2_256, and ghost_3_256 is model size and capacity. Each higher version has more parameters and a larger ONNX file (~340 MB for v1, ~515 MB for v2, ~739 MB for v3 (facefusion/models-3.0.0 at main)), indicating a deeper or wider network for potentially better quality. All operate at 256×256 resolution.
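File size gives a quick, rough handle on capacity: an FP32 ONNX model stores about 4 bytes per weight, so dividing the sizes above by four yields an order-of-magnitude parameter estimate (graph metadata ignored):

```python
# Rough parameter-count estimate from ONNX file size: an FP32 model
# stores ~4 bytes per weight, so params ~= size / 4. Graph metadata
# is negligible for models this large.
BYTES_PER_FP32 = 4

ghost_sizes_mb = {"ghost_1_256": 340, "ghost_2_256": 515, "ghost_3_256": 739}

def approx_params_millions(size_mb, bytes_per_weight=BYTES_PER_FP32):
    return size_mb * 1e6 / bytes_per_weight / 1e6   # millions of parameters

estimates = {name: round(approx_params_millions(mb))
             for name, mb in ghost_sizes_mb.items()}
# roughly 85M (v1), 129M (v2), 185M (v3) weights - order of magnitude only
```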

Input/Output Resolution: Ghost models take aligned face crops at 256×256 resolution and output a swapped 256×256 face region. They rely on ArcFace (usually 112×112 input to ArcFace IR-50 model) for identity, but the target face passed into Ghost is 256×256. Since they are trained on 256px faces, they preserve reasonably fine facial details and textures.

Notable Features: The “Ghost” models were likely developed to provide higher quality face swaps without relying on closed-source models. They might incorporate skip connections and residual blocks (common in U-Net) to better blend the target and generated features, aiming for improved motion and lighting consistency. Because they come in multiple sizes, users can choose ghost_1 for speed or ghost_3 for quality. The inclusion of dedicated ArcFace converters for Ghost (Changelog | FaceFusion) suggests careful integration of identity features into the network (possibly an injection layer or adaptive instance normalization tuned for ArcFace output).
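If the injection is indeed AdaIN-style (an assumption here, since the Ghost architecture is unpublished), it would look roughly like this in NumPy: the target feature map's per-channel statistics are stripped out and replaced with scale/shift parameters predicted from the 512-d identity embedding, here by a single hypothetical linear layer.

```python
import numpy as np

def adain_inject(feat, id_emb, W, b, eps=1e-5):
    """Adaptive instance normalization: remove the target's per-channel
    statistics from the feature map, then re-style it with scale/shift
    parameters predicted (here by one linear layer W, b) from the
    identity embedding. feat has shape (C, H, W)."""
    C = feat.shape[0]
    params = W @ id_emb + b                      # (2C,): C scales, C shifts
    gamma, beta = params[:C], params[C:]
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mu) / (sigma + eps)         # zero-mean, unit-var per channel
    return gamma[:, None, None] * normed + beta[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(2.0, 3.0, size=(64, 32, 32))   # target feature map
id_emb = rng.normal(size=512)                    # ArcFace-style identity code
W = rng.normal(scale=0.01, size=(128, 512))      # predicts 2*64 parameters
b = np.concatenate([np.ones(64), np.zeros(64)])  # near-identity initialization
styled = adain_inject(feat, id_emb, W, b)
```

After this operation the feature map's per-channel statistics are governed by the identity code rather than by the target, which is exactly the effect an injection layer is after.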

Optimization & Performance: Ghost_1_256 is the smallest of the trio – it should be faster and lighter on VRAM. Ghost_3_256 is much larger and can be expected to consume significantly more GPU memory and run slower, but with potentially higher-fidelity output. None of the Ghost models is provided in half precision by default (only FP32 ONNX), but advanced users could convert them to FP16 or run with mixed precision on supporting runtimes. Given Ghost_1’s moderate parameter count (roughly 85M, judging by its file size) and the fact that Ghost models likely exploit ghost convolution modules (which reduce computation by generating “ghost” feature maps cheaply), Ghost_1 can approach real-time performance. Ghost_2 and Ghost_3 will be progressively slower; Ghost_3 might be unsuitable for live applications but fine for offline processing where quality is paramount.

Visual Quality: Ghost_3_256 aims for the highest visual quality among the Ghosts – it should produce very realistic swaps with detailed facial features (skin texture, hairline, etc.) and strong identity capture, thanks to its capacity. It may handle lighting and pose variations well, since a larger model can learn more complex mappings. Ghost_1_256, while faster, may produce slightly softer or less accurate results under challenging conditions (its smaller network might blur some detail or partially blend identities to avoid artifacts). Ghost_2_256 strikes a balance. These models are well-suited to cases where open-source and higher resolution swaps are needed but InsightFace’s higher-res model isn’t available. In video, Ghost models do not inherently enforce temporal consistency, but a larger model (ghost_3) might produce more stable identity frame-to-frame simply by virtue of being more accurate per frame. They are a solid choice for quality-focused image swaps, or for video if one can afford the processing time.

Note: The Ghost models are relatively new, and early community feedback indicates that while they work, the InSwapper model still had an edge in overall output quality in many cases (AI Face Swapping: Major FaceFusion Update with Many Practical Features! – 托尼不是塔克), possibly due to InSwapper’s excellent training. Ghost models continue to improve with each version, so this may change as they evolve.

InSwapper 128 (FP32/FP16)

Architecture: InSwapper 128 originates from the InsightFace project as a “one-click” face swap model. It is essentially InsightFace’s implementation of the FaceShifter approach: it uses the ArcFace recognition model (from the InsightFace library’s buffalo_l pack) to get a source identity embedding, and a generator network that maps the target face to the swapped face using that embedding (insightface/examples/in_swapper/README.md at master · deepinsight/insightface · GitHub). The architecture is very similar to that of SimSwap or FaceShifter – an encoder-decoder that integrates identity features (likely via concatenation or AdaIN) to replace the target identity while keeping other attributes. InsightFace did not publish a separate blending network for post-processing in the 128 model, so InSwapper’s generator directly outputs the final face (relying on a face mask to composite it back).

Importantly, InSwapper was trained on 128×128 face images (insightface/examples/in_swapper/README.md at master · deepinsight/insightface · GitHub), which is relatively low resolution by today’s standards. This was a conscious choice to make it lightweight and real-time for applications like live video calls. The model has a high capacity (to preserve identity at that small scale), as evidenced by the large ONNX file (~554 MB FP32) which corresponds to roughly 130 million parameters.

Input/Output Resolution: The model expects 128×128 aligned face crops as input (both source and target faces are aligned and resized). It outputs a swapped face at 128×128. In practice, after obtaining the 128px swapped face, FaceFusion or other tools will upscale or blend it into the original frame. Because 128px limits fine detail, it’s common to follow InSwapper with a face enhancer (like GFPGAN, CodeFormer, or GPEN) to restore detail at higher resolutions (Which is the best realistic faceswapper currently : r/StableDiffusion).
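The paste-back step can be sketched as a masked alpha blend. The nearest-neighbour upscale and the simple feathered rectangular mask below are illustrative stand-ins: real tools use proper interpolation (or a face enhancer such as GFPGAN) and a segmentation-based face mask.

```python
import numpy as np

def upscale_nearest(img, factor):
    """Nearest-neighbour upscale; real pipelines use bilinear/Lanczos
    interpolation or a face enhancer instead."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def feathered_mask(size, border=16):
    """Soft rectangular mask: 1 in the centre, fading toward 0 over
    `border` pixels at the edges, to hide the blending seam."""
    ramp = np.minimum(np.arange(size) + 1, border) / border
    edge = np.minimum(ramp, ramp[::-1])
    return np.minimum.outer(edge, edge)

def blend_back(target_crop, swapped_128):
    """Upscale the 128px swap to the crop size and alpha-blend it
    over the original target crop."""
    factor = target_crop.shape[0] // swapped_128.shape[0]
    face = upscale_nearest(swapped_128, factor)
    mask = feathered_mask(target_crop.shape[0])[..., None]
    return mask * face + (1 - mask) * target_crop

crop = np.random.rand(256, 256, 3)   # aligned target crop
swap = np.random.rand(128, 128, 3)   # InSwapper-sized output
out = blend_back(crop, swap)
```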

Optimization – FP16 Variant: FaceFusion provides an inswapper_128_fp16 model, which is simply the half-precision version of the FP32 weights. Running in FP16 yields the same resolution output, but uses half the memory and often runs faster due to tensor core acceleration. Indeed, FaceFusion’s default face swapper is inswapper_128_fp16 (Face Swapper | FaceFusion), meaning out-of-the-box it prioritizes this optimized model. The FP16 model shows negligible quality loss compared to FP32 in practice, but improves speed and allows larger batch or higher throughput on GPUs that support FP16.
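The "negligible quality loss" claim is easy to sanity-check numerically: FP16 keeps about 11 bits of mantissa, so casting weights of typical magnitude loses well under 0.1% relative precision while halving storage. The arithmetic below reuses the ~130M-parameter figure from this article.

```python
import numpy as np

# Cast a batch of synthetic FP32 "weights" to FP16 and measure the
# worst-case relative rounding error. Values are kept in [0.01, 1] so
# everything stays in FP16's normal range.
rng = np.random.default_rng(42)
weights_fp32 = rng.uniform(0.01, 1.0, size=100_000).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

rel_err = np.abs(weights_fp16.astype(np.float32) - weights_fp32) / weights_fp32
max_rel_err = rel_err.max()      # bounded by 2**-11 ~ 4.9e-4 for normal values

# Storage for ~130M parameters: 4 bytes each at FP32, 2 at FP16.
mb_fp32 = 130e6 * 4 / 1e6        # 520.0 MB of raw weights
mb_fp16 = 130e6 * 2 / 1e6        # 260.0 MB, matching the ~260 MB VRAM figure
```

Note that FP16's dynamic range is limited (subnormals below ~6e-5, overflow above ~65504), so blanket FP32-to-FP16 conversion of a real network usually needs per-layer checks, which is why a pre-converted inswapper_128_fp16 is convenient.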

Performance: InSwapper 128 is known for its excellent performance. By keeping the spatial size low, it runs extremely fast – on a modern GPU, it can achieve real-time frame rates (20-30+ FPS at 1280×720 video) with the FP16 model. Its memory usage is modest; even though the model has ~130M parameters, 130M floats at FP16 is only ~260 MB in VRAM, plus some overhead for activations. Users have successfully run InSwapper on consumer GPUs without issue. This performance is one reason why many live face swapping apps and tools (FaceFusion, Rope, Reactor, etc.) rely on InSwapper by default (Which is the best realistic faceswapper currently : r/StableDiffusion).

Visual Quality: Despite the low resolution, InSwapper 128 is often regarded as producing the most convincing and robust swaps overall. One reason is its strong training – InsightFace leveraged their expertise in face recognition to ensure the identity transfer is very accurate. In fact, many users report that alternatives like SimSwap at higher resolution “have much lower output quality than inswapper_128” (Which is the best realistic faceswapper currently : r/StableDiffusion). InSwapper’s outputs have very high identity fidelity (the swapped face looks recognizably like the source) and generally good blending with the target’s features. Because it outputs a 128px face, the result can appear slightly blurry or lacking fine details – but this is usually remedied by the post-processing enhancers. When enhanced with something like GPEN (FaceFusion includes GPEN 1024/2048 models for upscaling (Which is the best realistic faceswapper currently : r/StableDiffusion)), InSwapper swaps can look photorealistic at high resolutions. Another strength is stability in video: since the identity embedding is consistent and the network is simpler, frame-to-frame jitter is minimal and any remaining jitter can be smoothed by FaceFusion’s reference-frame tracking.

Use Cases: InSwapper_128_fp16 is ideal for real-time applications or video processing where speed is crucial but one doesn’t want to sacrifice too much quality. It excels at producing a believable swap with correct expressions and lighting (it inherently keeps the target’s expression and pose, and color-matches reasonably well). It may be outperformed by higher-resolution models in terms of micro-detail on individual frames, but for consistent, reliable face swaps, InSwapper remains a top choice. Indeed, as one blogger summarized, “from actual use, the best results are still from Inswapper; the others are just there to make up the numbers” (AI Face Swapping: Major FaceFusion Update with Many Practical Features! – 托尼不是塔克).

SimSwap 256 & SimSwap 512 (Unofficial)

Architecture: SimSwap (256) is an academic model introduced in 2020 (ACM MM 2020) that allows arbitrary face swapping with one trained model (GitHub - neuralchen/SimSwap: An arbitrary face-swapping framework on images and videos with one single trained model!). It uses a UNet-like generator with an ID injection module to infuse the source’s identity features into the target face representation (SimSwap: An Efficient Framework For High Fidelity Face Swapping). The identity features come from ArcFace (the authors use an ArcFace model trained on MS1M as the identity encoder). SimSwap’s key innovation was a Weak Feature Matching Loss that encourages the generator to preserve the target image’s non-face details (background, hair, etc.) (SimSwap: An Efficient Framework For High Fidelity Face Swapping). In effect, SimSwap tries to only swap the identity, not everything else, to improve realism. The architecture is one-stage (no separate blending network); it directly outputs the swapped face composite.
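The Weak Feature Matching Loss can be sketched as an L1 penalty over only the last few discriminator feature layers, so deep attributes (pose, lighting, background) are matched while shallow, identity-bearing detail is left free to change. The feature pyramids below are random stand-ins for discriminator activations, not outputs of a real network.

```python
import numpy as np

def weak_feature_matching_loss(feats_fake, feats_real, last_k=3):
    """SimSwap-style weak feature matching: mean L1 distance between
    the discriminator's features for the swapped output and for the
    original target, computed over only the last `last_k` layers."""
    total = 0.0
    for f_fake, f_real in zip(feats_fake[-last_k:], feats_real[-last_k:]):
        total += np.abs(f_fake - f_real).mean()
    return total / last_k

rng = np.random.default_rng(1)
# Hypothetical 5-layer discriminator feature pyramid (coarse to fine)
feats_real = [rng.normal(size=(8, s, s)) for s in (64, 32, 16, 8, 4)]
# "Generator output" features: the target's features plus small noise
feats_fake = [f + rng.normal(scale=0.1, size=f.shape) for f in feats_real]

loss = weak_feature_matching_loss(feats_fake, feats_real)
```

Restricting the match to the deepest layers is the "weak" part: a full feature matching loss over all layers would also pin down low-level texture, which would fight against replacing the identity.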