FaceFusion and similar tools offer several AI face swap models, each with different architectures, resolutions, optimizations, and performance characteristics. In this deep-dive, we compare nine face-swap models commonly used in FaceFusion (FaceFusion Face Swapper docs):

  • blendswap_256
  • ghost_1_256, ghost_2_256, ghost_3_256
  • inswapper_128 & inswapper_128_fp16
  • simswap_256 & simswap_unofficial_512
  • uniface_256

We’ll explore each model’s architecture and design, input/output resolution, GPU optimization (like FP16 support), memory footprint, inference speed, and the typical visual quality of results. We’ll also note performance trade-offs and scenarios where one model might outperform others – such as motion consistency in video, lighting consistency, facial detail preservation, identity fidelity, or overall image realism.

Table of Contents

  • Overview of Face Swap Architectures
  • Model Profiles
      • BlendSwap 256
      • Ghost 256 Models (v1, v2, v3)
      • InSwapper 128 (FP32/FP16)
      • SimSwap 256 & SimSwap 512 (Unofficial)
      • UniFace 256
  • Performance Comparison

Overview of Face Swap Architectures

Most one-shot face swap models follow a similar two-step pipeline: first, extract the identity of the source face, and second, generate a new face image that transfers that identity onto the target face’s pose and setting. This typically involves a face identity encoder (often an existing face recognition model like ArcFace) and a generator network that blends the identity into the target face region. The swapped face is then composited back into the original image/video frame of the target.

General Face Swap Pipeline: The diagram below illustrates the common data flow for models like SimSwap, InSwapper, and Ghost, which use an identity encoder plus a generator. The source face provides identity features, the target face provides pose/background features, and the generator produces a swapped face that is then blended into the target frame.

flowchart TD
    SourceFace([Source Face]) -->|extract identity| IDEncoder("Identity Encoder<br/>(e.g. ArcFace)")
    TargetFace([Target Face]) -->|extract features| Generator(Face Swap Generator)
    IDEncoder -->|identity code| Generator
    Generator --> SwappedFace["Swapped Face (face region)"]
    SwappedFace -->|blend into original frame| OutputFace[Final Output Face]

Figure: General pipeline for one-shot face swapping. An identity encoder (e.g. ArcFace) provides a source identity embedding, which the generator network uses along with the target face to produce a swapped face region, finally blended into the target image. (FaceFusion Changelog)
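To make this pipeline concrete, here is a minimal sketch in Python using onnxruntime. The preprocessing (normalization ranges, input order and tensor names) is an assumption for illustration – each model expects its own alignment template and scaling, and FaceFusion handles those details internally.

```python
import cv2
import numpy as np
import onnxruntime as ort

def embed_identity(arcface: ort.InferenceSession, source_crop: np.ndarray) -> np.ndarray:
    """Run the identity encoder (e.g. ArcFace) on an aligned source crop."""
    blob = cv2.resize(source_crop, (112, 112)).astype(np.float32)
    blob = (blob / 127.5 - 1.0).transpose(2, 0, 1)[None]            # NCHW, roughly [-1, 1]
    embedding = arcface.run(None, {arcface.get_inputs()[0].name: blob})[0]
    return embedding / np.linalg.norm(embedding)                    # 512-d identity code

def swap_face(swapper: ort.InferenceSession, target_crop: np.ndarray,
              identity: np.ndarray) -> np.ndarray:
    """Feed the aligned target crop plus the identity code to the swap generator."""
    size = swapper.get_inputs()[0].shape[-1]                        # e.g. 128 or 256 (assumes a static shape)
    blob = cv2.resize(target_crop, (size, size)).astype(np.float32)
    blob = (blob / 255.0).transpose(2, 0, 1)[None]
    inputs = {swapper.get_inputs()[0].name: blob,                   # assumed input order: target, identity
              swapper.get_inputs()[1].name: identity.astype(np.float32)}
    output = swapper.run(None, inputs)[0][0].transpose(1, 2, 0)     # back to HWC
    return np.clip(output * 255.0, 0, 255).astype(np.uint8)

# The swapped crop is then warped back onto the original frame and blended
# with a face mask (FaceFusion performs this compositing step internally).
```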

While the overall approach is similar, the implementation details of each model differ significantly. For example, some models use a standard encoder-decoder (U-Net) generator with feature injection, while others use more complex warping or StyleGAN-based architectures. Some rely on half-precision (FP16) to speed up inference or reduce VRAM. Below, we profile each model in detail.

Model Profiles

BlendSwap 256

Architecture: BlendSwap 256 is based on the BlendFace model from ICCV 2023 (mapooon/BlendFace, https://arxiv.org/abs/2307.10854). It combines an AEI-Net-style generator with a novel “BlendFace” identity encoder. The BlendFace encoder is a redesigned face recognition network that mitigates identity embedding biases, enabling more faithful identity transfer (BlendFace, ICCV 2023). Essentially, BlendSwap replaces the standard ArcFace identity embedding with BlendFace’s embedding for improved identity similarity in swaps. The generator itself follows the approach of previous models like FaceShifter’s AEI-Net – an encoder-decoder network that integrates the source identity features into the target face representation.

Input/Output Resolution: As the name suggests, BlendSwap operates on 256×256 pixel face crops. Source and target faces are aligned (source aligned in ArcFace format, target in FFHQ format) before feeding the model (mapooon/BlendFace README). The output is a 256×256 swapped face region that must then be scaled and positioned back onto the original image.
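For illustration, this alignment boils down to a similarity warp from the detected five facial landmarks onto a fixed template. The sketch below uses the widely published ArcFace 112×112 template scaled to 256 and standard OpenCV calls; the FFHQ template used for the target crop has different coordinates, and the exact templates FaceFusion uses may differ.

```python
import cv2
import numpy as np

# Standard ArcFace 5-point template for a 112x112 crop (left eye, right eye,
# nose tip, left mouth corner, right mouth corner).
ARCFACE_112 = np.array([[38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
                        [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def align_face(frame: np.ndarray, landmarks_5: np.ndarray, size: int = 256) -> np.ndarray:
    """Warp the face so its 5 landmarks land on the (scaled) template."""
    template = ARCFACE_112 * (size / 112.0)
    matrix, _ = cv2.estimateAffinePartial2D(landmarks_5.astype(np.float32), template,
                                            method=cv2.LMEDS)
    return cv2.warpAffine(frame, matrix, (size, size), borderValue=0)
```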

Notable Features: By using the BlendFace identity encoder, BlendSwap is particularly strong at preserving source identity fidelity even in challenging scenarios. BlendFace was shown to better recognize a swapped face as the source person in cases where ArcFace would falter (BlendFace, ICCV 2023). The model likely retains the target’s background and accessories via a face mask, blending the generated face with the original context (hence the “blend” in the name). This means lighting and background details from the target are preserved while the inner face is swapped.

Optimization & Performance: The BlendSwap ONNX model is relatively lightweight (~93 MB file), corresponding to on the order of 20–30 million parameters (much smaller than some others). This makes it memory-friendly and quite fast in inference. Even without explicit FP16 weights, it runs efficiently on modern GPUs. It’s a good default choice when both speed and identity accuracy are needed. BlendSwap is a newer model, so it may not have as much community feedback yet, but its design suggests a strong balance of quality and performance.

Visual Quality: BlendSwap 256 typically produces sharp and high-fidelity face swaps. Thanks to the improved identity encoder, the swapped face strongly resembles the source person (avoiding the “mask-like” effect where the result looks like the target wearing makeup of the source). At the same time, the generator architecture (inherited from FaceShifter) helps maintain the target’s facial pose, expression, and lighting naturally. Overall realism is high, with smooth blending at the face boundary. Use cases include high-quality image swaps or video swaps where identity preservation is critical. It may especially outperform others when the source and target have somewhat different appearances, since the identity embedding is robust.

Ghost 256 Models (v1, v2, v3)

Architecture: Ghost 1, 2, 3 (256) are a series of face-swap models introduced in FaceFusion 3.x as the “GHOST” models (FaceFusion Changelog). The name most likely refers to the open-source GHOST (“Generative High-fidelity One Shot Transfer”) face-swap framework rather than to GhostNet-style convolution modules; architecturally they appear to be AEI-Net-style encoder-decoder generators (in the FaceShifter lineage) with progressively larger capacity from v1 to v3. Each Ghost model uses an ArcFace identity embedding (FaceFusion ships an arcface_converter_ghost.onnx to adapt the embedding for the Ghost generator; see the facefusion/models-3.0.0 repository). Thus, the pipeline is similar to SimSwap/Inswapper: ArcFace extracts a 512-d identity code which is injected into the Ghost generator network.

The primary difference between ghost_1_256, ghost_2_256, and ghost_3_256 is model size and capacity. Each higher version has more parameters (and a larger ONNX file: ~340 MB for v1, ~515 MB for v2, ~739 MB for v3; see the facefusion/models-3.0.0 repository) – indicating a deeper or wider network for potentially better quality. All operate at 256×256 resolution.

Input/Output Resolution: Ghost models take aligned face crops at 256×256 resolution and output a swapped 256×256 face region. They rely on ArcFace (usually 112×112 input to ArcFace IR-50 model) for identity, but the target face passed into Ghost is 256×256. Since they are trained on 256px faces, they preserve reasonably fine facial details and textures.

Notable Features: The “Ghost” models were likely developed to provide higher quality face swaps without relying on closed-source models. They likely incorporate skip connections and residual-style identity injection blocks (in the AEI-Net tradition) to better blend the target and generated features, aiming for improved motion and lighting consistency. Because they come in multiple sizes, users can choose ghost_1 for speed or ghost_3 for quality. The inclusion of dedicated ArcFace converters for Ghost (FaceFusion Changelog) suggests careful integration of identity features into the network – the raw ArcFace embedding is remapped into the latent space the Ghost generator expects (possibly via an injection layer or adaptive instance normalization tuned for the ArcFace output), as sketched below.
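As a rough illustration of that converter step, the snippet below chains a raw ArcFace embedding through arcface_converter_ghost.onnx with onnxruntime. The input/output tensor handling and the final normalization are assumptions for illustration, not the actual graph definition.

```python
import numpy as np
import onnxruntime as ort

converter = ort.InferenceSession("arcface_converter_ghost.onnx",
                                 providers=["CPUExecutionProvider"])

def convert_embedding(arcface_embedding: np.ndarray) -> np.ndarray:
    """Remap a 512-d ArcFace embedding into the identity space the Ghost generator expects."""
    embedding = arcface_embedding.reshape(1, -1).astype(np.float32)
    converted = converter.run(None, {converter.get_inputs()[0].name: embedding})[0]
    return converted / np.linalg.norm(converted)   # normalization is an assumption
```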

Optimization & Performance: Ghost_1_256 is the smallest of the trio – it should be faster and lighter on VRAM. Ghost_3_256 is much larger and can be expected to consume significantly more GPU memory and run slower, but with potentially higher-fidelity output. None of the Ghost models are provided in half precision by default (only FP32 ONNX), but advanced users could convert them to FP16 or run them with mixed precision on supporting runtimes. With roughly 85 million parameters implied by its ~340 MB FP32 file, Ghost_1 is still compact enough to approach real-time performance on a modern GPU. Ghost_2 and _3 will be progressively slower; ghost_3 might be unsuitable for live applications but fine for offline processing where quality is paramount.

Visual Quality: Ghost_3_256 aims for the highest visual quality among the Ghosts – it should produce very realistic swaps with detailed facial features (skin texture, hairline, etc.) and strong identity capture, thanks to its capacity. It may handle lighting and pose variations well, since a larger model can learn more complex mappings. Ghost_1_256, while faster, may produce slightly softer or less accurate results under challenging conditions (its smaller network might blur some detail or partially blend identities to avoid artifacts). Ghost_2_256 strikes a balance. These models are well-suited to cases where open-source and higher resolution swaps are needed but InsightFace’s higher-res model isn’t available. In video, Ghost models do not inherently enforce temporal consistency, but a larger model (ghost_3) might produce more stable identity frame-to-frame simply by virtue of being more accurate per frame. They are a solid choice for quality-focused image swaps, or for video if one can afford the processing time.

Note: The Ghost models are relatively new, and early community feedback indicates that while they work, the Inswapper model still had an edge in overall output quality in many cases (“AI Face Swap: Big FaceFusion Update with Many Practical Features!” – 托尼不是塔克 blog), possibly due to Inswapper’s excellent training. Ghost models continue to improve with each version, so this may change as they evolve.

InSwapper 128 (FP32/FP16)

Architecture: InSwapper 128 originates from the InsightFace project as a “one-click” face swap model. It is essentially InsightFace’s implementation of the FaceShifter approach: it uses the ArcFace recognition model (from the InsightFace library’s buffalo_l pack) to get a source identity embedding, and a generator network that maps the target face to the swapped face using that embedding (insightface in_swapper README). The architecture is very similar to that of SimSwap or FaceShifter – an encoder-decoder that integrates identity features (likely via concatenation or AdaIN) to replace the target identity while keeping other attributes. InsightFace did not publish a separate blending network for post-processing in the 128 model, so InSwapper’s generator directly outputs the final face (relying on a face mask to composite it back).

Importantly, InSwapper was trained on 128×128 face images (insightface in_swapper README), which is relatively low resolution by today’s standards. This was a conscious choice to make it lightweight and real-time for applications like live video calls. The model has a high capacity (to preserve identity at that small scale), as evidenced by the large ONNX file (~554 MB FP32), which corresponds to roughly 130 million parameters.

Input/Output Resolution: The model expects 128×128 aligned face crops as input (both source and target faces are aligned and resized). It outputs a swapped face at 128×128. In practice, after obtaining the 128px swapped face, FaceFusion or other tools will upscale or blend it into the original frame. Because 128px limits fine detail, it’s common to follow InSwapper with a face enhancer (like GFPGAN, CodeFormer, or GPEN) to restore detail at higher resolutions (Which is the best realistic faceswapper currently : r/StableDiffusion).

Optimization – FP16 Variant: FaceFusion provides an inswapper_128_fp16 model, which is simply the half-precision version of the FP32 weights. Running in FP16 yields the same resolution output, but uses half the memory and often runs faster due to tensor core acceleration. Indeed, FaceFusion’s default face swapper is inswapper_128_fp16 (FaceFusion Face Swapper docs), meaning out-of-the-box it prioritizes this optimized model. The FP16 model shows negligible quality loss compared to FP32 in practice, but improves speed and allows larger batch or higher throughput on GPUs that support FP16.
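The distributed FP16 file was presumably produced with a conversion along these lines, using the onnxconverter-common package (the local file names here are placeholders; keeping the graph inputs/outputs in FP32 via keep_io_types is optional but avoids changing the calling code):

```python
import onnx
from onnxconverter_common import float16

# Convert the FP32 ONNX weights to half precision (hypothetical local file names).
model_fp32 = onnx.load("inswapper_128.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "inswapper_128_fp16.onnx")
```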

Performance: InSwapper 128 is known for its excellent performance. By keeping the spatial size low, it runs extremely fast – on a modern GPU, it can achieve real-time frame rates (20-30+ FPS at 1280×720 video) with the FP16 model. Its memory usage is modest; even though the model has ~130M parameters, 130M floats at FP16 is only ~260 MB in VRAM, plus some overhead for activations. Users have successfully run InSwapper on consumer GPUs without issue. This performance is one reason why many live face swapping apps and tools (FaceFusion, Rope, Reactor, etc.) rely on InSwapper by default (Which is the best realistic faceswapper currently : r/StableDiffusion).
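Those memory figures follow directly from the parameter count and bytes per weight; a quick back-of-the-envelope check using the rough numbers quoted above:

```python
params = 130e6                        # rough parameter count quoted for InSwapper 128
fp32_weights_mb = params * 4 / 1e6    # 4 bytes per weight -> ~520 MB, in the ballpark of the ~554 MB file
fp16_weights_mb = params * 2 / 1e6    # 2 bytes per weight -> ~260 MB of VRAM for weights alone
print(fp32_weights_mb, fp16_weights_mb)
```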

Visual Quality: Despite the low resolution, InSwapper 128 is often regarded as producing the most convincing and robust swaps overall. One reason is its strong training – InsightFace leveraged their expertise in face recognition to ensure the identity transfer is very accurate. In fact, many users report that alternatives like SimSwap at higher resolution “have much lower output quality than inswapper_128” (Which is the best realistic faceswapper currently : r/StableDiffusion). InSwapper’s outputs have very high identity fidelity (the swapped face looks recognizably like the source) and generally good blending with the target’s features. Because it outputs a 128px face, the result can appear slightly blurry or lacking fine details – but this is usually remedied by the post-processing enhancers. When enhanced with something like GPEN (FaceFusion includes GPEN 1024/2048 models for upscaling (Which is the best realistic faceswapper currently : r/StableDiffusion)), InSwapper swaps can look photorealistic at high resolutions. Another strength is stability in video: since the identity embedding is consistent and the network is simpler, frame-to-frame jitter is minimal and any remaining jitter can be smoothed by FaceFusion’s reference-frame tracking.

Use Cases: InSwapper_128_fp16 is ideal for real-time applications or video processing where speed is crucial but one doesn’t want to sacrifice too much quality. It excels at producing a believable swap with correct expressions and lighting (it inherently keeps the target’s expression/pose, and color-matches reasonably well). It might be outperformed by higher-res models in terms of micro-detail on individual frames, but for consistent, reliable face swaps, InSwapper remains a top choice. Indeed, as one blogger summarized, “from actual use, the best results are still from Inswapper; the others are just there to make up the numbers” (“AI Face Swap: Big FaceFusion Update with Many Practical Features!” – 托尼不是塔克 blog).

SimSwap 256 & SimSwap 512 (Unofficial)

Architecture: SimSwap (256) is an academic model introduced at ACM MM 2020 that allows arbitrary face swapping with one trained model (neuralchen/SimSwap on GitHub). It uses a UNet-like generator with an ID injection module to infuse the source’s identity features into the target face representation (SimSwap: An Efficient Framework For High Fidelity Face Swapping). The identity features come from ArcFace (the authors use an ArcFace model trained on MS1M as the identity encoder). SimSwap’s key innovation was a Weak Feature Matching Loss that encourages the generator to preserve the target image’s non-face details (background, hair, etc.) (SimSwap paper). In effect, SimSwap tries to swap only the identity, not everything else, to improve realism. The architecture is one-stage (no separate blending network); it directly outputs the swapped face composite.

Under the hood, the SimSwap generator takes the target face image and extracts deep features, then merges in a transformed source identity vector (the ArcFace 512-d embedding is mapped and injected into one of the intermediate layers of the generator). This yields an output image of the same size as the input face image, but with altered identity.
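Conceptually, the ID injection modulates intermediate generator features with the identity vector, broadly in the spirit of adaptive instance normalization. The toy PyTorch block below illustrates that idea; it is a simplified sketch, not the authors’ exact module.

```python
import torch
import torch.nn as nn

class IdentityInjection(nn.Module):
    """Toy AdaIN-style block: modulate target features with a 512-d identity embedding."""
    def __init__(self, channels: int, id_dim: int = 512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(id_dim, channels)
        self.to_shift = nn.Linear(id_dim, channels)

    def forward(self, feat: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, H, W) target features; identity: (N, 512) ArcFace embedding
        scale = self.to_scale(identity).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(identity).unsqueeze(-1).unsqueeze(-1)
        return self.norm(feat) * (1 + scale) + shift
```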

Input/Output Resolution: The official SimSwap model was trained on 224×224 images (common for ArcFace input and their pipeline), but FaceFusion’s simswap_256 indicates a model operating at 256×256. It’s likely an updated or retrained version for 256px output (托尼不是塔克 blog). The simswap_unofficial_512 is a community-upscaled variant that can produce 512×512 face outputs. Note that the 512 model is unofficial (perhaps an experimental “SimSwap-HQ” the authors hinted at, or a custom training). According to one developer, SimSwap does have a “512px model out there” (deepinsight/insightface issue #2270) – referring to a beta high-res version. FaceFusion’s inclusion of simswap_unofficial_512 confirms the availability of that model in ONNX form (托尼不是塔克 blog).

Notable Features: SimSwap is designed to be generalizable – you don’t need to retrain it for new pairs. Its identity injection module ensures that the source face’s identity features are embedded while largely keeping the target face’s structure. The weak feature matching loss means that things like the target’s hair, jawline, skin tone, and lighting tend to remain more intact compared to some other swaps. This avoids the “face cutout” look; instead, the swap is blended in appearance. SimSwap often yields the target person looking like themselves but wearing the source’s face. This can sometimes be too conservative – e.g., some original identity may leak through, making the swap less perfect in identity. But it usually helps realism.

Optimization & Performance: The SimSwap 256 model is moderate in size (~220 MB ONNX, roughly ~50–60 million parameters (simswap_256.onnx · netrunner-exe/SimSwap-models at ...)). It does not have a special FP16 version by default, but it can run in FP16 on supporting inference engines. SimSwap 256 is generally fast – not as fast as InSwapper (due to larger image size and network), but still suitable for video frame processing. On a decent GPU, one can expect on the order of ~10 FPS or more for 256px faces. The SimSwap 512 model has a similar parameter count (it may actually reuse the same weights but operate on larger images). Running it at 512px drastically increases computation (4× the pixels of 256px), so it will be proportionally slower and heavier on VRAM. It’s likely only practical on high-memory GPUs (it could easily use >4 GB memory per face). There is anecdotal evidence that SimSwap 512 yields only a few FPS even on powerful cards – so it’s meant for quality, not speed.

Visual Quality: SimSwap’s outputs are high-fidelity in preserving environment and expression. Thanks to the feature matching, the swapped face typically has the same lighting and skin tone as the target, making the composite seamless. It shines in scenarios where the source and target have different lighting or head pose – SimSwap will naturally keep the target’s lighting/pose and just alter the identity. However, SimSwap sometimes is critiqued for not transferring identity strongly enough, especially if the source and target look somewhat similar originally. Users have reported cases where the result “looks like the person is just wearing a lot of makeup” rather than a different person (Trouble generating realistic-looking face swaps. : r/StableDiffusion) – a sign that the identity injection was on the weaker side to preserve details.

When identity difference needs to be maximal (e.g., swapping two very distinct faces), SimSwap might not be as spot-on as Inswapper or BlendSwap in terms of who the face looks like. Yet, it generally produces very realistic faces – often the limiting factor is identity fidelity rather than image quality. The unofficial 512 model can produce finer details (pores, wrinkles, etc.) making results sharper for high-resolution outputs, but it may also suffer more from identity blending unless it was retrained for high-res.

Use Cases: Use SimSwap 256 for quick swaps where visual seamlessness is more important than a perfect identity match. For example, swapping faces in group photos or artwork – SimSwap will ensure the face blends into the scene nicely. It’s also useful when you want to maintain the target’s facial expressions exactly (it’s quite good at that). The 512 model is useful for photographic detail work – e.g., if you plan to print or zoom into the swapped face, that extra resolution helps (just be aware it’s slower). If one finds SimSwap isn’t capturing the identity enough, supplying a very clear source face or using multiple reference images (if the pipeline allows) can help. SimSwap’s paper demonstrated convincing results on both images and video, and it remains a popular general model.

UniFace 256

Architecture: UniFace 256 (not to be confused with unrelated “Uniface” projects) is based on the ECCV 2022 work “Designing One Unified Framework for High-Fidelity Face Reenactment and Swapping” (GitHub - xc-csc101/UniFace). The authors proposed a unified model that can do both face reenactment (driving one face with another’s motion) and face swapping, by disentangling identity and attributes. UniFace’s architecture is quite sophisticated: it employs two encoders – an identity encoder for the source face, and an attribute (pose/expression) encoder for the target – plus a StyleGAN-like generator for decoding the final face.

Key components include an Attribute Transfer (AttrT) module that warps the source identity features according to the target’s pose landmarks, and an Identity Transfer (IdT) module that uses a self-attention mechanism to inject the source identity into the reference (target) features at multiple scales. The generator uses a StyleGAN2-based architecture (with a “stylemap resizer”) to synthesize a high-fidelity face from the fused features. There are skip connections to ensure background and context from the target are preserved (similar to how FaceShifter had a second stage, but here integrated). Essentially, UniFace explicitly handles pose misalignment via feature warping and identity blending via attention.

flowchart TD
    SourceImg(Source Face) --> IdEnc(Identity Encoder)
    TargetImg(Target Face) --> AttrEnc(Attribute Encoder)
    IdEnc --> IdFeat[Identity Feature Maps]
    AttrEnc --> AttrFeat["Attribute (Pose) Features"]
    IdFeat --> FusionGen(StyleGAN-based Generator)
    AttrFeat --> FusionGen
    FusionGen --> SwapOut[High-Fidelity Swapped Face]

Figure: Simplified UniFace architecture. The source face’s identity features and target face’s attribute (pose/expression) features are combined in a StyleGAN-like generator to produce a high-fidelity swapped face. (Based on ECCV 2022 Unified Face Reenactment/Swapping framework)
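The IdT module’s attention-based identity injection can be pictured as a cross-attention in which the target (reference) features query the source identity features. The PyTorch sketch below is a simplified illustration of that mechanism, not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityTransfer(nn.Module):
    """Simplified attention-based identity transfer: target features attend to source identity features."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)   # queries from target (attribute) features
        self.k = nn.Conv2d(channels, channels, 1)   # keys from source identity features
        self.v = nn.Conv2d(channels, channels, 1)   # values from source identity features

    def forward(self, target_feat: torch.Tensor, id_feat: torch.Tensor) -> torch.Tensor:
        n, c, h, w = target_feat.shape
        q = self.q(target_feat).flatten(2).transpose(1, 2)          # (N, HW, C)
        k = self.k(id_feat).flatten(2)                              # (N, C, HW)
        v = self.v(id_feat).flatten(2).transpose(1, 2)              # (N, HW, C)
        attn = F.softmax(q @ k / (c ** 0.5), dim=-1)                # attention over source positions
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return target_feat + out                                    # residual fusion with target features
```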

Input/Output Resolution: The UniFace model was trained on 256×256 face images (CelebA-HQ dataset resized to 256) ([PDF] Designing One Unified Framework for High-Fidelity Face ...). In FaceFusion, uniface_256 corresponds to this model, likely converted to ONNX. Input faces must be cropped/aligned to 256px. The output is a swapped face at 256×256 resolution. Because of the StyleGAN backbone, the model can produce very photorealistic details and could potentially be extended to higher resolutions if the style generator is expanded.

Notable Features: UniFace’s claim to fame is tackling both identity and pose challenges simultaneously. Unlike simpler models that might struggle if the source and target faces have very different angles, UniFace explicitly warps the source features to the target pose before swapping. This leads to more accurate expression and gaze matching. Its identity transfer module ensures minimal identity leakage from the target – it directly matches identity feature maps via attention, so the output identity is very close to the source while still keeping the target’s attributes. In effect, UniFace is like an advanced version of FaceShifter combined with some StyleGAN magic. Because it’s a unified model, it can leverage reenactment data to improve swapping (and vice versa), which might improve motion consistency if it was trained on sequences.

Optimization & Performance: With great power comes heavy computation – UniFace 256’s ONNX model is about ~407 MB (around ~100M parameters) (uniface_256.onnx · netrunner-exe/Insight-Swap-models-onnx at main), which is on par with Ghost or a large GAN. It does not have a published FP16 version, but could possibly run in half precision. The StyleGAN-like generator is computationally intensive, involving many convolutional layers at increasing resolutions. Therefore, UniFace is typically slower than the other models for a single face. It might take significantly longer per frame; it’s more suited to quality-first workflows. VRAM usage is also high, since the model needs to hold multi-scale feature maps (including StyleGAN’s latent maps). Running UniFace on GPU likely requires several GB of memory available per face.

Visual Quality: When it comes to image realism and detail, UniFace 256 is among the best. It was designed for high-fidelity output, meaning it tries to preserve fine details (skin texture, wrinkles, etc.) and avoid blurry results. The generated face often has excellent identity fidelity and attribute preservation – in other words, the person in the output looks exactly like the source, and is doing exactly what the target was doing (same expression, angle) (MimicPC: FaceFusion 3.1.0 overview). The blending into the background is also handled by the model’s skip connections, so hair and backgrounds align well. If any model can handle extreme cases (like very different lighting or a very extreme pose difference) elegantly, UniFace is a good candidate due to its warping module and powerful generator.

For video face swapping, UniFace’s architecture hints at potentially less jitter: since it explicitly separates identity and motion, it might keep identity more stable. However, if the target video has quick head movements, the warping module will adapt each frame’s features independently, so some flicker could still occur (and UniFace does not inherently use temporal smoothing). Still, the overall quality per frame is so high that with minor external smoothing, it could produce professional-grade deepfake videos.

Use Cases: Use UniFace 256 when ultimate quality is needed and you have the compute to spare. For example, producing a short film or VFX shot where a character’s face is swapped – UniFace can deliver very realistic results that may need minimal touch-up. It’s also useful when the source and target have very different appearances or head poses, which might trip up simpler models – UniFace’s advanced feature mapping can handle these. The downside is that it’s not as accessible for casual use due to speed; for a long video, it will be slow. But for specific images or critical scenes, it might be worth it.


Now that we’ve profiled each model, let’s compare them across key technical and performance metrics.

Performance Comparison