• ylai@lemmy.mlOP
    10 months ago

    The novel part of this project is its use of GGML quantization from llama.cpp for Stable Diffusion. This can give lower RAM usage and faster CPU inference than all previous CPU implementations, which lacked low-bit quantization, the same technique that was known to make LLaMA inference feasible on CPUs and with little RAM.
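To make "low-bit quantization" concrete, here is a minimal sketch of block-wise 4-bit quantization in pure Python. It is loosely modelled on the layout of GGML's Q4_0 (32 weights per block, one fp16 scale, two 4-bit values per byte, so 18 bytes per block), but the rounding rule is a simplified symmetric one and not GGML's exact code. The function names are made up for illustration.

```python
import struct

BLOCK = 32  # weights per block (same block size as Q4_0)

def quantize_block(weights):
    """Pack 32 floats into 18 bytes: one fp16 scale + 32 four-bit integers.

    Simplified symmetric scheme for illustration, not GGML's exact rounding.
    Assumes the scale fits in fp16 (it raises OverflowError otherwise).
    """
    assert len(weights) == BLOCK
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Round the scale to fp16 first, so quantization uses the stored value.
    (scale,) = struct.unpack("<e", struct.pack("<e", scale))
    if scale == 0.0:  # scale underflowed in fp16; all weights become 0
        scale = 1.0
    # Signed 4-bit range is [-8, 7]; store with an offset of 8 as 0..15.
    q = [max(-8, min(7, round(w / scale))) + 8 for w in weights]
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, BLOCK, 2))
    return struct.pack("<e", scale) + packed

def dequantize_block(blob):
    """Inverse of quantize_block: 18 bytes back to 32 approximate floats."""
    (scale,) = struct.unpack("<e", blob[:2])
    out = []
    for byte in blob[2:]:
        out.append(((byte & 0x0F) - 8) * scale)  # low nibble
        out.append(((byte >> 4) - 8) * scale)    # high nibble
    return out
```

Each block costs 18 bytes for 32 weights, i.e. 4.5 bits per weight instead of 32 for float32, at the price of a rounding error of at most about half the block's scale per weight.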

    The important long-term implication is that people have been targeting the wrong size of Stable Diffusion model if the goal is quality on commodity hardware (this includes GPUs, too). For example, Stable Diffusion, which Stability AI has gloated so much about fitting on commodity hardware, has slightly fewer than 1 billion parameters. The smallest LLaMA that people can now happily run on a commodity GPU or CPU already has 7 billion parameters. And even OpenAI’s DALL·E 2, which many called prohibitive because “you need a 48 GB GPU” (which is not true with quantization), has just 3.5 billion parameters.
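The size argument is just arithmetic, so here is a rough back-of-the-envelope sketch. It counts weight storage only (no activations, no runtime overhead), uses the parameter counts quoted above (with Stable Diffusion rounded up to 1 billion), and assumes 4.5 bits per weight for a Q4_0-style 4-bit format (4-bit values plus one fp16 scale per 32 weights). The helper name is hypothetical.

```python
def weight_gib(params, bits_per_weight):
    """Weight storage in GiB: parameters * bits per weight, converted to bytes."""
    return params * bits_per_weight / 8 / 2**30

# Parameter counts as quoted in the comment (Stable Diffusion rounded up).
models = {
    "Stable Diffusion": 1.0e9,
    "DALL-E 2": 3.5e9,
    "LLaMA 7B": 7.0e9,
}

# Bits per weight; 4.5 is an assumed effective rate for a Q4_0-style format.
formats = {"fp32": 32, "fp16": 16, "4-bit": 4.5}

for name, params in models.items():
    sizes = ", ".join(
        f"{fmt}: {weight_gib(params, bits):.2f} GiB" for fmt, bits in formats.items()
    )
    print(f"{name:17s} {sizes}")
```

Under these assumptions, a 3.5 billion parameter model at 4-bit needs less weight memory than a 1 billion parameter model at fp32, which is the point about targeting the wrong model size.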

    For additional context, Stable Diffusion on CPU has been done before, though with repurposed frameworks rather than a custom C++ project. Notably, there is the Q-Diffusion paper (https://github.com/Xiuyu-Li/q-diffusion), but its results were obtained by simulating the quantization, and the GitHub repo, for example, does not actually offer an implementation with a real speed-up.
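To illustrate what "simulating the quantization" means in general, here is a generic sketch of so-called fake quantization. This is not Q-Diffusion's actual method or code, just the common pattern: weights are snapped to a low-bit grid so the quality impact can be measured, but they are handed back as ordinary floats, so storage and arithmetic stay exactly as before.

```python
def fake_quantize(weights, bits=4):
    """Snap weights to a 2**bits-level grid, but return ordinary floats.

    Generic illustration of simulated quantization; the values lose
    precision, yet nothing is packed, so memory use and speed do not change.
    """
    qmax = 2 ** (bits - 1) - 1           # 7 for 4 bits
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    return [
        max(-qmax - 1, min(qmax, round(w / scale))) * scale
        for w in weights
    ]

weights = [0.013 * i - 0.2 for i in range(32)]
simulated = fake_quantize(weights)

# At most 16 distinct values survive, but each is still a full-width float.
print(len(set(simulated)), "distinct values,", type(simulated[0]).__name__)
```

Getting the RAM and speed benefit requires actually storing the packed integers and running kernels that operate on them, which is the part GGML provides.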