The tool, called Nightshade, messes up training data in ways that could cause serious damage to image-generating AI models. It is intended as a way to fight back against AI companies that use artists’ work to train their models without the creator’s permission.
ARTICLE - Technology Review
ARTICLE - Mashable
ARTICLE - Gizmodo
The researchers tested the attack on Stable Diffusion’s latest models and on an AI model they trained themselves from scratch. When they fed Stable Diffusion just 50 poisoned images of dogs and then prompted it to create images of dogs itself, the output started looking weird—creatures with too many limbs and cartoonish faces. With 300 poisoned samples, an attacker can manipulate Stable Diffusion so that prompts for dogs generate images that look like cats.
I finally found the paper.
The attack is not unique to this program; the paper cites several prior works. I haven’t read the cited works, but they seem to operate along the lines of Carlini and Wagner’s adversarial attack, which uses minor perturbations to manipulate a classifier’s output.
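For context, here is a minimal sketch of what that style of attack looks like, assuming a differentiable PyTorch classifier. The loss and hyperparameters are simplified placeholders, not Carlini and Wagner’s exact formulation (they use a margin-based loss and a change of variables).

```python
import torch
import torch.nn.functional as F

def cw_style_perturbation(model, x, target_class, c=1.0, steps=200, lr=0.01):
    """Find a small delta so that model(x + delta) predicts target_class.
    `model` is any differentiable classifier returning logits for a batch
    of images x in [0, 1]; all names and values here are illustrative."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = torch.tensor([target_class], device=x.device)

    for _ in range(steps):
        opt.zero_grad()
        logits = model(torch.clamp(x + delta, 0.0, 1.0))
        # trade-off: keep the perturbation small while pushing the
        # prediction toward the target class (cross-entropy stands in
        # for C&W's margin loss here)
        loss = delta.pow(2).sum() + c * F.cross_entropy(logits, target)
        loss.backward()
        opt.step()

    return torch.clamp(x + delta, 0.0, 1.0).detach()
```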
Here is the method:
Step 3: Constructing poison images {Image_p}. For each text prompt t ∈ {Text_p}, locate its natural image pair x_t in {Image}. Choose an anchor image x_a from {Image_anchor}. Given x_t and x_a, run the optimization of eq. (1) to produce a perturbed version x'_t = x_t + δ, subject to |δ| < p. Like [19], we use LPIPS [96] to bound the perturbation and apply the penalty method [46] to solve the optimization:

min_δ ‖F(x_t + δ) − F(x_a)‖₂² + α · max(LPIPS(δ) − p, 0)    (2)

Next, add the text/image pair t/x'_t into the poison dataset {Text_p/Image_p}, remove x_a from the anchor set, and move to the next text prompt in {Text_p}.
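To make the optimization concrete, here is a rough PyTorch sketch of eq. (2). It assumes `feature_extractor` stands in for the paper’s image feature extractor F (the feature space of the text-to-image model being targeted), uses the `lpips` package as one possible LPIPS implementation, and reads LPIPS(δ) as the perceptual distance between x_t and x_t + δ; the budget p, penalty weight α, step count, and learning rate are placeholder values, not the paper’s settings.

```python
import torch
import lpips  # pip install lpips; one choice of LPIPS implementation

def poison_image(x_t, x_a, feature_extractor, p=0.07, alpha=4.0, steps=500, lr=0.003):
    """Perturb x_t so its features match those of the anchor image x_a,
    while an LPIPS penalty keeps the perturbation under the budget p (eq. 2).
    x_t, x_a: image tensors in [-1, 1], shape [1, 3, H, W]."""
    perceptual = lpips.LPIPS(net='vgg')        # perceptual distance d(x_t, x_t + delta)
    delta = torch.zeros_like(x_t, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    with torch.no_grad():
        target_feat = feature_extractor(x_a)   # F(x_a), fixed target

    for _ in range(steps):
        opt.zero_grad()
        feat = feature_extractor(x_t + delta)  # F(x_t + delta)
        feature_loss = (feat - target_feat).pow(2).sum()
        # penalty term: only active once the perceptual distance between
        # x_t and x_t + delta exceeds the budget p
        overshoot = perceptual(x_t, x_t + delta).mean() - p
        loss = feature_loss + alpha * torch.clamp(overshoot, min=0.0)
        loss.backward()
        opt.step()

    return (x_t + delta).detach()              # poisoned image x'_t
```

The result pairs the original prompt t with an image whose pixels still look like x_t but whose features point at the anchor concept, which is what pulls the model toward A during training.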
Yes; they target a single concept C for poisoning by creating a gradient during training that pulls it toward a separate, specific anchor concept A.
Study: https://arxiv.org/abs/2310.13828v1