Run a coding assistant entirely on your own machine and the single number that decides what you can run is VRAM. On a 16GB card like the RTX 5080, you get a hard ceiling: a 27B-class open model at Q4 quantisation fits with a little room to spare, while a 32B coder model needs roughly 20GB and spills past the edge, forcing either much heavier quantisation or a jump to a 32GB GPU. Knowing where that line sits saves you a lot of trial and error.

Quick Answer

On a 16GB RTX 5080, the largest open coding model that runs cleanly is a 27B model at Q4, which takes up most of the VRAM but still leaves room for a usable context window. A 32B coder at Q4 needs around 20GB and overflows, so it either offloads to system RAM and slows to a crawl or demands heavier quantisation that dulls its code quality. The RTX 5080 is stocked at Evetech and priced in Rand.

Why 16GB is the deciding number

A local model has to load its weights into VRAM, and quantisation sets how much space those weights take. Q4 packs each parameter into a little over half a byte once the scaling data stored alongside the weights is counted, so a 27B model lands near 14 to 15GB, leaving the rest for your prompt and the model's working context. The exact figure depends on which Q4 variant you download. Push to a 32B coder at Q4 and you are asking for around 20GB before context, which a 16GB card simply cannot hold. The overflow gets shunted to system memory, and generation speed drops from comfortably interactive to painfully slow. The honest answer is that 27B-at-Q4 is the practical ceiling on this card, not 32B.
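The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate only: the function names, the 0.5GB reserve for context and buffers, and the bits-per-weight values are illustrative assumptions, not measured file sizes. Real quantised files carry extra scaling data and often keep some tensors at higher precision, so the effective bits per weight sits above 4 and varies by variant.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# Every number here is an illustrative assumption, not a measured file size.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 16.0, reserve_gb: float = 0.5) -> bool:
    """True if the weights fit while leaving reserve_gb for context and buffers."""
    return weight_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    for params in (27, 32):
        for bits in (4.0, 4.5):  # nominal Q4 and an assumed effective rate
            size = weight_gb(params, bits)
            print(f"{params}B at {bits} bits/weight: {size:.1f} GB, "
                  f"fits in 16 GB: {fits(params, bits)}")
```

Under these assumptions the 27B model clears the 16GB budget at both rates and the 32B model misses it at both. Raising `reserve_gb` to match a longer context tightens the margin quickly, which is why the 27B case counts as the ceiling rather than a comfortable fit.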

What actually runs well day to day

For pure coding, you have two sensible routes on a 5080. The first is a smaller, denser coder model at higher precision: a 14B coder at Q8 fits inside roughly 15GB and punches well above its size on real code tasks, often beating a larger model that has been quantised too hard. The second is a mixture-of-experts coder, which carries a large total parameter count but activates only a few billion parameters per token. All of its weights still have to be stored, so the total size decides whether it fits in 16GB, but the small active set keeps generation fast and lets it reason like something bigger than its per-token cost suggests. Kept within 16GB, both deliver fast, snappy responses on the 5080's GDDR7 memory rather than the stutter you get from overflow.
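One way to compare the two routes is to separate what has to be stored from what is read for each generated token. The sketch below does that for a dense model and a mixture-of-experts model. The `Model` class and both parameter counts are hypothetical, chosen only to illustrate the shape of the trade-off, and the estimate ignores context memory and runtime overhead.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_b: float   # billions of parameters that must be stored
    active_params_b: float  # billions of parameters used per generated token

    def weight_gb(self, bits_per_weight: float) -> float:
        """Storage needed for all weights, in GB."""
        return self.total_params_b * bits_per_weight / 8

    def gb_read_per_token(self, bits_per_weight: float) -> float:
        """Weight data touched to produce one token, in GB."""
        return self.active_params_b * bits_per_weight / 8


if __name__ == "__main__":
    # Hypothetical models: a dense 14B at 8 bits and a 24B MoE with 3B active at 4 bits.
    candidates = [
        (Model("dense 14B coder", 14, 14), 8.0),
        (Model("hypothetical MoE coder", 24, 3), 4.0),
    ]
    for model, bits in candidates:
        print(f"{model.name}: stores {model.weight_gb(bits):.1f} GB, "
              f"reads {model.gb_read_per_token(bits):.1f} GB per token")
```

The storage figure decides whether a model fits on the card. The per-token figure is a rough proxy for how hard the memory bus works during generation, which is where a mixture-of-experts model gets its speed.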

If you want the most capable single model that still fits, a 27B at Q4 is the top of the range. Go beyond it and you are paying in speed, not gaining in quality.

The SA angle

Running models locally means your code never leaves your machine: no subscription, no upload, no rate limits, and no Rand-to-dollar billing surprise each month. For South African developers that last point matters, because a once-off hardware buy in Rand beats an open-ended foreign-currency API bill. The 5080 sits in the sweet spot for this, capable enough for serious 27B-class work yet far cheaper than the 32GB cards. You can see the full local-AI machine lineup in the AI PC range, and the PC best sellers list is a fast read on which builds South African buyers are pairing these cards with right now.

Frequently Asked Questions

Can the RTX 5080 run a 32B coding model?

Not comfortably. A 32B model at Q4 needs around 20GB, which exceeds the 5080's 16GB. It will run by offloading to system RAM, but speed collapses. For 32B-class work without compromise you want a 32GB GPU.
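To get a feel for how much of a 20GB model ends up in system RAM, the sketch below splits a model's layers between GPU and CPU. It assumes all layers are the same size and reserves 1.5GB of VRAM for context and buffers. The 64-layer count and the reserve are hypothetical values for illustration. Runtimes such as llama.cpp expose the GPU share as a layer-count setting (`--n-gpu-layers`).

```python
import math


def gpu_layer_split(model_gb: float, n_layers: int,
                    vram_gb: float = 16.0, reserve_gb: float = 1.5):
    """Return (layers_on_gpu, layers_in_system_ram), assuming equal-size layers."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, math.floor(budget_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu


if __name__ == "__main__":
    # Hypothetical 64-layer model at 20 GB (the 32B Q4 case) on a 16 GB card.
    on_gpu, on_cpu = gpu_layer_split(model_gb=20.0, n_layers=64)
    print(f"{on_gpu} layers on the GPU, {on_cpu} layers in system RAM")
```

Every token has to pass through the layers left in system RAM, so even a minority of offloaded layers can dominate generation time.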

Is Q4 quantisation good enough for coding?

For most everyday coding it is. Q4 trims memory hard while keeping a 27B model genuinely useful. If you need maximum accuracy on tricky logic, a smaller model at Q8 can be the better trade because it keeps more precision per parameter.
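The precision difference can be illustrated with a toy round trip: quantise a set of values to 4 or 8 bits, convert them back, and measure the error. This is a deliberately simple symmetric scheme with a single scale for the whole list, used here as a stand-in. Real Q4 and Q8 formats quantise in small blocks with their own scales, so actual errors are smaller than this sketch suggests, and error on weights does not translate directly into error on code quality.

```python
import random


def quantise_roundtrip(values, bits):
    """Symmetric uniform quantisation to signed integers, then back to floats."""
    levels = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantise_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


if __name__ == "__main__":
    random.seed(0)
    # Stand-in "weights": random values, not taken from any real model.
    weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
    print(f"4-bit mean error: {mean_abs_error(weights, 4):.4f}")
    print(f"8-bit mean error: {mean_abs_error(weights, 8):.4f}")
```

The 8-bit round trip has far more levels to choose from, so each value lands closer to where it started. That is the sense in which Q8 keeps more precision per parameter.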

Why run a coding model locally at all?

Privacy and cost. Your code stays on your machine, you avoid monthly foreign-currency API charges, and there are no usage caps. The trade-off is the upfront hardware spend, and a 5080 is enough hardware for 27B-class models.

How much context can I fit alongside the model?

On a 16GB card a 27B Q4 model leaves modest room for context, typically a gigabyte or two. A model with a smaller weight file, whether dense or mixture-of-experts, frees up more VRAM for longer prompts, which helps when you feed in larger files or multi-step tasks.
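Context takes VRAM because the model caches a key and a value vector for every layer and every token. A rough estimate follows, using the standard formula for a transformer with grouped key/value heads. The architecture numbers (64 layers, 8 key/value heads, head size 128) and the 16-bit cache are hypothetical, so substitute the figures from your model's configuration. Models that use sliding-window attention or a quantised cache need less than this.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: one key and one value vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def max_context_tokens(free_gb: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Largest token count whose KV cache fits in free_gb of VRAM."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return int(free_gb * 1e9 // bytes_per_token)


if __name__ == "__main__":
    # Hypothetical architecture; check your model's own configuration.
    arch = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    print(f"8192 tokens need {kv_cache_gb(context_tokens=8192, **arch):.2f} GB")
    print(f"1.5 GB free holds about {max_context_tokens(1.5, **arch)} tokens")
```

With these assumed numbers each token costs about a quarter of a megabyte, so a gigabyte or two of spare VRAM goes further than it sounds but still caps how many files you can paste in at once.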

Want a local coding assistant that runs in Rand, not on a foreign subscription? See the AI PC range at Evetech and build around an RTX 5080 for 27B-class models with room to work.