Why I Compile My Own Operating System (And What It Taught Me About Enterprise AI)

I am a chartered accountant. I also compile my own operating system from source code and run AI models on an integrated GPU with no dedicated memory.

These two facts are not unrelated.

When I write about enterprise AI strategy — why most projects fail, why audit trails matter, why the model is the least important part of the stack — I am not theorising from articles I have read. I am drawing from systems I have built, broken, and rebuilt. On hardware that most engineers would not consider sufficient.

This is that story.

Part One: Building an Operating System From Source

Gentoo Linux is a source-based distribution. There are no pre-compiled packages. Every component — the kernel, the bootloader, every library, every tool — is compiled from source code on your own machine. A full system build takes hours. Updates require recompilation. Nothing is abstracted away.

I run Gentoo on both my servers. The NAS — which hosts 43 Docker containers, an LDAP directory, NFS and Samba file shares, and a full monitoring stack — runs one Gentoo installation. The AI workstation — a dedicated machine for language model inference — runs another.

Why would anyone choose this? Because it forces you to understand every layer of the system.

When you configure a Linux kernel, you make hundreds of decisions. Which hardware drivers to include. Which filesystems to support. Whether to enable hardware security modules. How to handle memory management. Every choice has consequences, and there is no installer to make them for you.

My kernel configuration enables HSA (Heterogeneous System Architecture) for the AMD GPU, AMDGPU kernel module support, and hardware-accelerated encryption. My USE flags — the mechanism Gentoo uses to control which features are compiled into every package — include ROCm for GPU compute, LTO for link-time optimisation, Secure Boot with module signing, and hardware-specific acceleration libraries like OpenBLAS and XNNPACK.

These are not terms most finance professionals encounter. But they are exactly the terms that determine whether an AI system runs reliably in production or crashes under load. When I evaluate an AI infrastructure proposal, I am not reading the vendor’s marketing material. I am reading the architecture.

Part Two: Running AI on an Integrated GPU

My AI workstation is an AMD Ryzen 7 8700G with 64 gigabytes of RAM. The GPU is an AMD Radeon 780M — an integrated graphics processor that shares system memory with the CPU. It has no dedicated video memory. Not a single megabyte of VRAM that belongs exclusively to the GPU.

Most AI engineers would not attempt to run language models on this hardware. Dedicated GPUs like the NVIDIA A100 have 80 gigabytes of dedicated high-bandwidth memory. My GPU borrows from the same 64 gigabytes that the operating system and every other process is using.

The 8700G was a deliberate purchase. When AMD launched this chip, my thesis was that the VRAM limit on an integrated GPU is artificial. The iGPU shares system RAM with the CPU — the same physical memory, the same bus. The BIOS carves off a small slice and labels it VRAM; the rest is invisible to the GPU. But that boundary is firmware, not physics. If the GPU and CPU are reading from the same silicon, the limit is a software choice, not a hardware constraint.

So I went looking for people who had lifted it. Someone had patched ollama to route GPU memory allocations through GTT — the Graphics Translation Table, which lets the GPU map pages anywhere in host RAM, dynamically, as if they were dedicated VRAM. I ran that patch first. It worked. Then I moved further down the stack and compiled llama.cpp from source with ROCm and GTT allocation enabled. Now the same iGPU that the BIOS claims has 4 GB of VRAM happily serves models sitting in 12 or 14 gigabytes of host memory. The ceiling was never hardware. It was a default.

I run two language model servers on this machine simultaneously. Qwen 2.5, a 3-billion parameter model in full FP16 precision, handles classification and extraction tasks. Phi-4, a larger model quantised to Q5 precision, handles reasoning and text generation. Both served through llama.cpp with ROCm and GTT access — drawing from the full 64 gigabytes, not the BIOS-allocated slice.

This works. It is not fast. But it is reliable, it is entirely on-premise, and it costs nothing beyond the hardware.

The education came from what did not work.

I loaded DeepSeek R1, a 14-billion parameter reasoning model, onto this GPU. The model loaded. Inference began. And then the system became unresponsive. Five-minute timeouts. GPU memory saturated. Every other process on the machine — including the operating system’s display server — starved for memory. I killed the process and learned a lesson that no benchmark paper would have taught me: a model that fits in memory is not the same as a model that runs in production.

That single debugging session taught me more about AI deployment constraints than any conference talk I have attended. The difference between fitting a model and running a model. The difference between peak memory and sustained memory. The difference between a demo that works for one query and a system that serves requests all day.

Part Three: Why This Matters for Enterprise AI

Demos don’t have constraints. Production does.

Enterprise AI will always operate under constraints. Budget constraints — you cannot buy unlimited GPUs. Compliance constraints — data cannot leave the building. Infrastructure constraints — IT must approve the hardware and the security team must approve the architecture. Latency constraints — users will not wait thirty seconds for an answer.

The person who has operated under genuine constraints — who has compiled their own kernel to enable GPU compute, who has quantised models to fit within a memory budget, who has killed a promising model because it destabilised the system — understands these tradeoffs viscerally. Not theoretically. Not from a whitepaper. From the terminal at two in the morning.

This is the gap I see in most enterprise AI discussions. The strategists have never built a system. The engineers have never managed a budget. The executives have seen demos but never production. Someone needs to sit at the intersection of all three.

The Closing Thought

I did not compile my operating system because it was efficient. Gentoo is not efficient. A full rebuild of the AI inference stack takes the better part of a Saturday.

I compiled it because I do not trust what I have not built. I cannot evaluate an AI vendor if I do not understand what a GPU compute pipeline actually does. I cannot assess a model serving architecture if I have never configured one. I cannot advise a board on AI infrastructure costs if I have never run a model on constrained hardware and felt where the bottlenecks are.

The line between a demo and a production system is whether it survives constraints — and whether it can account for how it did. Both of those have to be built.

You do not need a computer science degree to understand systems. You need the willingness to break them and rebuild them yourself. The degree tells you what the textbook says. The rebuild tells you what actually happens.

#AIStrategy #EnterpriseAI #Linux #AIInfrastructure #BuildDontBuy #CFOTech

Part One: Building an Operating System From Source

Part Two: Running AI on an Integrated GPU

Part Three: Why This Matters for Enterprise AI

Leave a ReplyCancel Reply