---
layout: post
title: "AI rig from scratch II: OS, drivers and stress testing"
date: 2025-06-16 10:00:00 +0100
categories: development
comments: true
---

# AI rig from scratch II: OS, drivers and stress testing

## Introduction: Bringing the beast to life

In the [first part of this series](https://jordifierro.dev/ai-rig-from-scratch-1),
we carefully selected our components and assembled the hardware for my new AI rig.
Now, with the physical build complete, it's time for the crucial next phase:
installing the operating system, ensuring all components are correctly recognized,
setting up the necessary drivers, and most importantly, verifying that our
cooling system can handle intense AI workloads.

Let's get this machine ready to crunch some numbers!

## Step 1: Operating system and initial BIOS configuration

Choosing an OS for a headless AI server is a key decision. I chose **Ubuntu Server**
for several reasons: it's stable, has extensive community support, and is
widely used in the AI/ML world. Its command-line interface is perfect
for a server that will be accessed remotely.

To start, I downloaded the latest ISO from the
[official website](https://ubuntu.com/download/server) and used their
[step-by-step tutorial](https://ubuntu.com/tutorials/install-ubuntu-server#1-overview)
to create a bootable USB drive.

With the USB stick ready, I connected it to the rig along with a monitor,
keyboard, and an Ethernet cable, and hit the power button for the first time.
The installation process was straightforward. I mostly followed the defaults,
with a few key selections:

* I attempted to install third-party drivers, but none were found at this stage.
* I included **Docker** and **OpenSSH** in the initial setup, as I knew I would
  need them later.

Once the installation finished, I removed the USB drive and rebooted.
The system came alive with a fresh OS. The first commands are always the same:

```bash
sudo apt update && sudo apt upgrade
```

Before diving deeper into the software, I rebooted and pressed the DEL key
to enter the BIOS. There were two critical settings to adjust:

**RAM profile**: I enabled the AMD EXPO I profile to ensure my
Patriot Viper Venom RAM was running at its rated speed of 6000 MT/s.

**Fan curve**: I switched the fan settings from "Silent" to "Standard"
to prioritize cooling over absolute silence, which is a sensible
trade-off for a high-performance machine.

After saving the changes and exiting the BIOS, the foundational setup was complete.

## Step 2: Establishing connectivity (Wi-Fi and remote access)

My plan is to place the rig in a convenient spot, which means I'll be relying
on Wi-Fi instead of an Ethernet cable. On a server, setting up Wi-Fi
requires a few manual steps.

First, I confirmed the Wi-Fi driver was loaded correctly by the kernel.

```bash
# First, ensure core network tools are present
sudo apt install wireless-tools

# Check for a wireless interface (e.g., wlan0 or, in my case, wl...)
ip link
lspci -nnk | grep -iA3 network
dmesg | grep -i wifi
```

The output confirmed the `mt7921e` driver for my motherboard's Wi-Fi chip
was active. With the driver in place, I just needed to connect to my network
using `network-manager`.

```bash
# Install network-manager
sudo apt install network-manager

# Scan for available networks
nmcli device wifi list

# Connect to my home network (replace with your SSID and password)
nmcli device wifi connect "Your_SSID" password "your_password"

# Test the connection
ping -c 4 google.com

# Set the connection to start automatically on boot
nmcli connection modify "Your_SSID" connection.autoconnect yes
```
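
As an aside, Ubuntu Server manages networking through netplan by default, so an
alternative to installing `network-manager` is to describe the Wi-Fi network
declaratively. Here's a minimal sketch (the file name and the interface name
`wlp8s0` are placeholders — use the name `ip link` reported — and the networkd
renderer needs `wpasupplicant` installed); apply it with `sudo netplan apply`:

```yaml
# /etc/netplan/50-wifi.yaml (hypothetical file name)
network:
  version: 2
  renderer: networkd
  wifis:
    wlp8s0:
      dhcp4: true
      access-points:
        "Your_SSID":
          password: "your_password"
```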

With the rig now on my local network, I enabled SSH to allow remote connections.

```bash
sudo systemctl enable ssh
sudo systemctl start ssh
```
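
Since the rig is now reachable over the network, it's worth disabling password
logins once key-based access is confirmed to work. A minimal drop-in (a sketch;
the file name is arbitrary, and these are standard `sshd_config` options):

```
# /etc/ssh/sshd_config.d/10-hardening.conf (hypothetical drop-in)
PasswordAuthentication no
PermitRootLogin no
```

Reload with `sudo systemctl reload ssh` to apply — but only after verifying
your SSH key works, or you'll lock yourself out.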

Now I could disconnect the monitor and keyboard and access the rig from my laptop!
To take remote access a step further, I installed Tailscale, a fantastic tool
that creates a secure private network (a VPN) between your devices. After signing
up and following the simple instructions to add my rig and laptop, I could SSH
into my machine from anywhere, not just my local network.

## Step 3: Verifying hardware and thermals

With the OS running, it was time to confirm that all our expensive components
were recognized and running correctly. The BIOS gives a good overview,
but we can double-check from the command line.

```bash
# Check CPU info
lscpu

# Check RAM size
free -h

# List all PCI devices (including the GPU)
lspci -v
```
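
For an at-a-glance summary, the same details can also be pulled straight from
`/proc` (a sketch; `model name` is an x86 field, so this varies by architecture):

```bash
# Compact hardware summary read from /proc (x86 field names assumed)
echo "CPU: $(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2 | sed 's/^ *//')"
echo "RAM: $(awk '/MemTotal/ {printf "%.1f GiB", $2/1024/1024}' /proc/meminfo)"
```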

Everything looked good. Next, I checked the component temperatures at idle
using `lm-sensors`.

```bash
sudo apt install lm-sensors
sensors
```

This revealed an issue. While most temps were fine, one of the SSD sensors
was running hot.

Initial idle temps (before adding the extra fan):

```bash
amdgpu-pci-0d00
Adapter: PCI adapter
vddgfx: 719.00 mV
vddnb: 1.01 V
edge: +48.0°C
PPT: 20.10 W

nvme-pci-0200
Adapter: PCI adapter
Composite: +51.9°C (low = -273.1°C, high = +74.8°C) (crit = +79.8°C)
Sensor 1: +70.8°C (low = -273.1°C, high = +65261.8°C) <-- This is too high for idle!
Sensor 2: +51.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 3: +51.9°C (low = -273.1°C, high = +65261.8°C)

mt7921_phy0-pci-0800
Adapter: PCI adapter
temp1: +44.0°C

k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +50.4°C
Tccd1: +42.4°C
```
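
Scanning this wall of numbers by eye gets old quickly. A small helper of my own
(a sketch, not part of `lm-sensors`) can flag any reading at or above a
threshold; it parses `sensors`-style output on stdin, so live use would be
`sensors | flag_hot 65`:

```bash
# flag_hot THRESHOLD - print lines whose first "+NN.N" reading is >= THRESHOLD
flag_hot() {
  awk -v t="$1" '
    # Find the first "+NN.N" value on the line (skips voltages, which have no +)
    match($0, /\+[0-9]+\.[0-9]+/) {
      temp = substr($0, RSTART + 1, RLENGTH - 1) + 0
      if (temp >= t) print "HOT:", $0
    }'
}

# Demo on a saved snippet of the output above
flag_hot 65 <<'EOF'
Composite: +51.9°C (low = -273.1°C, high = +74.8°C)
Sensor 1: +70.8°C (low = -273.1°C, high = +65261.8°C)
EOF
```

Only the `Sensor 1` line is printed, since it's the only reading at or above 65.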

This is why we test! As mentioned in Part I, I installed an extra
Arctic P12 Slim fan at the bottom of the case to improve airflow over
the motherboard. The results were immediate and significant.

```bash
nvme-pci-0200
Adapter: PCI adapter
Composite: +41.9°C (low = -273.1°C, high = +74.8°C) (crit = +79.8°C)
Sensor 1: +60.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +53.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 3: +41.9°C (low = -273.1°C, high = +65261.8°C)
```

Problem solved. The extra 10€ fan was well worth it for the peace of mind.

## Step 4: Installing the NVIDIA driver

The most critical driver for an AI rig is the NVIDIA driver. I used the
`ppa:graphics-drivers/ppa` repository to get the latest versions.

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
ubuntu-drivers devices
```

The tool recommended the proprietary driver, but I found that the open-source
kernel module version (`-open`) worked best for my setup.

```bash
== /sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0 ==
modalias : pci:v000010DEd00002D04sv00001043sd00008A11bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-570 - third-party non-free recommended
driver   : nvidia-driver-570-open - third-party non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```

To install it and prevent conflicts with the default `nouveau` driver,
I ran the following:

```bash
# Install the open-source variant of the driver
sudo apt install nvidia-driver-570-open

# Blacklist the default nouveau driver
sudo bash -c 'echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist-nouveau.conf'

# Update the initial RAM file system and reboot
sudo update-initramfs -u
sudo reboot
```

After the reboot, running `nvidia-smi` confirmed the driver was loaded
and the GPU was ready!
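
Beyond the default dashboard, `nvidia-smi` also supports machine-readable
queries (`--query-gpu` and `--format` are standard flags of the tool), which
come in handy for scripted monitoring. A small sketch, guarded so it degrades
gracefully on a machine without the driver:

```bash
# Query name, temperature and utilization as CSV; fall back if no driver
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu --format=csv,noheader
else
  echo "nvidia-smi not available"
fi
```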

## Step 5: Putting the rig to the test (stress testing)

With everything installed, it was time for the moment of truth. Can the system
remain stable and cool under heavy, sustained load? I conducted three separate
stress tests, monitoring temperatures in a separate SSH window using
`watch sensors` and `watch nvidia-smi`.

### CPU stress test

First, I used `stress-ng` to max out all 8 CPU cores for 5 minutes.

```bash
sudo apt install stress-ng
stress-ng --cpu 8 --timeout 300s
```

**Result**: The CPU temperature peaked at **73.4°C**. This is a great result,
showing the AIO cooler is more than capable of handling the Ryzen 7 7700
at full tilt.

```bash
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +73.4°C
```

### SSD stress test

Next, I used `fio` to simulate a heavy random write workload on the NVMe SSD
for 1 minute.

```bash
sudo apt install fio
fio --name=nvme_stress_test --ioengine=libaio --rw=randwrite --bs=4k --size=1G --numjobs=4 --time_based --runtime=60 --group_reporting
```

**Result**: The notorious "Sensor 1" heated up to **89.8°C**. While high,
this is a worst-case scenario, and the drive's critical temperature is even higher.
The overall `Composite` temperature remained at a healthy **56.9°C**. For my
use case, this is perfectly acceptable.

```bash
nvme-pci-0200
Adapter: PCI adapter
Composite: +56.9°C (crit = +79.8°C)
Sensor 1: +89.8°C
```

### GPU stress test

Finally, the main event. I used `gpu-burn` inside a Docker container to push
the RTX 5060 Ti to its absolute limit. First, I had to set up the
NVIDIA Container Toolkit.

```bash
# Set up the NVIDIA Container Toolkit
distribution=ubuntu22.04 # Workaround for 24.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker
```

With Docker ready, I cloned the `gpu-burn` repository and ran the test.

```bash
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker build -t gpu_burn .
docker run --rm --gpus all gpu_burn
```

**Result**: Success! The GPU temperature steadily climbed but stabilized
at a maximum of **72°C** while running at 100% load, processing nearly
5000 Gflop/s. The test completed with zero errors.

```bash
100.0%  proc'd: 260 (4880 Gflop/s)  errors: 0  temps: 72 C
...
Tested 1 GPUs:
  GPU 0: OK
```

## Conclusion: We are ready for AI!

The rig is alive, stable, and cool. We've successfully installed and configured
the operating system, established remote connectivity, verified all our hardware,
and pushed every core component to its limit to ensure it can handle the heat.

The system passed all tests with flying colors, proving that our component choices
and cooling setup were effective. Now that we have a solid and reliable foundation,
the real fun can begin. In the next post, we'll finally start using this machine
for its intended purpose: **running and training AI models**. Stay tuned!