
Commit acbc5b7

Jordi Fierro authored and committed
Post ai rig 2nd part
1 parent e94cfff commit acbc5b7

File tree: 6 files changed, +330 -3 lines changed

_posts/2025-06-15-ai-rig-from-scratch-1.md

Lines changed: 5 additions & 3 deletions
@@ -149,16 +149,18 @@ Tadaaaaa!
 ![AI rig from scratch](/assets/images/rig_finished_1.jpg)
 
 All that was left was to connect the main power supply, attach the WiFi antennas,
-and plug in a temporary screen and keyboard to install Ubuntu Server.
+and plug in a temporary screen and keyboard to
+start [software installation](https://jordifierro.dev/ai-rig-from-scratch-2).
 
 ### Bonus: An extra fan for peace of mind
 
-After installing the OS and running some thermal tests (more on that in the next post!),
+After installing the OS and running some
+[thermal tests](https://jordifierro.dev/ai-rig-from-scratch-2)
 I noticed that one of the SSD sensors was reporting high temperatures.
 To improve airflow, I decided to add another slim fan to the bottom of the case.
 
 The magnetic dust filter on the bottom made this incredibly easy.
-The Arctic P12 Slim fan even came with a Y-splitter cable, making the connection straightforward.
+The **Arctic P12 Slim** fan even came with a Y-splitter cable, making the connection straightforward.
 I did have to briefly remove the GPU to access the fan header, but it was no big deal.
 
 ![Second bottom fan](/assets/images/rig_second_fan_1.jpg)

Lines changed: 325 additions & 0 deletions
@@ -0,0 +1,325 @@
---
layout: post
title: "AI rig from scratch II: OS, drivers and stress testing"
date: 2025-06-16 10:00:00 +0100
categories: development
comments: true
---

# AI rig from scratch II: OS, drivers and stress testing

![AI rig back panel](/assets/images/rig_back_detail.png)

## Introduction: Bringing the beast to life

In the [first part of this series](https://jordifierro.dev/ai-rig-from-scratch-1),
we carefully selected our components and assembled the hardware for my new AI rig.
Now, with the physical build complete, it's time for the crucial next phase:
installing the operating system, ensuring all components are correctly recognized,
setting up the necessary drivers, and most importantly, verifying that our
cooling system can handle intense AI workloads.

Let's get this machine ready to crunch some numbers!

## Step 1: Operating system and initial BIOS configuration

Choosing an OS for a headless AI server is a key decision. I chose **Ubuntu Server**
for several reasons: it's stable, has extensive community support, and is
widely used in the AI/ML world. Its command-line interface is perfect
for a server that will be accessed remotely.

To start, I downloaded the latest ISO from the
[official website](https://ubuntu.com/download/server) and used their
[step-by-step tutorial](https://ubuntu.com/tutorials/install-ubuntu-server#1-overview)
to create a bootable USB drive.

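If you prefer doing this from a terminal on another Linux machine, writing the ISO with `dd` works just as well. A minimal sketch, assuming the downloaded ISO name and that the stick shows up as `/dev/sdX` (both placeholders, check with `lsblk` first):

```bash
# Identify the USB stick first (the device name below is a placeholder)
lsblk

# Write the ISO to the stick; this wipes /dev/sdX completely
sudo dd if=ubuntu-24.04-live-server-amd64.iso of=/dev/sdX bs=4M status=progress conv=fsync
```
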
With the USB stick ready, I connected it to the rig along with a monitor,
keyboard, and an Ethernet cable, and hit the power button for the first time.
The installation process was straightforward. I mostly followed the defaults,
with a few key selections:

* I attempted to install third-party drivers, but none were found at this stage.
* I included **Docker** and **OpenSSH** in the initial setup, as I knew I would
need them later.

Once the installation finished, I removed the USB drive and rebooted.
The system came alive with a fresh OS. The first commands are always the same:

```bash
sudo apt update && sudo apt upgrade
```

Before diving deeper into the software, I rebooted and pressed the DEL key
to enter the BIOS. There were two critical settings to adjust:

**RAM profile**: I enabled the AMD EXPO I profile to ensure my
Patriot Viper Venom RAM was running at its rated speed of 6000MHz.

![AMD EXPO ram profile](/assets/images/rig_bios_1.jpg)

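Back in the OS, it's easy to confirm the profile actually took effect; `dmidecode` reports the configured speed for each DIMM (shown in MT/s):

```bash
# Show rated and configured memory speed for each DIMM
sudo dmidecode --type memory | grep -i "speed"
```
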
**Fan curve**: I switched the fan settings from "Silent" to "Standard"
to prioritize cooling over absolute silence, which is a sensible
trade-off for a high-performance machine.

![Fan settings](/assets/images/rig_bios_2.jpg)

After saving the changes and exiting the BIOS, the foundational setup was complete.

![Save changes and exit](/assets/images/rig_bios_3.jpg)

## Step 2: Establishing connectivity (Wi-Fi and remote access)

My plan is to place the rig in a convenient spot, which means I'll be relying
on Wi-Fi instead of an Ethernet cable. On a server, setting up Wi-Fi
requires a few manual steps.

First, I confirmed the Wi-Fi driver was loaded correctly by the kernel.

```bash
# First, ensure core network tools are present
sudo apt install wireless-tools

# Check for a wireless interface (e.g., wlan0 or, in my case, wl...)
ip link
lspci -nnk | grep -iA3 network
dmesg | grep -i wifi
```

The output confirmed the `mt7921e` driver for my motherboard's Wi-Fi chip
was active. With the driver in place, I just needed to connect to my network
using `network-manager`.

```bash
# Install network-manager
sudo apt install network-manager

# Scan for available networks
nmcli device wifi list

# Connect to my home network (replace with your SSID and password)
nmcli device wifi connect "Your_SSID" password "your_password"

# Test the connection
ping -c 4 google.com

# Set the connection to start automatically on boot
nmcli connection modify "Your_SSID" connection.autoconnect yes
```

With the rig now on my local network, I enabled SSH to allow remote connections.

```bash
sudo systemctl enable ssh
sudo systemctl start ssh
```

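From the laptop, connecting over the LAN is then a one-liner; the username and IP below are placeholders, and `ssh-copy-id` is optional but saves typing the password every time:

```bash
# Copy your public key to the rig (optional, avoids password prompts)
ssh-copy-id user@192.168.1.50

# Open a shell on the rig
ssh user@192.168.1.50
```
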
Now I could disconnect the monitor and keyboard and access the rig from my laptop!
To take remote access a step further, I installed Tailscale, a fantastic tool
that creates a secure private network (a VPN) between your devices. After signing
up and following the simple instructions to add my rig and laptop, I could SSH
into my machine from anywhere, not just my local network.

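For reference, the Tailscale side of things boils down to a few commands per machine (these follow their quickstart; the install script URL is the official one):

```bash
# Install Tailscale using the official install script
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate this machine and join your tailnet
sudo tailscale up

# Print the Tailscale IP to SSH into from anywhere
tailscale ip -4
```
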
## Step 3: Verifying hardware and thermals

With the OS running, it was time to confirm that all our expensive components
were recognized and running correctly. The BIOS gives a good overview,
but we can double-check from the command line.

```bash
# Check CPU info
lscpu

# Check RAM size
free -h

# List all PCI devices (including the GPU)
lspci -v
```

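For the NVMe drive specifically, `nvme-cli` can read the drive's own SMART data, temperature included; a quick check, assuming the drive is `/dev/nvme0`:

```bash
# Install the NVMe management tool
sudo apt install nvme-cli

# Read SMART data (temperature, wear, error counts) from the drive
sudo nvme smart-log /dev/nvme0
```
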
Everything looked good. Next, I checked the component temperatures at idle
using `lm-sensors`.

```bash
sudo apt install lm-sensors
sensors
```

This revealed an issue. While most temps were fine, one of the SSD sensors
was running hot.

Initial idle temps (before adding the extra fan):

```bash
amdgpu-pci-0d00
Adapter: PCI adapter
vddgfx: 719.00 mV
vddnb: 1.01 V
edge: +48.0°C
PPT: 20.10 W

nvme-pci-0200
Adapter: PCI adapter
Composite: +51.9°C (low = -273.1°C, high = +74.8°C) (crit = +79.8°C)
Sensor 1: +70.8°C (low = -273.1°C, high = +65261.8°C) <-- This is too high for idle!
Sensor 2: +51.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 3: +51.9°C (low = -273.1°C, high = +65261.8°C)

mt7921_phy0-pci-0800
Adapter: PCI adapter
temp1: +44.0°C

k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +50.4°C
Tccd1: +42.4°C
```

This is why we test! As mentioned in Part I, I installed an extra
Arctic P12 Slim fan at the bottom of the case to improve airflow over
the motherboard. The results were immediate and significant.

```bash
nvme-pci-0200
Adapter: PCI adapter
Composite: +41.9°C (low = -273.1°C, high = +74.8°C) (crit = +79.8°C)
Sensor 1: +60.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +53.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 3: +41.9°C (low = -273.1°C, high = +65261.8°C)
```

Problem solved. The extra 10€ fan was well worth it for the peace of mind.

## Step 4: Installing the NVIDIA driver

The most critical driver for an AI rig is the NVIDIA driver. I used the
`ppa:graphics-drivers/ppa` repository to get the latest versions.

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
ubuntu-drivers devices
```

The tool recommended the proprietary driver, but I found that the open-source
kernel module version (`-open`) worked best for my setup.

```bash
== /sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0 ==
modalias : pci:v000010DEd00002D04sv00001043sd00008A11bc03sc00i00
vendor : NVIDIA Corporation
driver : nvidia-driver-570 - third-party non-free recommended
driver : nvidia-driver-570-open - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
```

To install it and prevent conflicts with the default `nouveau` driver,
I ran the following:

```bash
# Install the open-source variant of the driver
sudo apt install nvidia-driver-570-open

# Blacklist the default nouveau driver
sudo bash -c 'echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist-nouveau.conf'

# Update the initial RAM file system and reboot
sudo update-initramfs -u
sudo reboot
```

After the reboot, running `nvidia-smi` confirmed the driver was loaded
and the GPU was ready!

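Two quick sanity checks are worth running at this point: `nouveau` should no longer be loaded, and the new driver should answer queries.

```bash
# nouveau should produce no output anymore
lsmod | grep -i nouveau

# Query the GPU through the new driver
nvidia-smi --query-gpu=name,driver_version --format=csv
```
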
## Step 5: Putting the rig to the test (stress testing)

With everything installed, it was time for the moment of truth. Can the system
remain stable and cool under heavy, sustained load? I conducted three separate
stress tests, monitoring temperatures in a separate SSH window using
`watch sensors` and `watch nvidia-smi`.

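The `watch` interval and the GPU fields can be made explicit, for example:

```bash
# Refresh all temperature sensors every 2 seconds
watch -n 2 sensors

# Print GPU temperature, utilization and power draw every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw --format=csv -l 5
```
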
### CPU stress test

First, I used `stress-ng` to max out all 8 CPU cores for 5 minutes.

```bash
sudo apt install stress-ng
stress-ng --cpu 8 --timeout 300s
```

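If you also want a throughput figure next to the thermals, `stress-ng` can print a short summary at the end of the run with one extra flag:

```bash
# Same 5-minute load, plus a bogo-ops summary when it finishes
stress-ng --cpu 8 --timeout 300s --metrics-brief
```
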
**Result**: The CPU temperature peaked at **73.4°C**. This is a great result,
showing the AIO cooler is more than capable of handling the Ryzen 7 7700
at full tilt.

```bash
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +73.4°C
```

### SSD stress test

Next, I used `fio` to simulate a heavy random write workload on the NVMe SSD
for 1 minute.

```bash
sudo apt install fio
fio --name=nvme_stress_test --ioengine=libaio --rw=randwrite --bs=4k --size=1G --numjobs=4 --time_based --runtime=60 --group_reporting
```

**Result**: The notorious "Sensor 1" heated up to **89.8°C**. While high,
this is a worst-case scenario, and the drive's critical temperature is even higher.
The overall `Composite` temperature remained at a healthy **56.9°C**. For my
use case, this is perfectly acceptable.

```bash
nvme-pci-0200
Adapter: PCI adapter
Composite: +56.9°C (crit = +79.8°C)
Sensor 1: +89.8°C
```

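For completeness, a read-oriented variant of the same test looks like this; `--direct=1` bypasses the page cache so the drive itself is exercised:

```bash
# Random read variant of the same test, bypassing the page cache
fio --name=nvme_read_test --ioengine=libaio --rw=randread --bs=4k --size=1G \
    --numjobs=4 --time_based --runtime=60 --group_reporting --direct=1
```
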
### GPU stress test

Finally, the main event. I used `gpu-burn` inside a Docker container to push
the RTX 5060 Ti to its absolute limit. First, I had to set up the
NVIDIA Container Toolkit.

```bash
# Set up the NVIDIA Container Toolkit
distribution=ubuntu22.04 # Workaround for 24.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker
```

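Before the real workload, it's worth confirming that containers can actually see the GPU; a minimal check (the CUDA base image tag here is just an example):

```bash
# This should print the same nvidia-smi table as on the host
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
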
With Docker ready, I cloned the `gpu-burn` repository and ran the test.

```bash
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker build -t gpu_burn .
docker run --rm --gpus all gpu_burn
```

**Result**: Success! The GPU temperature steadily climbed but stabilized
at a maximum of **72°C** while running at 100% load, processing nearly
5000 Gflop/s. The test completed with zero errors.

```bash
100.0% proc'd: 260 (4880 Gflop/s) errors: 0 temps: 72 C
...
Tested 1 GPUs:
GPU 0: OK
```

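By default this is a fairly short burn; if the image keeps the upstream Dockerfile's working directory and compiled binary, a longer run can be requested by passing a duration in seconds as the command, something like:

```bash
# Hypothetical longer run: burn for 10 minutes instead of the default
docker run --rm --gpus all gpu_burn ./gpu_burn 600
```
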
## Conclusion: We are ready for AI!

The rig is alive, stable, and cool. We've successfully installed and configured
the operating system, established remote connectivity, verified all our hardware,
and pushed every core component to its limit to ensure it can handle the heat.

The system passed all tests with flying colors, proving that our component choices
and cooling setup were effective. Now that we have a solid and reliable foundation,
the real fun can begin. In the next post, we'll finally start using this machine
for its intended purpose: **running and training AI models**. Stay tuned!

assets/images/rig_back_detail.png (697 KB)
assets/images/rig_bios_1.jpg (343 KB)
assets/images/rig_bios_2.jpg (349 KB)
assets/images/rig_bios_3.jpg (396 KB)
