From demeler at gmail.com Wed Apr 1 15:17:54 2026
From: demeler at gmail.com (Borries Demeler)
Date: Wed, 1 Apr 2026 15:17:54 -0600
Subject: [Demelerlab] uslims slow
Message-ID:

Everyone,
our server is again experiencing issues with slowness. Emre and I cannot find the reason. I will have to reboot. Please save your work, log off, and send me an email when done.
Thanks,
-Borries
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From demeler at gmail.com Wed Apr 1 15:26:07 2026
From: demeler at gmail.com (Borries Demeler)
Date: Wed, 1 Apr 2026 15:26:07 -0600
Subject: [Demelerlab] server slowness
Message-ID:

I believe I found the culprit of the slowness. Killing all MPI jobs doing AUC analysis restored the regular speed. So us_mpi_analysis appears to be a problem.
I did not have to reboot. We should perhaps recompile the MPI libraries and make sure it is all updated. Saeed, can you please take care of that?
Let us all know when it is ready to go again, and perhaps Haben, Sophia, Sigang and Reece can retry their jobs then to see if it happens again?
Thanks, and sorry for the inconvenience.
-Borries
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From saeed.mortezazadeh25 at gmail.com Wed Apr 1 16:19:19 2026
From: saeed.mortezazadeh25 at gmail.com (Saeed Mortezazadeh)
Date: Wed, 1 Apr 2026 16:19:19 -0600
Subject: [Demelerlab] server slowness
In-Reply-To:
References:
Message-ID:

Sure! I'm going to update it now.
Saeed
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From saeed.mortezazadeh25 at gmail.com Wed Apr 1 16:30:12 2026
From: saeed.mortezazadeh25 at gmail.com (Saeed Mortezazadeh)
Date: Wed, 1 Apr 2026 16:30:12 -0600
Subject: [Demelerlab] server slowness
In-Reply-To:
References:
Message-ID:

us_mpi_analysis is updated!
-Saeed
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From lukas.dobler at uni-konstanz.de Thu Apr 2 03:39:34 2026
From: lukas.dobler at uni-konstanz.de (Lukas Dobler)
Date: Thu, 2 Apr 2026 11:39:34 +0200
Subject: [Demelerlab] server slowness
In-Reply-To:
References:
Message-ID:

When investigating something else, I noticed that when starting UltraScan applications the following output was present:

libGL error: glx: failed to create dri3 screen
libGL error: failed to load driver: nouveau

I never noticed this before. When running journalctl | grep -iE "nouveau|libGL error" I noticed that these errors have occurred since the last restart of demeler9. When I checked demeler2, I observed the same output, but the journal entries go further back.

$ journalctl | grep -iE "nouveau|libGL error"
Mar 30 16:18:02 nrch.umt.edu dracut[56122]: -rw-r--r--   2 root  root            0 Sep  4  2024 etc/modprobe.d/nvidia-installer-disable-nouveau.conf
Mar 30 16:18:07 nrch.umt.edu dracut[56122]: -rw-r--r--   2 root  root           76 Sep  4  2024 usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf
Mar 30 16:18:07 nrch.umt.edu dracut[56122]: drwxr-xr-x   2 root  root            0 Dec 17 21:52 usr/lib/modules/5.15.0-318.199.3.2.el8uek.x86_64/kernel/drivers/gpu/drm/nouveau
Mar 30 16:18:07 nrch.umt.edu dracut[56122]: -rw-r--r--   1 root  root       853360 Dec 17 21:52 usr/lib/modules/5.15.0-318.199.3.2.el8uek.x86_64/kernel/drivers/gpu/drm/nouveau/nouveau.ko.xz
Mar 30 16:21:12 nrch.umt.edu org.gnome.Shell.desktop[70407]: libGL error: glx: failed to create dri3 screen
Mar 30 16:21:12 nrch.umt.edu org.gnome.Shell.desktop[70407]: libGL error: failed to load driver: nouveau
Mar 30 16:21:12 nrch.umt.edu org.gnome.Shell.desktop[70407]: libGL error: glx: failed to create dri3 screen
Mar 30 16:21:12 nrch.umt.edu org.gnome.Shell.desktop[70407]: libGL error: failed to load driver: nouveau

So basically, as soon as the driver changes were done, the errors started within 3 minutes.

With this change, there is no longer any hardware acceleration for rendering the screen; the CPU is doing it instead. This would especially affect the VNC sessions, on top of the network latency. It also matches the observation that killing the MPI jobs helped to reduce the slowness, because they heavily use CPU and memory.

When looking at the running processes, I also noticed that things like gnome-shell would spike, especially when dragging something. More notably, if you close a dialog box (for example the run details in us_edit), the CPU load spikes after closing, due to the render updates around the desktop manager.

btop runs over SSH so that it does not add to the load on gnome-shell. What you see is me opening Firefox and going to YouTube, where I currently also can't watch a video; YouTube says that the browser can't play it.

To verify:
- The current NVIDIA driver 580.95.05 is installed (grep "NVIDIA GLX Module" /var/log/Xorg.0.log or nvidia-smi -q)
- The open-source nouveau driver is blacklisted at kernel level (grep -r "nouveau" /etc/modprobe.d/ /usr/lib/modprobe.d/)
- The GPUs are not configured to contribute to the display output (nvidia-smi -q | grep -A 2 "Display" gives "Display Active: Disabled" for all GPUs)
- This forces the VNC session to use software rendering (glxinfo -B | grep "OpenGL renderer" returns llvmpipe, the software renderer)
- All gnome-shell sessions combined run as many llvmpipe threads as the CPU has threads (to list the subthreads of gnome-shell: ps -T -C gnome-shell | awk '{print $5}' | sort | uniq -c | sort -nr)

According to AI the chain of effect is:

Why it worked with Nouveau:
Nouveau is deeply integrated into the standard Linux kernel and the open-source Mesa graphics stack. It fully supports Kernel Mode Setting (KMS). Because of this deep integration, standard display servers (like Xorg) can automatically detect and initialize nouveau to provide basic 2D and 3D hardware acceleration via standard generic interfaces, even without physical monitors attached or an xorg.conf file present.

Why it fails with the Proprietary Driver:
The proprietary NVIDIA driver is closed-source and operates outside the standard Linux KMS framework. It strictly relies on its own proprietary modules (glxserver_nvidia). By default, the proprietary driver expects a physical monitor to be connected to initialize a rendering screen. Because your Tesla V100s are headless compute cards, the proprietary driver sees zero monitors. Without a physical monitor, and without an explicit xorg.conf file instructing it to create a "Virtual" off-screen buffer, the NVIDIA driver simply refuses to initialize the display engine. Consequently, Xorg crashes out of the hardware acceleration attempt and falls back to CPU software rendering.

From my understanding, the VNC always used software rendering, but with the default driver the defaults around rendering, and especially OpenGL, seem to have prevented this from happening. To verify this, I tested the current main on Konstanz and ASTFVM, neither of which had any GPU-related changes, and wasn't able to observe the same issues there.

Have a nice day
Lukas

*Lukas Dobler*, M.Sc.
Ph.D. student
Universität Konstanz
AG Prof. Cölfen
Fachbereich Chemie
Universitätsstraße 10, Box 714
78464 Konstanz
Raum L 1050
Tel. +49 (0)7531 88 2019
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: otzd1GdqQWtcBwhz.png
Type: image/png
Size: 125475 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 5060 bytes
Desc: S/MIME Cryptographic Signature
URL:
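
The renderer check from the "To verify" list in Lukas's message (glxinfo -B | grep "OpenGL renderer") could be wrapped in a small helper so it is easy to repeat on each host. This is only a sketch, not something taken from the thread: the function name classify_renderer is hypothetical, and treating "softpipe" as a second software renderer next to llvmpipe is an assumption about Mesa's naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a `glxinfo -B` report read from stdin.
# Prints "software" if the renderer line names a Mesa software renderer,
# "hardware" for any other renderer, and "unknown" if no renderer line exists.
classify_renderer() {
    local line
    # Take the first "OpenGL renderer" line; tolerate its absence.
    line=$(grep -m1 'OpenGL renderer' || true)
    if [ -z "$line" ]; then
        echo unknown
    elif printf '%s\n' "$line" | grep -qiE 'llvmpipe|softpipe'; then
        # Assumption: llvmpipe/softpipe in the renderer string means CPU rendering.
        echo software
    else
        echo hardware
    fi
}
```

Inside a VNC session it would be called as glxinfo -B | classify_renderer. A result of "software" would match the llvmpipe finding described above, but it does not by itself say why hardware rendering is unavailable.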