This guide explains how to install the HPC Performance Monitoring tool that I developed and shared at https://github.com/serdar-acir/HPC_Monitor, a real-time HPC performance monitoring tool with automatic node detection and basic benchmarking.
HPC Performance Monitoring Tool
A real-time HPC performance monitoring tool with automatic node detection and basic benchmarking. This repository contains a suite of Linux scripts designed for performance monitoring and resource benchmarking across compute nodes in an HPC environment.
Features
1. Hardware Data Collection
- Detects and collects essential hardware characteristics of compute nodes.
- Gathers details such as server type, CPU, GPU, RAM, interconnect, PCI devices, drives, partitions, and RAID configuration.
2. Resource Benchmarking
- Executes a series of predefined benchmarks to evaluate current available resources.
- Benchmarks include CPU and GPU performance, memory utilization, interconnect bandwidth and disk subsystem bandwidth.
- Benchmarks are run at user-defined intervals to ensure up-to-date monitoring.
3. Data Transmission
- The collected data is sent to a MySQL-based web server.
- The server hosts a performance monitoring GUI for comprehensive oversight of node performance and resource utilization.
Requirements
- Linux-based HPC environment: PHP7, iperf3, inxi
- Hosting environment: Apache2, PHP7, MySQL 8 (or MariaDB)
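Before installing, it can help to confirm that the compute-side prerequisites are on the PATH. The following is a minimal sketch, not part of the repository: the function name missing_tools is hypothetical, and it assumes the PHP interpreter is installed under the command name php.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPC_Monitor): print each command
# from the argument list that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools named in the requirements list for the HPC environment.
# Assumption: PHP is installed under the command name "php".
missing=$(missing_tools php iperf3 inxi)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on a single line.
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

Running it on the login node and on one compute node is usually enough to catch a missing package before the first benchmark run.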
Limitations
- This is the root-user version of the tool: root access to the HPC environment is currently required. Regular user accounts can be used on the hosting server.
Installation
Cloning
Clone this repository into a user account on the login node. Installing under a regular user account rather than root is preferable, because it keeps the directory accessible from all nodes; the root account is usually restricted to the head node.
git clone https://github.com/serdar-acir/HPC_Monitor.git
Setting up the hosting server
Set up a separate MariaDB-based web server on a hosting platform. Opening a hosting account is beyond the scope of this document, so I will not go into the details of this step. MySQL- or MariaDB-based hosting is required.
Upload the hosting_src directory to the hosting server and configure the HPC.config file for your HPC clusters. The GUI can manage multiple HPC clusters. Here is a sample HPC.config for two HPC clusters called HPC1 and HPC2.
<?php
//General configuration
date_default_timezone_set('Europe/Istanbul');
$clusters = ["HPC1", "HPC2"];
$descs = [
"15 compute nodes, 692 CPU Cores, 5.891 TB Memory, 384 TB Storage (+33.4 TB Local Storage), 19 GPUs, 102368 GPU Cores", //description of HPC1
"16 compute nodes, 240 CPU Cores, 4.3 TB Memory, 64 TB Storage, 12 GPUs, 29952 GPU Cores" //description of HPC2
];
//Database configuration
$database="database_name";
$host="localhost";
$user="username";
$password="password";
?>
Modify the timezone, cluster names and descriptions according to your own setup.
On the hosting server, set up your MariaDB database and open the hosting_src/setup.php page in a browser. Enter the required data and complete the installation.
Setting up the login node
On the login node, enter the login_node_src folder as root and configure the HPC1.config file for your specific HPC environment, as described in its README file. You can rename HPC1.config after your cluster, for example my_cluster.config; “my_cluster” will then be displayed in the monitoring tool as the cluster name.
Here is a sample HPC1.config.
<?php
$node_array = array ("login","cn01","cn02","cn03","cn04","cn05","cn06","cn07","cn08","cn09","cn10","cn11","cn12","cn13","cn14","cn15","cn16");
$home_ip = "1.2.3.4"; //ip address of the storage unit (preferably the fast interconnect ip address such as infiniband, RoCE etc.)
// if there are multiple storage units choose one.
$recording_host = "http://xxx.xxx.xxx/"; //the url of the web interface
//the web interface both collects the data via http/s port and serves as the performance GUI
?>
Modify node_array for your own HPC setup. home_ip is the IP address of the storage unit, usually reached through a high-speed interconnect such as InfiniBand or RoCE. Make sure you do not enter the IP address of the management Ethernet here.
recording_host is the URL (IP address or subdomain) of the hosting server where you will access the GUI web interface.
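A quick reachability check can catch typos in the node list before the first run. This is a sketch under assumptions, not something the repository ships: the function unreachable_nodes and the PROBE variable are hypothetical, and the default probe uses the Linux iputils ping flags -c (packet count) and -W (timeout in seconds).

```shell
#!/bin/sh
# Hypothetical helper (not part of HPC_Monitor).
# PROBE is the command used to test a single host. The default assumes
# Linux iputils ping; override PROBE for a dry run or a different check.
PROBE=${PROBE:-"ping -c 1 -W 2"}

# Print every host from the argument list that the probe cannot reach,
# one per line. Prints nothing when all hosts respond.
unreachable_nodes() {
    for node in "$@"; do
        # $PROBE is intentionally unquoted so its flags are word-split.
        if ! $PROBE "$node" >/dev/null 2>&1; then
            printf '%s\n' "$node"
        fi
    done
}

# Example call, using node names from the sample configuration:
# unreachable_nodes login cn01 cn02 cn03
```

Pass the same names that appear in node_array; any name printed back either does not resolve or does not answer from the login node.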
Collecting HPC infrastructure data
On the login node, run the data_collect.sh script to collect HPC infrastructure data.
./login_node_src/collect_data/data_collect.sh
This generates a number of .txt files describing the HPC infrastructure. Move all of the generated .txt files to the hosting server’s /run_as_root/ directory, then run the /run_as_root/set_others.php script on the hosting server (you can simply visit that page in a browser).
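Because the collection step produces many files, bundling them into a single archive can make the transfer to the hosting server less error-prone. The sketch below is one possible way to do that; the function name bundle_txt_files is hypothetical, and where the .txt files end up on your system is an assumption you should check before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPC_Monitor).
# Bundle every .txt file in a directory into one gzip-compressed tarball.
#   $1: directory holding the generated .txt files (assumed location)
#   $2: ABSOLUTE path of the archive to create
bundle_txt_files() {
    src_dir=$1
    archive=$2
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$src_dir" && tar -czf "$archive" ./*.txt )
}

# Example call (both paths are placeholders):
# bundle_txt_files "$PWD" /tmp/hpc_infrastructure.tar.gz
```

The archive can then be uploaded with whatever method your hosting provider offers (scp, SFTP, or a file manager) and unpacked in the /run_as_root/ directory.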
Crontab
Performance data must be collected from your HPC cluster and sent to the hosting server periodically. On the login node, add a crontab entry such as:
*/5 * * * * cd ~/HPC_Monitor/root_version && /usr/bin/php ~/HPC_Monitor/root_version/sap_cron2.php
This sets up a 5-minute benchmarking interval.
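If you prefer to install the entry from a script rather than through crontab -e, the following sketch appends it only when it is not already present, so rerunning the script does not create duplicates. The function name add_cron_entry is hypothetical; crontab -l and crontab - are the standard commands for listing and replacing the current user's crontab.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPC_Monitor).
# Read an existing crontab from stdin and print it back, appending the
# entry given as $1 only if an identical line is not already there.
add_cron_entry() {
    entry=$1
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -x matches whole lines, -F treats the entry as a fixed string.
    if ! printf '%s\n' "$existing" | grep -qxF -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Example use with the entry from this guide (left commented out so the
# sketch does not modify a crontab when it is merely sourced):
# ENTRY='*/5 * * * * cd ~/HPC_Monitor/root_version && /usr/bin/php ~/HPC_Monitor/root_version/sap_cron2.php'
# crontab -l 2>/dev/null | add_cron_entry "$ENTRY" | crontab -
```

Keeping the merge logic in a pure function makes it easy to preview the resulting crontab by dropping the final "| crontab -" stage.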
Now you can access the performance monitoring GUI through your web browser to view the collected data.
In the publicly released scripts I intentionally omitted the disk subsystem tests, so as not to put stress on the disk subsystem. If you choose to add them, make sure you do not impose an unnecessarily heavy workload on your storage devices.