This guide explains how to install the HPC Performance Monitoring tool that I developed and shared at https://github.com/serdar-acir/HPC_Monitor.
HPC Performance Monitoring Tool (PHP8)
A real-time HPC performance monitoring tool with automatic node detection and basic benchmarking. Please note that this is a challenging installation: multiple platforms must be managed and multiple tools installed. This document is not intended to provide support if you experience issues; you will need to diagnose and solve problems on your own. This repository contains a suite of Linux scripts designed for performance monitoring and resource benchmarking across compute nodes in an HPC environment.
Features
1. Hardware Data Collection
- Detects and collects essential hardware characteristics of compute nodes.
- Gathers details such as server type, CPU, GPU, RAM, interconnect, PCI devices, drives, partitions, RAID configuration, etc.
2. Resource Benchmarking
- Executes a series of predefined benchmarks to evaluate current available resources.
- Benchmarks include CPU and GPU performance, memory utilization, interconnect bandwidth, and disk subsystem bandwidth.
- Benchmarks are run at user-defined intervals to ensure up-to-date monitoring.
3. Data Transmission
- The collected data is sent to a MySQL-based web server.
- The server hosts a performance monitoring GUI for comprehensive oversight of node performance and resource utilization.
Requirements
- Good understanding of Linux commands and PHP programming (in case debugging is needed).
- Root access to all login and compute nodes.
- Outbound internet access from the login node.
- Linux-based HPC environment: PHP 8, iperf3, inxi, bc (on all login and compute nodes)
- Hosting environment: Apache 2, PHP 8, MySQL 8
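A quick way to confirm that the node-side prerequisites are in place is a small pre-flight check. This is only a sketch; the tool list is taken from the requirements above, and you would run it on each login and compute node.

```shell
# Pre-flight check for the node-side prerequisites listed above.
# Run on each login and compute node before installing.
for tool in php iperf3 inxi bc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool found"
  else
    echo "MISSING: $tool - install it before proceeding"
  fi
done
```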
Limitations
- This is the root-user version of the tool, so root access to the HPC environment is currently required. Regular user accounts can, however, be used on the hosting server.
Installation
Cloning
Clone this repository into a user environment on the login node. It is preferable to install into a regular user account rather than root, so that the directory is accessible from all nodes; the root account is usually restricted to the head node.
git clone https://github.com/serdar-acir/HPC_Monitor.git
Setting up the hosting server
Set up a separate MariaDB-based web server on a hosting platform. The details of this step are not covered here, as opening a hosting account is beyond the scope of this document. MySQL- or MariaDB-based hosting is required.
Upload the hosting_src directory to the hosting server and configure the HPC.config file for your HPC cluster. You can manage multiple HPC clusters with the GUI. Here is a sample HPC.config for two HPC clusters called HPC1 and HPC2.
<?php
//General configuration
date_default_timezone_set('Europe/Istanbul');
$clusters = ["HPC1", "HPC2"];
$descs = [
    "15 compute nodes, 692 CPU Cores, 5.891 TB Memory, 384 TB Storage (+33.4 TB Local Storage), 19 GPUs, 102368 GPU Cores", // description of HPC1
    "16 compute nodes, 240 CPU Cores, 4.3 TB Memory, 64 TB Storage, 12 GPUs, 29952 GPU Cores" // description of HPC2
];
// Database configuration
$database = "database_name";
$host = "localhost";
$user = "username";
$password = "password";
?>
Modify the timezone, cluster names and descriptions according to your own setup.
On the hosting server, set up your MariaDB database and visit the hosting_src/setup.php page in a browser. Enter the required data again and complete the installation. This will create the necessary links.
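If your hosting plan gives you shell or SQL-console access, the database and account that HPC.config points at can be created with a few statements. This is only a sketch: hpc_monitor, hpc_user, and change_me are placeholder names, not names the tool requires, and the values you choose must match $database, $user, and $password in HPC.config.

```shell
# Sketch only: hpc_monitor, hpc_user and change_me are placeholders.
# Run on the hosting server (or paste the SQL into your panel's SQL console).
mysql -u root -p <<'SQL'
CREATE DATABASE hpc_monitor CHARACTER SET utf8mb4;
CREATE USER 'hpc_user'@'localhost' IDENTIFIED BY 'change_me';
GRANT ALL PRIVILEGES ON hpc_monitor.* TO 'hpc_user'@'localhost';
FLUSH PRIVILEGES;
SQL
```

On shared hosting without shell access, the panel's database wizard accomplishes the same thing; only the resulting names need to be copied into HPC.config.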
Setting up the login node
On the login node, go to the login_node_src folder and configure the HPC1.config file for your specific HPC environment as described in its README file. Rename HPC1.config after your cluster, e.g. my_cluster.config; “my_cluster” will then be displayed as the cluster name in the monitoring tool.
Here is a sample HPC1.config.
<?php
$node_array = array ("login","cn01","cn02","cn03","cn04","cn05","cn06","cn07","cn08","cn09","cn10","cn11","cn12","cn13","cn14","cn15","cn16");
$home_ip = "b01"; // the name or IP address of the storage unit (preferably on the fast interconnect, such as InfiniBand or RoCE) to which root can log in automatically without a password
// if there are multiple storage units, choose one
$recording_host = "http://xxx.xxx.xxx/"; //the url of the web interface
//the web interface both collects the data via http/s port and serves as the performance GUI
?>
Modify node_array for your own HPC setup. The home_ip is the name or IP address of the storage unit (one of the metadata servers); prefer the address of a high-speed interconnect such as InfiniBand or RoCE.
The recording_host is the subdomain or IP address of the hosting server where you will access the GUI web interface (the URL of the hosting).
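Since the monitoring scripts log in to home_ip as root without a password, it is worth verifying the key-based login before continuing. A sketch; "b01" is the sample value from HPC1.config above, so substitute your own host name or address.

```shell
# Confirm passwordless root SSH to the storage host configured as $home_ip.
# "b01" is the sample value from HPC1.config above; substitute your own.
ssh -o BatchMode=yes -o ConnectTimeout=5 root@b01 'echo ssh-ok' \
  || echo "passwordless root SSH failed - set up SSH keys first"
```

BatchMode=yes makes ssh fail immediately instead of prompting for a password, so the check is safe to run non-interactively.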
Collecting HPC infrastructure data
On the login node, make the script files executable.
chmod +x *.sh
Then run the data_collect.sh script to collect HPC infrastructure data.
cd collect_data ; chmod +x data_collect.sh
./data_collect.sh
This will generate a number of .txt files describing the HPC infrastructure (you can run it as root if you do not have password access to the compute nodes). Move all of the generated .txt files to the hosting server's /run_as_root/ directory. Then simply visit the /run_as_root/set_others.php script on the hosting server in a browser (you do not need to be root on the hosting server). You should see a COMPLETED message in the browser. This will import the text files into the database and then delete them.
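The transfer and import step above can be done from the login node in two commands. This is a sketch under assumptions: the hosting user, host name, and remote path (user@hosting.example.com, /path/to/hosting_src/) are placeholders for your own hosting details, and it assumes the hosting account accepts scp.

```shell
# Placeholders: user@hosting.example.com and /path/to/hosting_src/ are
# examples, not values the tool defines. Adjust to your hosting account.
scp collect_data/*.txt user@hosting.example.com:/path/to/hosting_src/run_as_root/

# Trigger the import; the page should print COMPLETED, after which the
# imported .txt files are deleted on the server side.
curl -fsS "http://hosting.example.com/run_as_root/set_others.php"
```

If scp is not available on your hosting plan, uploading the .txt files through the panel's file manager and visiting set_others.php in a browser achieves the same result.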
Crontab
Now connect to your login server as root. You need to collect performance data from your HPC cluster and send it to the hosting server periodically. On the login node, add a crontab entry such as:
*/5 * * * * cd /HPC_Monitor/login_node_src/root_version && /usr/bin/php sap_cron2.php
Run the command manually as root first to check that data collection and reporting to the hosting server work correctly.
This cron entry runs the benchmarks at 5-minute intervals.
Now you can access the performance monitoring GUI through your web browser to view the collected data. Simply visit the URL of the hosting server.
In the public scripts that I released, I deliberately omitted the disk subsystem test code so as not to put stress on the disk subsystem. If you prefer to add it back, make sure you do not impose an unnecessarily heavy workload on your storage devices.