Commit 7e9f339

initial commit
1 parent 2a56da9 commit 7e9f339

28 files changed: +9576 -7 lines changed

CHANGELOG.md

Whitespace-only changes.

README.md

Lines changed: 77 additions & 7 deletions
@@ -1,17 +1,87 @@
1-
## My Project
1+
# Grafana Dashboard for AWS ParallelCluster
22

3-
TODO: Fill this README out!
3+
This is a sample solution based on Grafana for monitoring the various components of an HPC cluster built with AWS ParallelCluster.
4+
There are 6 dashboards that can be used as they are or customized as you need.
5+
* ParallelCluster Stats - this is the main dashboard that shows general monitoring info and metrics for the whole cluster. It includes Slurm metrics and storage performance metrics.
6+
* Master Node Details - this dashboard shows detailed metrics for the Master node, including CPU, Memory, Network and Storage usage.
7+
* Compute Node List - this dashboard shows the list of the available compute nodes. Each entry is a link to a more detailed page.
8+
* Compute Node Details - similar to the Master Node Details dashboard, this dashboard shows the same metrics for the compute nodes.
9+
* Cluster Logs - this dashboard shows all the logs of your HPC cluster. The logs are pushed by AWS ParallelCluster to Amazon CloudWatch Logs and finally reported here.
10+
* Cluster Costs (beta / in development) - this dashboard shows the cost associated with every AWS service used by your cluster. It includes: EC2, EBS, FSx, S3, EFS.
411

5-
Be sure to:
612

7-
* Change the title in this README
8-
* Edit your repository description on GitHub
13+
## AWS ParallelCluster
14+
**AWS ParallelCluster** is an AWS supported Open Source cluster management tool that makes it easy for you to deploy and
15+
manage High Performance Computing (HPC) clusters in the AWS cloud.
16+
It automatically sets up the required compute resources and a shared filesystem and offers a variety of batch schedulers such as AWS Batch, SGE, Torque, and Slurm.
17+
* More info on: https://aws.amazon.com/hpc/parallelcluster/
18+
* Source code on GitHub: https://github.com/aws/aws-parallelcluster
19+
* Official Documentation: https://docs.aws.amazon.com/parallelcluster/
20+
21+
22+
## Solution components
23+
This project is built with the following components:
24+
25+
* **Grafana** is an open-source (https://github.com/grafana/grafana) platform for monitoring and observability. Grafana allows you to query, visualize, alert on and understand your metrics as well as create, explore, and share dashboards, fostering a data-driven culture.
26+
* **Prometheus** is an open-source (https://github.com/prometheus/prometheus/) project for systems and service monitoring from the Cloud Native Computing Foundation (https://cncf.io/). It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
27+
* The **Prometheus Pushgateway** is an open-source (https://github.com/prometheus/pushgateway/) tool that allows ephemeral and batch jobs to expose their metrics to Prometheus.
28+
* **Nginx** (http://nginx.org/) is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server.
29+
* **Prometheus-Slurm-Exporter** (https://github.com/vpenso/prometheus-slurm-exporter/) is a Prometheus collector and exporter for metrics extracted from the Slurm (https://slurm.schedmd.com/overview.html) resource scheduling system.
30+
* **Node_exporter** (https://github.com/prometheus/node_exporter) is a Prometheus exporter for hardware and OS metrics exposed by \*NIX kernels, written in Go with pluggable metric collectors.
31+
32+
Note: *almost all components are under the Apache2 license; only **Prometheus-Slurm-Exporter is licensed under GPLv3**. You need to be aware of this and accept the license terms before installing this component.*
33+
34+
35+
## Example Dashboards
36+
37+
![ParallelCluster](docs/ParallelCluster.png?raw=true "AWS ParallelCluster")
38+
39+
![Master](docs/Master.png?raw=true "Master Node")
40+
41+
![Compute Node List](docs/List.png?raw=true "Compute Node List")
42+
43+
![Logs](docs/Logs.png?raw=true "AWS ParallelCluster Logs")
44+
45+
![Costs](docs/Costs.png?raw=true "Best - AWS ParallelCluster Costs")
46+
47+
48+
## How to use it
49+
50+
You can use the post-install script found in this GitHub (http://link/) repo as it is, or customize it as you need. For instance, you might want to change your Grafana password to something more secure and meaningful for you, or you might want to customize some dashboards by adding additional components to monitor.
51+
The proposed post-install script takes care of installing and configuring everything for you. However, a few additional parameters are needed in the AWS ParallelCluster config file: the post_install_args, additional IAM policies, a security group, and a tag. Please note that, at the moment, the post-install script has only been tested using Amazon Linux 2 (https://aws.amazon.com/amazon-linux-2/).
52+
53+
```
54+
base_os = alinux2
55+
56+
post_install = s3://<my-bucket-name>/grafana-post-install.sh
57+
58+
post_install_args = "https://github.com/aws-samples/aws-parallelcluster-monitoring/archive/main.zip"
59+
60+
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
61+
62+
tags = {"Grafana" : "true"}
63+
```
64+
65+
Make sure that port 80 and port 443 of your master node are accessible from the internet (or from your network). You can achieve this by creating the appropriate security group via the AWS web console or via the Command Line Interface (CLI (https://docs.aws.amazon.com/cli/index.html)); see the example below:
66+
67+
```
68+
aws ec2 create-security-group --group-name my-grafana-sg --description "Open Grafana dashboard ports" --vpc-id vpc-1a2b3c4d
69+
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 443 --cidr 0.0.0.0/0
70+
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 80 --cidr 0.0.0.0/0
71+
```
72+
73+
More information on how to create your security groups is available here (https://docs.aws.amazon.com/cli/latest/userguide/cli-services-ec2-sg.html#creating-a-security-group).
74+
Finally, set the additional_sg parameter in the [VPC] section of your ParallelCluster config file.
75+
After your cluster is created, open a web browser and connect to https://your_public_ip (https://your_public_ip/). A landing page will be presented with links to the Prometheus database service and the Grafana dashboards.
76+
77+
78+
Note: Because the compute nodes continuously push metrics to the master node, the master node handles a higher volume of network traffic.
79+
If you expect to run a large-scale cluster (hundreds of instances), we recommend using an instance type slightly bigger than the one you planned for your master node.
980

1081
## Security
1182

1283
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
1384

1485
## License
1586

16-
This library is licensed under the MIT-0 License. See the LICENSE file.
17-
87+
This library is licensed under the MIT-0 License. See the LICENSE file.

custom-metrics/1h-cost-metrics.sh

Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
#!/bin/bash
2+
3+
#source the AWS ParallelCluster profile
4+
. /etc/parallelcluster/cfnconfig
5+
6+
export AWS_DEFAULT_REGION=$cfn_region
7+
aws_region_long_name=$(python /usr/local/bin/aws-region.py $cfn_region)
8+
9+
masterInstanceType=$(ec2-metadata -t | awk '{print $2}')
10+
masterInstanceId=$(ec2-metadata -i | awk '{print $2}')
11+
s3_bucket=$(echo $cfn_postinstall | sed "s/s3:\/\///g;s/\/.*//")
12+
s3_size_gb=$(echo "$(aws s3api list-objects --bucket $s3_bucket --output json --query "[sum(Contents[].Size)]"| sed -n 2p | tr -d ' ') / 1024 / 1024 / 1024" | bc)
13+
14+
15+
#retrieve the s3 cost
16+
if [[ $s3_size_gb -le 51200 ]]; then
17+
s3_range=51200
18+
elif [[ $s3_size_gb -le 512000 ]]; then
19+
s3_range=512000
20+
else
21+
s3_range="Inf"
22+
fi
23+
24+
####################### S3 #########################
25+
26+
s3_cost_gb_month=$(aws --region us-east-1 pricing get-products \
27+
--service-code AmazonS3 \
28+
--filters 'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
29+
'Type=TERM_MATCH,Field=storageClass,Value=General Purpose' \
30+
--query 'PriceList[0]' --output text \
31+
| jq -r --arg endRange $s3_range '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[].value | select(.endRange==$endRange).pricePerUnit.USD')
32+
33+
s3=$(echo "scale=2; $s3_cost_gb_month * $s3_size_gb / 720" | bc)
34+
echo "s3_cost $s3" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/cost
35+
36+
37+
####################### Master #########################
38+
master_node_h_price=$(aws pricing get-products \
39+
--region us-east-1 \
40+
--service-code AmazonEC2 \
41+
--filters 'Type=TERM_MATCH,Field=instanceType,Value='$masterInstanceType \
42+
'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
43+
'Type=TERM_MATCH,Field=preInstalledSw,Value=NA' \
44+
'Type=TERM_MATCH,Field=operatingSystem,Value=Linux' \
45+
'Type=TERM_MATCH,Field=tenancy,Value=Shared' \
46+
'Type=TERM_MATCH,Field=capacitystatus,Value=UnusedCapacityReservation' \
47+
--output text \
48+
--query 'PriceList' \
49+
| jq -r '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[] | .value.pricePerUnit.USD')
50+
51+
echo "master_node_cost $master_node_h_price" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/cost
52+
53+
54+
####################### FSX #########################
55+
fsx_size_gb=$(aws cloudformation describe-stacks --stack-name $stack_name --region $cfn_region \
56+
| jq -r '.Stacks[0].Parameters | map(select(.ParameterKey == "FSXOptions"))[0].ParameterValue' \
57+
| awk -F "," '{print $3}')
58+
59+
fsx_type=$(aws cloudformation describe-stacks --stack-name $stack_name --region $cfn_region \
60+
| jq -r '.Stacks[0].Parameters | map(select(.ParameterKey == "FSXOptions"))[0].ParameterValue' \
61+
| awk -F "," '{print $9}')
62+
63+
fsx_throughput=$(aws cloudformation describe-stacks --stack-name $stack_name --region $cfn_region \
64+
| jq -r '.Stacks[0].Parameters | map(select(.ParameterKey == "FSXOptions"))[0].ParameterValue' \
65+
| awk -F "," '{print $10}')
66+
67+
if [[ $fsx_type = "SCRATCH_2" ]] || [[ $fsx_type = "SCRATCH_1" ]]; then
68+
fsx_cost_gb_month=$(aws pricing get-products \
69+
--region us-east-1 \
70+
--service-code AmazonFSx \
71+
--filters 'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
72+
'Type=TERM_MATCH,Field=fileSystemType,Value=Lustre' \
73+
'Type=TERM_MATCH,Field=throughputCapacity,Value=N/A' \
74+
--output text \
75+
--query 'PriceList' \
76+
| jq -r '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[] | .value.pricePerUnit.USD')
77+
78+
elif [[ $fsx_type = "PERSISTENT_1" ]]; then
79+
fsx_cost_gb_month=$(aws pricing get-products \
80+
--region us-east-1 \
81+
--service-code AmazonFSx \
82+
--filters 'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
83+
'Type=TERM_MATCH,Field=fileSystemType,Value=Lustre' \
84+
'Type=TERM_MATCH,Field=throughputCapacity,Value='$fsx_throughput \
85+
--output text \
86+
--query 'PriceList' \
87+
| jq -r '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[] | .value.pricePerUnit.USD')
88+
89+
else
90+
fsx_cost_gb_month=0
91+
fi
92+
93+
fsx=$(echo "scale=2; $fsx_cost_gb_month * $fsx_size_gb / 720" | bc)
94+
echo "fsx_cost $fsx" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/cost
95+
96+
97+
#parametrize:
98+
ebs_volume_total_cost=0
99+
ebs_volume_ids=$(aws ec2 describe-instances --instance-ids $masterInstanceId \
100+
| jq -r '.Reservations | to_entries[].value | .Instances | to_entries[].value | .BlockDeviceMappings | to_entries[].value | .Ebs.VolumeId')
101+
102+
for ebs_volume_id in $ebs_volume_ids
103+
do
104+
ebs_volume_type=$(aws ec2 describe-volumes --volume-ids $ebs_volume_id | jq -r '.Volumes | to_entries[].value.VolumeType')
105+
#ebs_volume_iops=$(aws ec2 describe-volumes --volume-ids $ebs_volume_id | jq -r '.Volumes | to_entries[].value.Iops')
106+
ebs_volume_size=$(aws ec2 describe-volumes --volume-ids $ebs_volume_id | jq -r '.Volumes | to_entries[].value.Size')
107+
108+
ebs_cost_gb_month=$(aws --region us-east-1 pricing get-products \
109+
--service-code AmazonEC2 \
110+
--query 'PriceList' \
111+
--output text \
112+
--filters 'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
113+
'Type=TERM_MATCH,Field=productFamily,Value=Storage' \
114+
'Type=TERM_MATCH,Field=volumeApiName,Value='$ebs_volume_type \
115+
| jq -r '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[] | .value.pricePerUnit.USD')
116+
117+
ebs_volume_cost=$(echo "scale=2; $ebs_cost_gb_month * $ebs_volume_size / 720" | bc)
118+
ebs_volume_total_cost=$(echo "scale=2; $ebs_volume_total_cost + $ebs_volume_cost" | bc)
119+
done
120+
121+
echo "ebs_master_cost $ebs_volume_total_cost" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/cost
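
The tier selection and hourly proration used by this script can be sketched in Python as follows. This is only an illustrative restatement under stated assumptions, not part of the commit: the function names are hypothetical, the price passed in the usage note is a made-up value rather than a real AWS price, and the result is left unrounded (the script's `bc` call with `scale=2` truncates to two decimals instead).

```python
# Hypothetical helper names; thresholds and the 720-hour month mirror the shell script.
HOURS_PER_MONTH = 720  # 30 days * 24 hours, the divisor used in the script


def s3_end_range(size_gb: float) -> str:
    """Return the endRange value of the pricing tier that contains size_gb.

    The value is a string because the script passes it to jq as a string
    and compares it with the endRange field of each price dimension.
    """
    if size_gb <= 51200:       # first tier: up to 50 TB
        return "51200"
    if size_gb <= 512000:      # second tier: up to 500 TB
        return "512000"
    return "Inf"               # everything above the second tier


def hourly_cost(price_per_gb_month: float, size_gb: float) -> float:
    """Prorate a monthly per-GB price to an hourly cost for size_gb."""
    return price_per_gb_month * size_gb / HOURS_PER_MONTH
```

For example, with an illustrative price of 0.023 per GB-month, `hourly_cost(0.023, 7200)` comes to roughly 0.23 per hour, and `s3_end_range(60000)` selects the `"512000"` tier.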

custom-metrics/1m-cost-metrics.sh

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
#!/bin/bash
2+
3+
#source the AWS ParallelCluster profile
4+
. /etc/parallelcluster/cfnconfig
5+
6+
export AWS_DEFAULT_REGION=$cfn_region
7+
aws_region_long_name=$(python /usr/local/bin/aws-region.py $cfn_region)
8+
computeInstanceType=$(aws cloudformation describe-stacks --stack-name $stack_name --region $cfn_region | jq -r '.Stacks[0].Parameters | map(select(.ParameterKey == "ComputeInstanceType"))[0].ParameterValue')
9+
10+
compute_node_h_price=$(aws pricing get-products \
11+
--region us-east-1 \
12+
--service-code AmazonEC2 \
13+
--filters 'Type=TERM_MATCH,Field=instanceType,Value='$computeInstanceType \
14+
'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
15+
'Type=TERM_MATCH,Field=preInstalledSw,Value=NA' \
16+
'Type=TERM_MATCH,Field=operatingSystem,Value=Linux' \
17+
'Type=TERM_MATCH,Field=tenancy,Value=Shared' \
18+
'Type=TERM_MATCH,Field=capacitystatus,Value=UnusedCapacityReservation' \
19+
--output text \
20+
--query 'PriceList' \
21+
| jq -r '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[] | .value.pricePerUnit.USD')
22+
23+
#ebs_volume_id=$(aws ec2 describe-instances --instance-ids $computeInstanceId \
24+
# | jq -r '.Reservations | to_entries[].value | .Instances | to_entries[].value | .BlockDeviceMappings | to_entries[].value | .Ebs.VolumeId' \
25+
# | tail -1) #remove this tail
26+
27+
#ebs_volume_type=$(aws ec2 describe-volumes --volume-ids $ebs_volume_id | jq -r '.Volumes | to_entries[].value.VolumeType')
28+
#ebs_volume_iops=$(aws ec2 describe-volumes --volume-ids $ebs_volume_id | jq -r '.Volumes | to_entries[].value.Iops')
29+
ebs_volume_size=$(aws cloudformation describe-stacks --stack-name $stack_name --region $cfn_region | jq -r '.Stacks[0].Parameters | map(select(.ParameterKey == "ComputeRootVolumeSize"))[0].ParameterValue')
30+
31+
#check if volumeApiName can change in the future, for now "gp2" is hardcoded
32+
ebs_cost_gb_month=$(aws --region us-east-1 pricing get-products \
33+
--service-code AmazonEC2 \
34+
--query 'PriceList' \
35+
--output text \
36+
--filters 'Type=TERM_MATCH,Field=location,Value='"${aws_region_long_name}" \
37+
'Type=TERM_MATCH,Field=productFamily,Value=Storage' \
38+
'Type=TERM_MATCH,Field=volumeApiName,Value=gp2' \
39+
| jq -r '.terms.OnDemand | to_entries[] | .value.priceDimensions | to_entries[] | .value.pricePerUnit.USD')
40+
41+
total_num_compute_nodes=$(/opt/slurm/bin/sinfo -O "nodes" --noheader)
42+
compute_ebs_volume_cost=$(echo "scale=2; $ebs_cost_gb_month * $total_num_compute_nodes * $ebs_volume_size / 720" | bc)
43+
compute_nodes_cost=$(echo "scale=2; $total_num_compute_nodes * $compute_node_h_price" | bc)
44+
45+
46+
echo "ebs_compute_cost $compute_ebs_volume_cost" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/cost
47+
echo "compute_nodes_cost $compute_nodes_cost" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/cost
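
The arithmetic at the end of this script, and the one-line payload it pipes to the Pushgateway, can be sketched in Python as below. This is a sketch under assumptions rather than the author's implementation: all function names are hypothetical, and `total_nodes` assumes that `sinfo` may print one node count per line (for instance one per partition or state group), in which case the counts would need to be summed before being used in a formula.

```python
# Hypothetical helpers that restate the shell arithmetic; not part of the commit.
HOURS_PER_MONTH = 720  # same divisor as the script


def total_nodes(sinfo_output: str) -> int:
    """Sum the whitespace-separated integers printed by sinfo."""
    return sum(int(token) for token in sinfo_output.split())


def compute_costs(num_nodes, node_hourly_price, ebs_price_gb_month, ebs_size_gb):
    """Return (hourly cost of the compute nodes, hourly cost of their root volumes)."""
    nodes_cost = num_nodes * node_hourly_price
    ebs_cost = ebs_price_gb_month * num_nodes * ebs_size_gb / HOURS_PER_MONTH
    return nodes_cost, ebs_cost


def format_metric(name: str, value: float) -> str:
    """Build one line in the Prometheus text format.

    The trailing newline matters: `echo` adds it in the shell script, and
    the text format expects every line to end with a line feed.
    """
    return f"{name} {value}\n"
```

A caller could then post `format_metric("compute_nodes_cost", nodes_cost)` to the same `/metrics/job/cost` URL that the script uses with `curl --data-binary`.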

custom-metrics/aws-region.py

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
import json
2+
import sys
3+
4+
from pkg_resources import resource_filename
5+
6+
region = str(sys.argv[1])
7+
8+
name = None
9+
endpoint_file = resource_filename('botocore', 'data/endpoints.json')
10+
with open(endpoint_file, 'r') as ep_file:
11+
data = json.load(ep_file)
12+
for partition in data['partitions']:
13+
if region in partition['regions']:
14+
name = partition['regions'][region]['description']
15+
break
16+
17+
print(name)
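
The lookup performed by `aws-region.py` can also be expressed as a small function over already-parsed data, which makes it easy to try without botocore installed. This is a minimal sketch: the function name `region_description` and the `sample` dictionary are hypothetical, and the sample only imitates the `partitions` / `regions` / `description` shape that the script reads from botocore's `endpoints.json`.

```python
from typing import Optional


def region_description(endpoints: dict, region: str) -> Optional[str]:
    """Return the long name of a region code, or None if it is unknown."""
    for partition in endpoints.get("partitions", []):
        regions = partition.get("regions", {})
        if region in regions:
            return regions[region].get("description")
    return None


# Illustrative data only; real values come from botocore's endpoints.json.
sample = {
    "partitions": [
        {"regions": {"eu-west-1": {"description": "Europe (Ireland)"}}},
    ]
}

print(region_description(sample, "eu-west-1"))
```

Returning `None` for an unknown region lets a caller detect the failure explicitly, whereas the script prints the literal text `None`, which would then be passed on as the `location` filter value.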
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
version: '3.8'
2+
services:
3+
prometheus-node-exporter:
4+
container_name: node-exporter
5+
network_mode: host
6+
pid: host
7+
restart: unless-stopped
8+
volumes:
9+
- '/:/host:ro,rslave'
10+
image: quay.io/prometheus/node-exporter
11+
command:
12+
- '--path.rootfs=/host'
Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,67 @@
1+
version: '3.8'
2+
services:
3+
pushgateway:
4+
container_name: pushgateway
5+
network_mode: host
6+
pid: host
7+
restart: unless-stopped
8+
ports:
9+
- '9091:9091'
10+
image: prom/pushgateway
11+
prometheus:
12+
container_name: prometheus
13+
network_mode: host
14+
pid: host
15+
restart: unless-stopped
16+
ports:
17+
- '9090:9090'
18+
volumes:
19+
- '/home/$cfn_cluster_user/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml'
20+
- 'prometheus-data:/prometheus'
21+
image: prom/prometheus
22+
command:
23+
- '--config.file=/etc/prometheus/prometheus.yml'
24+
- '--storage.tsdb.path=/prometheus'
25+
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
26+
- '--web.console.templates=/usr/share/prometheus/consoles'
27+
- '--web.external-url=/prometheus/'
28+
- '--web.route-prefix=/'
29+
grafana:
30+
container_name: grafana
31+
network_mode: host
32+
pid: host
33+
restart: unless-stopped
34+
ports:
35+
- '3000:3000'
36+
environment:
37+
- 'GF_SECURITY_ADMIN_PASSWORD=Grafana4PC!'
38+
- 'GF_SERVER_ROOT_URL=http://%(domain)s/grafana/'
39+
volumes:
40+
- '/home/$cfn_cluster_user/grafana:/etc/grafana/provisioning'
41+
- 'grafana-data:/var/lib/grafana'
42+
image: grafana/grafana
43+
prometheus-node-exporter:
44+
container_name: node-exporter
45+
network_mode: host
46+
pid: host
47+
restart: unless-stopped
48+
volumes:
49+
- '/:/host:ro,rslave'
50+
image: quay.io/prometheus/node-exporter
51+
command:
52+
- '--path.rootfs=/host'
53+
nginx:
54+
container_name: nginx
55+
network_mode: host
56+
pid: host
57+
ports:
58+
- '443:443'
59+
restart: unless-stopped
60+
volumes:
61+
- '/home/$cfn_cluster_user/nginx/conf.d:/etc/nginx/conf.d/'
62+
- '/home/$cfn_cluster_user/nginx/ssl:/etc/ssl/'
63+
- '/home/$cfn_cluster_user/www:/usr/share/nginx/html'
64+
image: nginx
65+
volumes:
66+
prometheus-data:
67+
grafana-data:

docs/Costs.png

169 KB

docs/List.png

125 KB

docs/Logs.png

575 KB
