
Reducing number of compute resources too aggressively #220

Closed
gwolski opened this issue Apr 11, 2024 · 2 comments · Fixed by #235
gwolski commented Apr 11, 2024

I'm building a cluster with just nine instance types, and certain instance types are being culled to "reduce number of CRs" - this is unnecessary, as I do not have many compute resources.

Config file has:

InstanceConfig:
  UseSpot: false
  NodeCounts:
    # @todo: Update the max number of each instance type to configure
    DefaultMaxCount: 10
  Include:
    InstanceTypes:
      - m7a.large
      - m7a.xlarge
      - m7a.2xlarge
      - m7a.4xlarge
      - r7a.large
      - r7a.xlarge
      - r7a.2xlarge
      - r7a.4xlarge
      - r7a.8xlarge

It then buckets appropriately:
INFO: Instance type by memory and core:
INFO:     6 unique memory size:
INFO:         8 GB
INFO:             1 instance type with 2 core(s): ['m7a.large']
INFO:         16 GB
INFO:             1 instance type with 2 core(s): ['r7a.large']
INFO:             1 instance type with 4 core(s): ['m7a.xlarge']
INFO:         32 GB
INFO:             1 instance type with 4 core(s): ['r7a.xlarge']
INFO:             1 instance type with 8 core(s): ['m7a.2xlarge']
INFO:         64 GB
INFO:             1 instance type with 8 core(s): ['r7a.2xlarge']
INFO:             1 instance type with 16 core(s): ['m7a.4xlarge']
INFO:         128 GB
INFO:             1 instance type with 16 core(s): ['r7a.4xlarge']
INFO:         256 GB
INFO:             1 instance type with 32 core(s): ['r7a.8xlarge']
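
For context, the bucketing above is just a grouping of instance types by (memory, cores). Here is a minimal sketch of the idea, with the sizes hard-coded from the log output above rather than discovered at runtime as the tool presumably does:

from collections import defaultdict

# (memory GB, cores) for the nine configured types, taken from the log above.
INSTANCE_SPECS = {
    'm7a.large':   (8, 2),    'm7a.xlarge':  (16, 4),
    'm7a.2xlarge': (32, 8),   'm7a.4xlarge': (64, 16),
    'r7a.large':   (16, 2),   'r7a.xlarge':  (32, 4),
    'r7a.2xlarge': (64, 8),   'r7a.4xlarge': (128, 16),
    'r7a.8xlarge': (256, 32),
}

# Group instance types that share the same (memory, cores) combination.
buckets = defaultdict(list)
for instance_type, (mem_gb, cores) in INSTANCE_SPECS.items():
    buckets[(mem_gb, cores)].append(instance_type)

for (mem_gb, cores), types in sorted(buckets.items()):
    print(f"{mem_gb} GB, {cores} core(s): {types}")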

But then it starts culling unnecessarily, as ParallelCluster/Slurm can handle 9 compute resources...

INFO: Configuring od-8-gb queue:
INFO:     Adding od-8gb-2-cores compute resource: ['m7a.large']
INFO: Configuring od-16-gb queue:
INFO:     Adding od-16gb-2-cores compute resource: ['r7a.large']
INFO:     Skipping od-16gb-4-cores compute resource: ['m7a.xlarge'] to reduce number of CRs.
INFO: Configuring od-32-gb queue:
INFO:     Adding od-32gb-4-cores compute resource: ['r7a.xlarge']
INFO:     Skipping od-32gb-8-cores compute resource: ['m7a.2xlarge'] to reduce number of CRs.
INFO: Configuring od-64-gb queue:
INFO:     Adding od-64gb-8-cores compute resource: ['r7a.2xlarge']
INFO:     Skipping od-64gb-16-cores compute resource: ['m7a.4xlarge'] to reduce number of CRs.
INFO: Configuring od-128-gb queue:
INFO:     Adding od-128gb-16-cores compute resource: ['r7a.4xlarge']
INFO: Configuring od-256-gb queue:
INFO:     Adding od-256gb-32-cores compute resource: ['r7a.8xlarge']
INFO: Created 6 queues with 6 compute resources

I would like to have a 16-core 64 GB machine, an 8-core 32 GB machine, etc. How do I disable/modify this "culling"? I would argue we should only start culling when we exceed what ParallelCluster can handle.

We can now have 50 Slurm queues per cluster, 50 compute resources per queue, and 50 compute resources per cluster! See:
https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration-of-multiple-queues-v3.html


gwolski commented Apr 13, 2024

I've found that by commenting out the following three lines in source/cdk/cdk_slurm_stack.py I could turn off the reduction code (at line 2770 in the code I have):

                    if len(parallel_cluster_queue['ComputeResources']):
                        logger.info(f"    Skipping {compute_resource_name:18} compute resource: {instance_types} to reduce number of CRs.")
                        continue

The next line checks whether I've exceeded MAX_NUMBER_OF_COMPUTE_RESOURCES, so there is still a check in case my configuration is too large.
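
For reference, here is a minimal sketch of what the loop looks like with those three lines removed, so that only the hard ParallelCluster limit stops new compute resources. The names here (add_compute_resources, candidate_crs, total_crs) are assumptions for illustration, not the repository's actual code:

import logging

logger = logging.getLogger(__name__)

MAX_NUMBER_OF_COMPUTE_RESOURCES = 50  # ParallelCluster's per-cluster CR limit

def add_compute_resources(parallel_cluster_queue, candidate_crs, total_crs):
    # Hypothetical sketch: with the "reduce number of CRs" skip commented
    # out, every candidate CR is added unless the global limit is reached.
    for compute_resource_name, instance_types in candidate_crs.items():
        if total_crs >= MAX_NUMBER_OF_COMPUTE_RESOURCES:
            logger.warning(f"Skipping {compute_resource_name}: cluster already has {MAX_NUMBER_OF_COMPUTE_RESOURCES} CRs.")
            continue
        parallel_cluster_queue['ComputeResources'][compute_resource_name] = {'InstanceTypes': instance_types}
        total_crs += 1
    return total_crs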

I want to be able to have machines with the same core count but less memory - no need to pay for more than I need.

@cartalla cartalla self-assigned this Apr 24, 2024
cartalla added a commit that referenced this issue May 17, 2024
I was previously only allowing 1 memory size/core count combination to keep
the number of compute resources down and also was combining multiple instance
types in one compute resource if possible.

This led to people not being able to configure the exact instance types they
wanted.

So, I've reverted to my original strategy of 1 instance type per compute resource and 1 CR per queue.
The compute resources can be combined into any queues that the user wants using
custom slurm settings.

I had to exclude instance types in the default configuration in order to keep from exceeding the PC limits.

Resolves #220
@cartalla
Contributor

I was trying to configure as many instance types as allowed by ParallelCluster's limits, but in retrospect I should really leave this up to the user to configure.

I've changed the code to create just 1 instance type per CR and 1 CR per queue/partition.
This should allow you to pick and choose which instance types you want.
It is now an error if you configure too many instance types; you must either remove included instances or exclude instances until you are under the ParallelCluster limit of 50.
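
In other words, the validation is now fail-fast. A hedged sketch of that behavior, with hypothetical names (the actual check lives in source/cdk/cdk_slurm_stack.py):

PARALLEL_CLUSTER_MAX_COMPUTE_RESOURCES = 50  # per-cluster limit

def check_instance_type_count(included_instance_types):
    # With 1 instance type per CR and 1 CR per queue, the number of
    # configured instance types equals the number of CRs and queues.
    count = len(included_instance_types)
    if count > PARALLEL_CLUSTER_MAX_COMPUTE_RESOURCES:
        raise ValueError(
            f"{count} instance types configured, but ParallelCluster allows at most "
            f"{PARALLEL_CLUSTER_MAX_COMPUTE_RESOURCES} compute resources per cluster. "
            "Remove included instance types or add excludes.")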

@cartalla cartalla linked a pull request May 17, 2024 that will close this issue
cartalla added a commit that referenced this issue May 21, 2024
I was previously only allowing 1 memory size/core count combination to keep
the number of compute resources down and also was combining multiple instance
types in one compute resource if possible.
This was to try to maximize the number of instance types that were configured.

This led to people not being able to configure the exact instance types they
wanted.
The preference is to notify the user and let them choose which instance types
to exclude or to reduce the number of included types.

So, I've reverted to my original strategy of 1 instance type per compute resource and 1 CR per queue.
The compute resources can be combined into any queues that the user wants using
custom slurm settings.

I had to exclude instance types in the default configuration in order to keep from exceeding the PC limits.

Resolves #220
cartalla added a commit that referenced this issue May 23, 2024
I was previously only allowing 1 memory size/core count combination to keep
the number of compute resources down and also was combining multiple instance
types in one compute resource if possible.
This was to try to maximize the number of instance types that were configured.

This led to people not being able to configure the exact instance types they
wanted.
The preference is to notify the user and let them choose which instance types
to exclude or to reduce the number of included types.

So, I've reverted to my original strategy of 1 instance type per compute resource and 1 CR per queue.
The compute resources can be combined into any queues that the user wants using
custom slurm settings.

I had to exclude instance types in the default configuration in order to keep from exceeding the PC limits.

Resolves #220

Update ParallelCluster version in config files and docs.

Clean up security scan.