Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 85 - Remove S3 depencies #89

Merged
merged 6 commits into from
May 28, 2024
Merged

Conversation

RajeshRamchander
Copy link
Contributor

Issue #85

Replaced S3 dependencies with local file system. Made it configurable. It is also backward compatible with the existing configuration files.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

"logger.info(f\"jsonl_files={jsonl_files}\")\n",
"# Read and concatenate only the .jsonl files\n",
"df = pd.concat([pd.read_json(io.BytesIO(s3_client.get_object(Bucket=config['s3_read_data']['read_bucket'], Key=file_key)['Body'].read()), lines=True) \n",
"# df = pd.concat([pd.read_json(io.BytesIO(s3_client.get_object(Bucket=config['s3_read_data']['read_bucket'], Key=file_key)['Body'].read()), lines=True) \n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove commented out code (same for other places as well).

@@ -7,7 +7,11 @@ aws:
# AWS region, this parameter is templatized, no need to change
region: {region}
# SageMaker execution role used to run FMBench, this parameter is templatized, no need to change
sagemaker_execution_role: {role_arn}
sagemaker_execution_role: arn:aws:iam::266000547366:role/fmbench-us-east-1-role
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be the {role_arn}, we can programatically determine it by doing get_caller_identity from the boto3 sts client and that would work for both the SageMaker Notebook case and the EC2 case.

# Use S3 only, local file system only, or both (values are s3, local or both)
# If set to local or both, set the local_file_system_path parameter
s3_and_or_local_file_system: local
local_file_system_path: /Users/rajramch/WorkDocsDownloads/workspace/aws/services/sagemaker/fmbench/sagemaker-fmbench-write
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should parameterize this so maybe something like {temp_dir} which gets replaced with tempfile.gettempdir() when the config is read (similar to how we handle the read_bucket and other parameterized config file variable).

# Use S3 only or local file system only (values are s3 or local)
# If set to local, set the local_file_system_path parameter
s3_or_local_file_system: local
local_file_system_path: /Users/rajramch/WorkDocsDownloads/workspace/aws/services/sagemaker/fmbench/sagemaker-fmbench-read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above.

@@ -13,11 +13,11 @@
from typing import Dict, List
from transformers import AutoTokenizer
from botocore.exceptions import NoCredentialsError
import shutil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this above to its appropriate place so that we have the imports in an increasing order of length, just a stylistic preference.

@@ -168,17 +168,68 @@ def process_item(item, prompt_template_keys: List, prompt_fmt: str) -> Dict:
def nt_to_posix(p: str) -> str:
return p.replace("\\", "/")

def is_read_local() -> str:
is_read_local = globals.config.get('s3_read_data').get('s3_or_local_file_system')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested tweak:

mode = globals.config.get('s3_read_data').get('s3_or_local_file_system', globals.FS_S3)
return mode == globals.FS_LOCAL

and we can define FS_LOCAL and FS_S3 in globals.py.

return local_read_path

def _is_write_local_only():
is_write_local_only = globals.config.get('aws').get('s3_and_or_local_file_system')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same recommendation as above.

@aarora79 aarora79 merged commit 1c9f20f into aws-samples:main May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants