Describe the bug
I was using wr.s3.download on a 2 GiB memory VM and noticed that when I download a 1006 MiB GZIP file from S3 it allocates ~2295 MiB, both with and without the use_threads parameter. This was measured with a memory profiler.
As a result, my script fails with an OOM error on the 2 GiB machine with 2 CPUs. dmesg gives a slightly different memory estimate:
$ dmesg | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0
It turns out that wr.s3.download by default uses botocore's s3.get_object and reads the whole response into memory (aws-sdk-pandas/awswrangler/s3/_fs.py, lines 65 to 75 in 7e83b89):
resp = _utils.try_it(
    f=s3_client.get_object,
    ex=_S3_RETRYABLE_ERRORS,
    base=0.5,
    max_num_tries=6,
    Bucket=bucket,
    Key=key,
    Range=f"bytes={start}-{end - 1}",
    **boto3_kwargs,
)
return start, resp["Body"].read()
Is it possible to chunk the reading of the botocore response in awswrangler to make it more memory efficient?
For instance, using the following snippet I downloaded the same file without any issues on the same machine:
import boto3

s3 = boto3.client("s3")
raw_stream = s3.get_object(**kwargs)["Body"]  # kwargs: at least Bucket and Key
with open("test_botocore_iter_chunks.gz", "wb") as f:
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b""):
        f.write(chunk)
I tried the wr.config.s3_block_size parameter, expecting it to chunk the response, but it does not help. After setting s3_block_size to a value smaller than the file size, you fall into this if condition (aws-sdk-pandas/awswrangler/s3/_fs.py, line 326 in 7e83b89):
if end - start >= self._s3_block_size:  # Fetching length greater than cache length
which again reads the whole response into memory.
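For reference, this is roughly how the block size was set before calling the download (a minimal sketch; the 8 MiB value and the S3 path are placeholders, and any block size below the file size hits the same branch):

import awswrangler as wr

# Placeholder paths for illustration
path = "s3://my-bucket/large-file.gz"
local_file = "large-file.gz"

# Any block size smaller than the object still ends up in the branch above,
# so the whole range is fetched with a single read() call
wr.config.s3_block_size = 8 * 1024 * 1024
wr.s3.download(path=path, local_file=local_file)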
How to Reproduce
Use a memory profiler on:
wr.s3.download(path, local_file)
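A minimal sketch of the measurement, assuming the memory_profiler package (the S3 path and local file name are placeholders):

import awswrangler as wr
from memory_profiler import memory_usage

# Placeholder paths; use a ~1 GiB object to see the effect
path = "s3://my-bucket/large-file.gz"
local_file = "large-file.gz"

# memory_usage samples RSS (in MiB) while the callable runs
samples = memory_usage((wr.s3.download, (), {"path": path, "local_file": local_file}))
print(f"Peak memory: {max(samples):.0f} MiB")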
Expected behavior
Please let me know if it is already possible to read the response in chunks.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.6.9 -- this is old, but I can double-check on newer versions
AWS SDK for pandas version
2.14.0
Additional context
No response