Created by: davides
Patch Description
This change decouples metaseq from iopath as follows:
- Copy the main classes into the local file_io package
- Copy the associated unit tests
- Drop functionality not needed in metaseq: TabularIO, telemetry
- Update setup.py: portalocker was previously a transitive dependency, and we still need it for metaseq.file_io.common.file_lock
- Pull forward the pending AzureBlobPathHandler from https://github.com/facebookresearch/iopath/pull/17
- Some controls have been put in place so that read/write operations use a known amount of memory when dealing with larger files:
  - _open("wb", buffering=<buffer-size>) will buffer up to the requested amount of data in memory before flushing it to the service with the PutBlock operation
  - _open("rb", buffering=<buffer-size>) will use the Blob client's chunk iterator to only download a fixed amount of data at a time
  - _close() in write-mode will flush any buffered data with one more PutBlock, and finalize the blob with PutBlockList
  - The block-based approach should work for both block blobs and append blobs (see the Azure docs).
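For reviewers unfamiliar with the block-based flow, here is a minimal sketch of the buffering idea described above. It is not the code in this patch: `FakeBlobClient`, `BufferedBlockWriter`, and the method names `put_block`, `put_block_list`, and `chunks` are hypothetical stand-ins that mirror the PutBlock/PutBlockList operations and the chunk iterator, not the Azure SDK's API. The block-ID scheme (base64 of a zero-padded index) is inferred from the IDs that appear in the test log.

```python
import base64
from typing import Dict, Iterator, List


class FakeBlobClient:
    """In-memory stand-in for a blob client (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self.staged: Dict[str, bytes] = {}  # uploaded but not yet committed
        self.content = b""                  # committed blob content

    def put_block(self, block_id: str, data: bytes) -> None:
        self.staged[block_id] = data

    def put_block_list(self, block_ids: List[str]) -> None:
        # Finalize the blob as the concatenation of the listed blocks.
        self.content = b"".join(self.staged[b] for b in block_ids)

    def chunks(self, chunk_size: int) -> Iterator[bytes]:
        # Yield the committed content one fixed-size chunk at a time.
        for start in range(0, len(self.content), chunk_size):
            yield self.content[start:start + chunk_size]


class BufferedBlockWriter:
    """Buffers writes and uploads one block each time the buffer fills up."""

    def __init__(self, client: FakeBlobClient, buffering: int) -> None:
        if buffering <= 0:
            raise ValueError("buffering must be positive")
        self._client = client
        self._buffering = buffering
        self._buffer = bytearray()
        self.block_ids: List[str] = []

    def write(self, data: bytes) -> int:
        self._buffer.extend(data)
        # Drain full blocks; at most `buffering - 1` bytes stay buffered.
        while len(self._buffer) >= self._buffering:
            self._flush(self._buffering)
        return len(data)

    def _flush(self, size: int) -> None:
        block = bytes(self._buffer[:size])
        del self._buffer[:size]
        # Assumed ID scheme: base64 of the 5-digit, zero-padded block index.
        index = f"{len(self.block_ids):05d}".encode("ascii")
        block_id = base64.b64encode(index).decode("ascii")
        self._client.put_block(block_id, block)
        self.block_ids.append(block_id)

    def close(self) -> None:
        # One more block for any leftover data, then commit the block list.
        if self._buffer:
            self._flush(len(self._buffer))
        self._client.put_block_list(self.block_ids)


client = FakeBlobClient()
writer = BufferedBlockWriter(client, buffering=4096)
writer.write(b"x" * 10000)
writer.close()

# Read side: consume the blob one fixed-size chunk at a time.
chunk_lengths = [len(chunk) for chunk in client.chunks(4096)]
```

In this sketch a 10000-byte write with a 4096-byte buffer produces two full blocks during `write()` and one final partial block on `close()`. Peak memory in the writer is bounded by the buffer size plus the size of a single `write()` call, which is the property the bullets above are after.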
Testing steps
$ python -m unittest discover tests/file_io/
.10/21/2022 12:23:17 PM Caching az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin ...
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin
10/21/2022 12:23:17 PM Caching az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin ...
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file2.bin
10/21/2022 12:23:17 PM URL az://lrsstoragewest3/data/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin was already cached in /tmp/tmpy3w003kj/blob_cache/temp/unittest/68b4fa21-491a-40fc-b048-9d18d528d00b/testdir/file1.bin
.10/21/2022 12:23:17 PM Opening blob: path=az://lrsstoragewest3/data/temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin, mode=rb
10/21/2022 12:23:17 PM Read next chunk: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin, length=4096
10/21/2022 12:23:17 PM Opening blob: path=az://lrsstoragewest3/data/temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, mode=wb
10/21/2022 12:23:17 PM Uploading a new block: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, block_id=MDAwMDA=, idx=0, length=4096
10/21/2022 12:23:18 PM Committing blocks: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, count=1
10/21/2022 12:23:18 PM Uploading a new block: blob_name=temp/unittest/d1e6bc10-0af5-4d93-9a37-221222415b77/testdir/file1.bin.streamed, block_id=MDAwMDE=, idx=1, length=4096
....10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
10/21/2022 12:23:19 PM [PathManager] foo=foo argument ignored
..............................sssssssssssssss
----------------------------------------------------------------------
Ran 51 tests in 2.752s
OK (skipped=15)
(The skipped tests are for S3PathHandler, which was unchanged; it was just moved verbatim from metaseq/s3_utils.py to metaseq/file_io/s3.py.)