Problem downloading Hugging Face datasets when using Ubuntu home-directory encryption (ecryptfs)
While setting up my seminar environment, I noticed a problem when downloading the acronym_identification dataset. See line 36 in 03_datasets.ipynb:
acronym_dataset = datasets.load_dataset('acronym_identification')
# OSError: [Errno 36] File name too long: '/home/<user>/.cache/huggingface/datasets/_home_<user>_.cache_huggingface_datasets_acronym_identification_acronym_identification_1.0.0_e84facf8db848a4c7aa58addbebaf8a161c4146ca367e923ca972673cc915425.lock'
A filename of this length is normally well within the filesystem limit of 255 bytes, but after some research I found that ecryptfs encryption can reduce the maximum usable filename length (to roughly 143 characters, since the encrypted name is longer than the plaintext one); see https://stackoverflow.com/questions/34503540/why-does-python-give-oserror-errno-36-file-name-too-long-for-filename-short.
I will continue to look into this issue and update this thread once I have found a solution or workaround.
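To see whether a directory is affected, one way is to compare the filename limit the filesystem reports with the limit found by actually creating files. The helper below is a minimal sketch (the function name probe_max_filename_length is my own, not part of any library); the empirical probe is useful because ecryptfs mounts may report a NAME_MAX of 255 while refusing plaintext names longer than about 143 characters.

```python
import os
import tempfile

# What the filesystem claims the per-component filename limit is.
# On plain ext4 this is typically 255 bytes.
home = os.path.expanduser("~")
print("Reported NAME_MAX for", home, "is", os.pathconf(home, "PC_NAME_MAX"))


def probe_max_filename_length(directory, upper=255):
    """Find the longest filename that can actually be created in `directory`
    by trying progressively shorter names until one succeeds."""
    for length in range(upper, 0, -1):
        path = os.path.join(directory, "a" * length)
        try:
            open(path, "w").close()
            os.remove(path)
            return length
        except OSError:
            continue
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("Empirical limit:", probe_max_filename_length(d))
```

On an unencrypted filesystem both numbers should agree; on an ecryptfs-encrypted home directory the empirical limit is expected to come out noticeably lower than the reported one.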
Update
I have found the following workaround, which requires patching the datasets library itself. Within the installed library (in my case /home/<user>/.local/share/data-engineering-analytics-notebooks-1QZelZ3p/lib/python3.9/site-packages/datasets/utils/filelock.py), change line 135 so that max_filename_length defaults to a lower value:
class BaseFileLock:
    """
    Implements the base class of a file lock.
    """

    def __init__(self, lock_file, timeout=-1, max_filename_length=100):
        """ """
        # Hash the filename if it's too long
        lock_file = self.hash_filename_if_too_long(lock_file, max_filename_length)
        # The path to the lock file.
        self._lock_file = lock_file
        ...
Originally, max_filename_length defaults to 255. Note that I do not know whether the library breaks elsewhere because of this change, so use it at your own discretion.
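For anyone curious what the hash-if-too-long step does conceptually, here is a minimal standalone sketch of the technique: if the lock file's basename exceeds the limit, it is replaced with a fixed-length hash while keeping the .lock suffix. This is an illustration of the idea only, not the library's exact implementation (the function name shorten_lock_filename is my own).

```python
import hashlib
import os


def shorten_lock_filename(lock_file, max_filename_length=100):
    """Return `lock_file` unchanged if its basename fits within the limit,
    otherwise replace the basename with a SHA-256 hex digest plus '.lock'."""
    dirname, basename = os.path.split(lock_file)
    if len(basename) <= max_filename_length:
        return lock_file
    # Hash the overlong basename; the digest is always 64 hex characters,
    # so the resulting name (64 + len('.lock') = 69) stays under the limit.
    digest = hashlib.sha256(basename.encode("utf-8")).hexdigest()
    return os.path.join(dirname, digest + ".lock")


long_name = "_home_user_.cache_huggingface_datasets_" + "x" * 150 + ".lock"
short = shorten_lock_filename(os.path.join("/tmp", long_name))
```

Because the hash is deterministic, concurrent processes still agree on the same lock file for the same dataset path, which is why lowering max_filename_length is plausible as a workaround.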