
Problem downloading Hugging Face datasets when using Ubuntu's ecryptfs home directory encryption

While setting up my seminar environment, I noticed a problem when downloading the acronym identification dataset. See line 36 in 03_datasets.ipynb:

acronym_dataset = datasets.load_dataset('acronym_identification')
# OSError: [Errno 36] File name too long: '/home/<user>/.cache/huggingface/datasets/_home_<user>_.cache_huggingface_datasets_acronym_identification_acronym_identification_1.0.0_e84facf8db848a4c7aa58addbebaf8a161c4146ca367e923ca972673cc915425.lock'

This filename is normally well under the limit, but after some research I found that ecryptfs encryption can reduce the maximum filename length the filesystem accepts (see https://stackoverflow.com/questions/34503540/why-does-python-give-oserror-errno-36-file-name-too-long-for-filename-short).
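
To confirm that the filesystem is the culprit, one can try to create a file with a long name directly in the cache directory. Here is a minimal sketch (the cache path is taken from the error above; 200 characters is above ecryptfs's effective limit, commonly around 143 with filename encryption, but below the usual 255):

import errno
import os

cache_dir = os.path.expanduser('~/.cache/huggingface/datasets')
probe = os.path.join(cache_dir, 'x' * 200)  # longer than ecryptfs allows, shorter than 255

try:
    open(probe, 'w').close()
    os.remove(probe)
    print('200-character filenames work here')
except OSError as e:
    if e.errno == errno.ENAMETOOLONG:  # Errno 36, as in the traceback above
        print('filesystem rejects long filenames:', e)
    else:
        raise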

I will continue to look into this issue and update this thread when I have found a solution / workaround.

Update

I have found the following workaround, which requires changing some code in the datasets library:

Within the datasets library (in my case in /home/<user>/.local/share/data-engineering-analytics-notebooks-1QZelZ3p/lib/python3.9/site-packages/datasets/utils/filelock.py), change line 135 so that max_filename_length defaults to a lower value:

class BaseFileLock:
    """
    Implements the base class of a file lock.
    """

    def __init__(self, lock_file, timeout=-1, max_filename_length=100):  # changed from the default of 255
        """ """
        # Hash the filename if it's too long
        lock_file = self.hash_filename_if_too_long(lock_file, max_filename_length)
        # The path to the lock file.
        self._lock_file = lock_file
        ...

Originally, max_filename_length is set to 255. Note that I do not know whether this change breaks the library elsewhere; use it at your own discretion.
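
If you would rather not edit files inside site-packages, the same effect can be approximated by overriding the default at runtime, before any dataset is loaded. This is a minimal, untested sketch, assuming the installed datasets version defines BaseFileLock with the hash_filename_if_too_long behavior shown above; the name _patched_init is my own:

import datasets.utils.filelock as filelock

_original_init = filelock.BaseFileLock.__init__

def _patched_init(self, lock_file, timeout=-1, max_filename_length=255):
    # Ignore the caller's limit and force 100, mirroring the manual edit above,
    # so that overly long lock file names get hashed before hitting ecryptfs.
    _original_init(self, lock_file, timeout=timeout, max_filename_length=100)

filelock.BaseFileLock.__init__ = _patched_init

import datasets
acronym_dataset = datasets.load_dataset('acronym_identification')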
