Hey, I am new to Hugging Face and I got an error in this piece of code:
data_files = {
    "train": "/content/drive/MyDrive/fine_tuning_medroberta/train.txt",
    "validation": "/content/drive/MyDrive/fine_tuning_medroberta/dev.txt",
    "test": "/content/drive/MyDrive/fine_tuning_medroberta/test.txt"
}
dataset = load_dataset("text", data_files=data_files, split={
    "train": "train",
    "validation": "validation",
    "test": "test"
})
I get the following error:
NotImplementedError                       Traceback (most recent call last)
in <cell line: 0>()
      5 }
      6
----> 7 dataset = load_dataset("text", data_files=data_files, split={
      8     "train": "train",
      9     "validation": "validation",

1 frames
/usr/local/lib/python3.11/dist-packages/datasets/builder.py in as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1171         is_local = not is_remote_filesystem(self._fs)
   1172         if not is_local:
-> 1173             raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
   1174         if not os.path.exists(self._output_dir):
   1175             raise FileNotFoundError(

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
Can anyone help?
I found several similar cases.
Hi,
I have a Streamlit app that was working fine; it basically loads a dataset from HF (a CSV-based dataset), but now the app crashes and shows the following error: "NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported." I have no idea what is causing this error.
Any ideas on how to solve this error?
Thank you.
GitHub issue (opened 08 May 2024):
### Describe the bug
With Python 3.11 I execute:
```py
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from torch import nn
from datasets import load_dataset, concatenate_datasets
# load demo audio and set processor
dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")
```
This fails on the last line with:
```log
Found cached dataset librispeech_asr (file:///Users/as/Documents/Project/git/audio2vec/cache/librispeech_asr/clean-data_dir=data/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)
Traceback (most recent call last):
File "/Users/as/Documents/Project/git/audio2vec/src/music2vec-v1.py", line 7, in <module>
dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 1810, in load_dataset
ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1113, in as_dataset
raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
```
### Steps to reproduce the bug
I set up a venv with this requirements.txt:
```txt
transformers==4.40.2
torch==2.2.2
datasets==2.16.0
fsspec==2023.9.2
```
pip freeze is:
```
aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.16.0
dill==0.3.7
filelock==3.14.0
frozenlist==1.4.1
fsspec==2023.9.2
huggingface-hub==0.23.0
idna==3.7
Jinja2==3.1.4
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.3
numpy==1.26.4
packaging==24.0
pandas==2.2.2
pyarrow==16.0.0
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.4.28
requests==2.31.0
safetensors==0.4.3
six==1.16.0
sympy==1.12
tokenizers==0.19.1
torch==2.2.2
tqdm==4.66.4
transformers==4.40.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4
```
I execute this on an M1 Mac.
### Expected behavior
I don't understand the error message. Why is "local" caching not supported? Would it be possible to give some additional hint in the error message on how to solve this issue?
### Environment info
source ....
python -u example.py
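In both reports above the failure comes out of the cached-dataset path in `datasets/builder.py`, and this particular message is usually tied to a mismatch between the installed `datasets` and `fsspec` versions. Before digging further it is worth printing what is actually installed; a minimal diagnostic sketch (not taken from the issues above, just standard library calls):

```python
import importlib.metadata as md

# Show the versions of the packages involved in the cached-dataset code path.
for pkg in ("datasets", "fsspec", "huggingface_hub"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "is not installed")
```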
GitHub issue (opened 22 Sep 2021, labeled "bug"):
## Describe the bug
Cache problem in the `load_dataset` method: when modifying a compressed file in a local folder, `load_dataset` doesn't detect the change and loads the previous version.
## Steps to reproduce the bug
To test it directly, I have prepared a [Google Colaboratory notebook](https://colab.research.google.com/drive/11Em_Amoc-aPGhSBIkSHU2AvEh24nVayy?usp=sharing) that shows this behavior.
For this example, I have created a toy dataset at: https://sptcsesxqb.proxynodejs.usequeue.com/datasets/SaulLu/toy_struc_dataset
This dataset is composed of two versions:
- v1 on commit `a6beb46` which has a single example `{'id': 1, 'value': {'tag': 'a', 'value': 1}}` in file `train.jsonl.gz`
- v2 on commit `e7935f4` (`main` head) which has a single example `{'attr': 1, 'id': 1, 'value': 'a'}` in file `train.jsonl.gz`
With a terminal, we can start by getting the v1 version of the dataset:
```bash
git lfs install
git clone https://sptcsesxqb.proxynodejs.usequeue.com/datasets/SaulLu/toy_struc_dataset
cd toy_struc_dataset
git checkout a6beb46
```
Then we can load it with Python and look at the content:
```python
from datasets import load_dataset
path = "/content/toy_struc_dataset"
dataset = load_dataset(path, data_files={"train": "*.jsonl.gz"})
print(dataset["train"][0])
```
Output
```
{'id': 1, 'value': {'tag': 'a', 'value': 1}} # This is the example in v1
```
With a terminal, we can now switch to the v2 version of the dataset:
```bash
git checkout main
```
Then we can load it with Python and look at the content:
```python
from datasets import load_dataset
path = "/content/toy_struc_dataset"
dataset = load_dataset(path, data_files={"train": "*.jsonl.gz"})
print(dataset["train"][0])
```
Output
```
{'id': 1, 'value': {'tag': 'a', 'value': 1}} # This is the example in v1 (not v2)
```
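One way to confirm that the stale Arrow cache is being reused here (not part of the original report, just a quick diagnostic sketch) is to look at `Dataset.cache_files`, which lists the Arrow files backing each split:

```python
# The listed Arrow file is still the one built from the v1 data, which is
# why the v1 example comes back even after checking out the v2 commit.
print(dataset["train"].cache_files)
```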
## Expected results
The last output should have been
```
{"id":1, "value": "a", "attr": 1} # This is the example in v2
```
## Ideas
As discussed offline with Quentin, if the cache hash were sensitive to changes in a compressed file, we would probably not have this problem anymore.
This situation leads me to suggest 2 other features:
- to also have a `load_from_cache_file` argument in the `load_dataset` method
- to reorganize the cache so that we can delete the caches related to a dataset (cf issue #ToBeFilledSoon)
And thanks again for this great library :hugs:
## Environment info
- `datasets` version: 1.12.1
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.12
- PyArrow version: 3.0.0
Try like this:
dataset = load_dataset("text", data_files=data_files, download_mode="force_redownload", split={
    "train": "train",
    "validation": "validation",
    "test": "test"
})
and just in case:
pip install -U datasets
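For what it's worth, the `split` mapping isn't needed at all here: without a `split` argument, `load_dataset` already returns a `DatasetDict` keyed by the names used in `data_files`, and `download_mode="force_redownload"` simply tells `datasets` to ignore the existing cache and rebuild the splits from the text files. A minimal sketch reusing the paths from the first post:

```python
from datasets import load_dataset

data_files = {
    "train": "/content/drive/MyDrive/fine_tuning_medroberta/train.txt",
    "validation": "/content/drive/MyDrive/fine_tuning_medroberta/dev.txt",
    "test": "/content/drive/MyDrive/fine_tuning_medroberta/test.txt",
}

# Without a `split` argument, load_dataset returns a DatasetDict
# keyed by the names used in data_files.
dataset = load_dataset("text", data_files=data_files)

train_ds = dataset["train"]
val_ds = dataset["validation"]
test_ds = dataset["test"]
```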
The `pip install -U datasets` worked, thank you so much!!!