Hey, I am new to Hugging Face and I got an error in this piece of code:
data_files = {
    "train": "/content/drive/MyDrive/fine_tuning_medroberta/train.txt",
    "validation": "/content/drive/MyDrive/fine_tuning_medroberta/dev.txt",
    "test": "/content/drive/MyDrive/fine_tuning_medroberta/test.txt"
}
dataset = load_dataset("text", data_files=data_files, split={
    "train": "train",
    "validation": "validation",
    "test": "test"
})
I get the following error:
NotImplementedError                       Traceback (most recent call last)
in <cell line: 0>()
      5 }
      6
----> 7 dataset = load_dataset("text", data_files=data_files, split={
      8     "train": "train",
      9     "validation": "validation",

1 frames
/usr/local/lib/python3.11/dist-packages/datasets/builder.py in as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
   1171         is_local = not is_remote_filesystem(self._fs)
   1172         if not is_local:
-> 1173             raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
   1174         if not os.path.exists(self._output_dir):
   1175             raise FileNotFoundError(

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
Can anyone help?
I found several similar cases.
Hi,
I have a Streamlit app that was working fine; it basically loads a dataset from HF (a CSV-based dataset), but now the app crashes and shows the following error: "NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported." I have no idea what is causing this error.
Any ideas on how to solve this error?
Thank you.
GitHub issue (opened 08 May 2024):
### Describe the bug
With Python 3.11 I execute:
```py
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from torch import nn
from datasets import load_dataset, concatenate_datasets
# load demo audio and set processor
dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")
```
This fails on the last line with:
```log
Found cached dataset librispeech_asr (file:///Users/as/Documents/Project/git/audio2vec/cache/librispeech_asr/clean-data_dir=data/2.1.0/cff5df6e7955c80a67f80e27e7e655de71c689e2d2364bece785b972acb37fe7)
Traceback (most recent call last):
File "/Users/as/Documents/Project/git/audio2vec/src/music2vec-v1.py", line 7, in <module>
dataset_clean = load_dataset("librispeech_asr", "clean", split="validation", data_dir="data", cache_dir="cache")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 1810, in load_dataset
ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/as/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1113, in as_dataset
raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
```
### Steps to reproduce the bug
I set up a venv with this requirements.txt:
```txt
transformers==4.40.2
torch==2.2.2
datasets==2.16.0
fsspec==2023.9.2
```
pip freeze is:
```
aiohttp==3.9.5
aiosignal==1.3.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
datasets==2.16.0
dill==0.3.7
filelock==3.14.0
frozenlist==1.4.1
fsspec==2023.9.2
huggingface-hub==0.23.0
idna==3.7
Jinja2==3.1.4
MarkupSafe==2.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.3
numpy==1.26.4
packaging==24.0
pandas==2.2.2
pyarrow==16.0.0
pyarrow-hotfix==0.6
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.4.28
requests==2.31.0
safetensors==0.4.3
six==1.16.0
sympy==1.12
tokenizers==0.19.1
torch==2.2.2
tqdm==4.66.4
transformers==4.40.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
xxhash==3.4.1
yarl==1.9.4
```
I execute this on an M1 Mac.
### Expected behavior
I don't understand the error message. Why is "local" caching not supported? Would it be possible to give some additional hint in the error message on how to solve this issue?
### Environment info
source ....
python -u example.py
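In both reports above the failure comes out of the cached-dataset path in `datasets/builder.py`, and this particular message is usually tied to a mismatch between the installed `datasets` and `fsspec` versions. Before digging further it is worth printing what is actually installed; a minimal diagnostic sketch (not taken from the issues above, just standard library calls):

```python
import importlib.metadata as md

# Show the versions of the packages involved in the cached-dataset code path.
for pkg in ("datasets", "fsspec", "huggingface_hub"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "is not installed")
```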
GitHub issue (opened 22 Sep 2021, labeled "bug"):
## Describe the bug
Cache problem in the `load_dataset` method: when modifying a compressed file in a local folder, `load_dataset` doesn't detect the change and loads the previous version.
## Steps to reproduce the bug
To test it directly, I have prepared a [Google Colaboratory notebook](https://colab.research.google.com/drive/11Em_Amoc-aPGhSBIkSHU2AvEh24nVayy?usp=sharing) that shows this behavior.
For this example, I have created a toy dataset at: https://sptcsesxqb.proxynodejs.usequeue.com/datasets/SaulLu/toy_struc_dataset
This dataset is composed of two versions:
- v1 on commit `a6beb46` which has a single example `{'id': 1, 'value': {'tag': 'a', 'value': 1}}` in file `train.jsonl.gz`
- v2 on commit `e7935f4` (`main` head) which has a single example `{'attr': 1, 'id': 1, 'value': 'a'}` in file `train.jsonl.gz`
With a terminal, we can start by getting the v1 version of the dataset:
```bash
git lfs install
git clone https://sptcsesxqb.proxynodejs.usequeue.com/datasets/SaulLu/toy_struc_dataset
cd toy_struc_dataset
git checkout a6beb46
```
Then we can load it with Python and look at the content:
```python
from datasets import load_dataset
path = "/content/toy_struc_dataset"
dataset = load_dataset(path, data_files={"train": "*.jsonl.gz"})
print(dataset["train"][0])
```
Output
```
{'id': 1, 'value': {'tag': 'a', 'value': 1}} # This is the example in v1
```
With a terminal, we can now switch to the v2 version of the dataset:
```bash
git checkout main
```
Then we can load it with Python and look at the content:
```python
from datasets import load_dataset
path = "/content/toy_struc_dataset"
dataset = load_dataset(path, data_files={"train": "*.jsonl.gz"})
print(dataset["train"][0])
```
Output
```
{'id': 1, 'value': {'tag': 'a', 'value': 1}} # This is the example in v1 (not v2)
```
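One way to confirm that the stale Arrow cache is being reused here (not part of the original report, just a quick diagnostic sketch) is to look at `Dataset.cache_files`, which lists the Arrow files backing each split:

```python
# The listed Arrow file is still the one built from the v1 data, which is
# why the v1 example comes back even after checking out the v2 commit.
print(dataset["train"].cache_files)
```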
## Expected results
The last output should have been
```
{"id":1, "value": "a", "attr": 1} # This is the example in v2
```
## Ideas
As discussed offline with Quentin, if the cache hash were sensitive to changes in a compressed file, we would probably not have this problem anymore.
This situation leads me to suggest 2 other features:
- to also have a `load_from_cache_file` argument in the `load_dataset` method
- to reorganize the cache so that we can delete the caches related to a dataset (cf issue #ToBeFilledSoon)
And thanks again for this great library :hugs:
## Environment info
- `datasets` version: 1.12.1
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.12
- PyArrow version: 3.0.0
Try like this:
dataset = load_dataset("text", data_files=data_files, download_mode="force_redownload", split={
    "train": "train",
    "validation": "validation",
    "test": "test"
})
and just in case:
pip install -U datasets
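For what it's worth, the `split` mapping isn't needed at all here: without a `split` argument, `load_dataset` already returns a `DatasetDict` keyed by the names used in `data_files`, and `download_mode="force_redownload"` simply tells `datasets` to ignore the existing cache and rebuild the splits from the text files. A minimal sketch reusing the paths from the first post:

```python
from datasets import load_dataset

data_files = {
    "train": "/content/drive/MyDrive/fine_tuning_medroberta/train.txt",
    "validation": "/content/drive/MyDrive/fine_tuning_medroberta/dev.txt",
    "test": "/content/drive/MyDrive/fine_tuning_medroberta/test.txt",
}

# Without a `split` argument, load_dataset returns a DatasetDict
# keyed by the names used in data_files.
dataset = load_dataset("text", data_files=data_files)

train_ds = dataset["train"]
val_ds = dataset["validation"]
test_ds = dataset["test"]
```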
The `pip install -U datasets` worked, thank you so much!!!