TensorFlow Datasets

github.com

126 points by markerbrod 3 years ago · 13 comments

jamesblonde 3 years ago

I actually prefer Hugging Face Datasets - there are 16k+ of them today: https://huggingface.co/datasets

  • soraki_soladead 3 years ago

    Quantity of datasets doesn’t seem like the right metric. The library just needs the datasets you care about, and both libraries have the popular ones. What’s more important is integration: if you’re training custom TF models, then tfds will generally integrate more smoothly than huggingface.

    • albertzeyer 3 years ago

      I tried Librispeech, a very common dataset for speech recognition, in both HF and TFDS.

      TFDS performed extremely badly.

      First, it failed because the official hosting server only allows 5 simultaneous connections; TFDS totally ignored that and made up to 50 simultaneous downloads, which broke. I wonder if anyone actually tested this?

      Then you need a computer with 30GB to do the preparation, which might fail on yours. This is where I stopped. https://github.com/tensorflow/datasets/issues/3887. It might be fixed now, but it took them 8 months to respond to my issue.

      On HF, it just worked. There was a smaller issue with how the dataset was split up, but that is fixed now, and their response was very fast and great.

    • jachian 3 years ago

      As well as discoverability / searchability: how easy it is to find what you're looking for.
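
On the connection limit albertzeyer describes above (a server that allows 5 simultaneous connections, hit with up to 50 parallel downloads), here is a minimal sketch of capping download concurrency on the client side. This is not how TFDS or HF Datasets actually implement downloads; `download_all`, `make_counting_fetch`, and `MAX_CONNECTIONS` are hypothetical names for illustration, and the "download" is simulated with a short sleep so the example runs without network access.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: the host allows 5 simultaneous connections, as stated in the thread.
MAX_CONNECTIONS = 5


def download_all(urls, fetch, max_connections=MAX_CONNECTIONS):
    """Call fetch(url) for every URL, with at most max_connections in flight.

    The thread pool size is the concurrency cap: no more than
    max_connections calls to fetch can run at the same time.
    Results are returned in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch, urls))


def make_counting_fetch(delay=0.01):
    """Build a fake fetch function that records peak concurrency.

    Hypothetical stand-in for a real HTTP download, used only to
    check that the cap above is respected.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fetch(url):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(delay)  # simulate the time spent downloading
        with lock:
            state["active"] -= 1
        return url

    return fetch, state


if __name__ == "__main__":
    urls = [f"file-{i}.tar.gz" for i in range(50)]  # placeholder names
    fetch, state = make_counting_fetch()
    download_all(urls, fetch)
    print("peak simultaneous downloads:", state["peak"])
```

Swapping the fake `fetch` for a real download function keeps the same cap, since the limit comes from the pool size rather than from anything inside `fetch`.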

xnx 3 years ago

Great resource. My experience has been that any data project is at least 1/3 data collection/preparation, 1/3 using the right tool the right way, and 1/3 asking the right questions and interpreting the outcome.

pj_mukh 3 years ago

Direct link to the list of datasets: https://www.tensorflow.org/datasets/catalog/overview#all_dat...

Would love a Google Photos-style search, especially for the visual datasets.

AyyWS 3 years ago

Kaggle is what they talk about in my industry.

https://www.kaggle.com/datasets

yeldarb 3 years ago

For computer vision, there are 100k+ open source classification, object detection, and segmentation datasets available on Roboflow Universe: https://universe.roboflow.com

  • throwaway20222 3 years ago

    So many of those are tiny datasets - like 30 images that are seemingly of low quality. I love roboflow, but those are really hard to work with. I wish there were a cost-effective open platform for generating datasets.
