LibriSpeech Dataset


Visualize & query the LibriSpeech Dataset in the Deep Lake UI.


What is LibriSpeech Dataset?

LibriSpeech is derived from the LibriVox project and contains approximately 1,000 hours of read English audiobooks, most of which are based on texts from Project Gutenberg. To produce a corpus suitable for training speech recognition systems, the audiobook recordings were automatically aligned and segmented against the corresponding book text, and segments with noisy transcripts were filtered out.

Download LibriSpeech Dataset in Python

Instead of downloading the LibriSpeech dataset manually, you can load it in Python with a single line of code using the open-source Deep Lake package.

Load LibriSpeech Dataset Training Subset in Python

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-train-clean-360')

OR

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-train-clean-100')

Load LibriSpeech Dataset Testing Subset in Python

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-test-clean')

OR

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-test-other')

Load LibriSpeech Dataset Developer Subset in Python

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-dev-clean')

OR

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-dev-other')

LibriSpeech Dataset Structure

LibriSpeech Data Fields
  • file: path to the downloaded audio file in .flac format.
  • audio: dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • text: transcription of the audio file.
  • id: unique id of the data sample.
  • speaker_id: unique id of the speaker; the same speaker id can appear across multiple data samples.
  • chapter_id: id of the audiobook chapter that includes the transcription.
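LibriSpeech transcripts in the text field follow a fixed convention: uppercase letters, no punctuation apart from apostrophes. The sketch below shows how one might normalize free-form reference text into that convention; the helper name is ours, not part of the corpus tooling:

```python
import re

def normalize_transcript(text: str) -> str:
    """Normalize free text toward LibriSpeech-style transcripts:
    uppercase, punctuation removed (apostrophes kept), single spaces."""
    text = text.upper()
    # Drop everything except letters, apostrophes, and spaces.
    text = re.sub(r"[^A-Z' ]+", " ", text)
    # Collapse runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Don't go, Mr. Darcy!"))  # DON'T GO MR DARCY
```

This kind of normalization is useful when comparing model output against LibriSpeech references, since any stray punctuation or casing difference would otherwise count as an error.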
LibriSpeech Data Splits
  • The LibriSpeech training set is separated into a 100-hour (train-clean-100) and a 360-hour (train-clean-360) subset.
  • The LibriSpeech testing set is separated into test-clean and test-other categories.
  • The LibriSpeech developer set is separated into dev-clean and dev-other categories.

The ‘clean’ and ‘other’ labels indicate how challenging the recordings are for automatic speech recognition systems: speakers on whom a baseline recognizer performed well were assigned to ‘clean’, the rest to ‘other’. Each development and test set contains about five hours of audio.
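ASR difficulty of the kind used to assign the clean/other split is typically measured by word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of the standard computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution -> 0.333...
```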

How to use LibriSpeech Dataset with PyTorch and TensorFlow in Python

Train a model on the LibriSpeech dataset with PyTorch in Python

Let’s use Deep Lake’s built-in PyTorch one-line dataloader to connect the data to the compute:

dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)
Train a model on LibriSpeech dataset with TensorFlow in Python
dataloader = ds.tensorflow()

LibriSpeech Dataset Creation

Data Collection and Normalization Information

Using LibriVox’s API, the authors collected information about readers, the audiobook projects they participated in, and the chapters of books they read, for inclusion in the corpus. The LibriVox API and Internet Archive metadata files were used to match audio files with reference-text URLs. For a few audiobooks there was no exact match in Project Gutenberg, so fuzzy matches were allowed.
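Fuzzy title matching of this kind can be illustrated with Python’s standard difflib module; the titles below are made-up examples, not entries from the corpus:

```python
import difflib

# A hypothetical catalog of reference-text titles.
catalog = [
    "Pride and Prejudice",
    "A Tale of Two Cities",
    "The Adventures of Sherlock Holmes",
]

# Find the closest catalog title to a slightly misspelled audiobook title.
matches = difflib.get_close_matches(
    "The Adventures of Sherlock Homes",  # note the typo
    catalog,
    n=1,
    cutoff=0.8,  # require high similarity to avoid spurious matches
)
print(matches)  # ['The Adventures of Sherlock Holmes']
```

A high cutoff keeps the matcher conservative, which matters when a wrong match would pair audio with the text of an entirely different book.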

To ensure there was no speaker overlap between the training, development, and test sets, the authors verified that each recording was unambiguously attributable to a single speaker. The data for each gender was also balanced in terms of speaker diversity.

Additional Information about LibriSpeech Dataset

LibriSpeech Dataset Description

  1. Homepage: https://www.openslr.org/12
  2. Repository: N/A
  3. Paper: Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur: LibriSpeech: an ASR corpus based on public domain audiobooks, ICASSP 2015.
  4. Point of Contact: http://www.openslr.org/
LibriSpeech Dataset Curators

Vassil Panayotov and Daniel Povey 

LibriSpeech Dataset Licensing Information

CC BY 4.0

LibriSpeech Dataset Citation Information

@inproceedings{panayotov2015librispeech,
  title={Librispeech: an {ASR} corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech, and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}

LibriSpeech Dataset FAQs

What is the LibriSpeech dataset for Python?

LibriSpeech consists of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from reading audiobooks from the LibriVox project and has been carefully segmented and aligned.

What is LibriSpeech ASR corpus?

The LibriSpeech ASR corpus uses audiobooks from LibriVox to generate speech. This corpus is freely available under the CC BY 4.0 license and represents approximately 8000 public domain audiobooks, most of which are in English. 

What is LibriSpeech ASR corpus used for?

LibriSpeech ASR corpus can be used to train high-quality acoustic models and can also be used in NLP language models.

How to download the LibriSpeech dataset in Python?

You can load the LibriSpeech dataset in one line of code using the open-source Deep Lake package in Python. See the detailed instructions above for loading the LibriSpeech training, testing, or developer subsets.

How can I use the LibriSpeech dataset in PyTorch or TensorFlow?

You can stream the LibriSpeech dataset while training a model in PyTorch or TensorFlow with one line of code using the open-source Deep Lake package in Python. See the detailed instructions above on how to train a model on the LibriSpeech dataset with PyTorch or TensorFlow.


© 2024 All Rights Reserved by Snark AI, inc dba Activeloop