LibriSpeech Dataset


Visualize & query the LibriSpeech Dataset in the Deep Lake UI.


What is LibriSpeech Dataset?

LibriSpeech is derived from the LibriVox project and contains approximately 1,000 hours of read English audiobooks, most of which are based on texts from Project Gutenberg. To produce a corpus suitable for training speech recognition systems, the audiobook recordings were automatically aligned and segmented against the corresponding book text, and segments with noisy transcripts were filtered out.

Download LibriSpeech Dataset in Python

Instead of downloading the LibriSpeech dataset manually, you can load it in Python with a single line of code using the open-source Deep Lake package.

Load LibriSpeech Dataset Training Subset in Python

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-train-clean-360')

OR

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-train-clean-100')

Load LibriSpeech Dataset Testing Subset in Python

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-test-clean')

OR

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-test-other')

Load LibriSpeech Dataset Developer Subset in Python

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-dev-clean')

OR

import deeplake
ds = deeplake.load('hub://activeloop/LibriSpeech-dev-other')

LibriSpeech Dataset Structure

LibriSpeech Data Fields
  • file: path to the downloaded audio file in .flac format.
  • audio: dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • text: transcription of the audio file.
  • id: unique id of the data sample.
  • speaker_id: unique id of the speaker; the same speaker id can appear across multiple data samples.
  • chapter_id: id of the audiobook chapter that includes the transcription.
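LibriSpeech transcripts in the text field follow a fixed convention: uppercase letters, no punctuation apart from apostrophes. The sketch below shows how one might normalize free-form reference text into that convention; the helper name is ours, not part of the corpus tooling:

```python
import re

def normalize_transcript(text: str) -> str:
    """Normalize free text toward LibriSpeech-style transcripts:
    uppercase, punctuation removed (apostrophes kept), single spaces."""
    text = text.upper()
    # Drop everything except letters, apostrophes, and spaces.
    text = re.sub(r"[^A-Z' ]+", " ", text)
    # Collapse runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Don't go, Mr. Darcy!"))  # DON'T GO MR DARCY
```

This kind of normalization is useful when comparing model output against LibriSpeech references, since any stray punctuation or casing difference would otherwise count as an error.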
LibriSpeech Data Splits
  • The LibriSpeech training set is separated into a 100-hour (train-clean-100) and a 360-hour (train-clean-360) subset.
  • The LibriSpeech testing set is separated into test-clean and test-other categories.
  • The LibriSpeech developer set is separated into dev-clean and dev-other categories.

The ‘clean’ and ‘other’ labels indicate how challenging the recordings are for automatic speech recognition systems: speakers on whom a baseline recognizer performed well were assigned to ‘clean’, the rest to ‘other’. Each development and test set contains about five hours of audio.
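ASR difficulty of the kind used to assign the clean/other split is typically measured by word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of the standard computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution -> 0.333...
```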

How to use LibriSpeech Dataset with PyTorch and TensorFlow in Python

Train a model on the LibriSpeech dataset with PyTorch in Python

Let’s use Deep Lake’s built-in PyTorch one-line dataloader to connect the data to the compute:

dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)
Train a model on LibriSpeech dataset with TensorFlow in Python
dataloader = ds.tensorflow()

LibriSpeech Dataset Creation

Data Collection and Normalization Information

Using LibriVox’s API, the authors collected information about readers, the audiobook projects they participated in, and the chapters of books they read, for inclusion in the corpus. The LibriVox API and Internet Archive metadata files were used to match audio files with reference-text URLs. For a few audiobooks there was no exact match in Project Gutenberg, so fuzzy matches were allowed.
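Fuzzy title matching of this kind can be illustrated with Python’s standard difflib module; the titles below are made-up examples, not entries from the corpus:

```python
import difflib

# A hypothetical catalog of reference-text titles.
catalog = [
    "Pride and Prejudice",
    "A Tale of Two Cities",
    "The Adventures of Sherlock Holmes",
]

# Find the closest catalog title to a slightly misspelled audiobook title.
matches = difflib.get_close_matches(
    "The Adventures of Sherlock Homes",  # note the typo
    catalog,
    n=1,
    cutoff=0.8,  # require high similarity to avoid spurious matches
)
print(matches)  # ['The Adventures of Sherlock Holmes']
```

A high cutoff keeps the matcher conservative, which matters when a wrong match would pair audio with the text of an entirely different book.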

To ensure there was no speaker overlap between the training, development, and test sets, the authors verified that each recording was unambiguously attributable to a single speaker. The data for each gender was also balanced in terms of speaker diversity.

Additional Information about LibriSpeech Dataset

LibriSpeech Dataset Description

  1. Homepage: https://www.openslr.org/12
  2. Repository: N/A
  3. Paper: Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur: LibriSpeech: an ASR corpus based on public domain audiobooks, ICASSP 2015.
  4. Point of Contact: http://www.openslr.org/
LibriSpeech Dataset Curators

Vassil Panayotov and Daniel Povey 

LibriSpeech Dataset Licensing Information

CC BY 4.0

LibriSpeech Dataset Citation Information

@inproceedings{panayotov2015librispeech,
  title={Librispeech: an {ASR} corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech, and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}

LibriSpeech Dataset FAQs

What is the LibriSpeech dataset for Python?

LibriSpeech consists of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from reading audiobooks from the LibriVox project and has been carefully segmented and aligned.

What is LibriSpeech ASR corpus?

The LibriSpeech ASR corpus uses audiobooks from LibriVox to generate speech. This corpus is freely available under the CC BY 4.0 license and represents approximately 8000 public domain audiobooks, most of which are in English. 

What is LibriSpeech ASR corpus used for?

LibriSpeech ASR corpus can be used to train high-quality acoustic models and can also be used in NLP language models.

How to download the LibriSpeech dataset in Python?

You can load the LibriSpeech dataset in one line of code using the open-source Deep Lake package in Python. See the detailed instructions above for loading the LibriSpeech training, testing, or developer subsets.

How can I use the LibriSpeech dataset in PyTorch or TensorFlow?

You can stream the LibriSpeech dataset while training a model in PyTorch or TensorFlow with one line of code using the open-source Deep Lake package in Python. See the detailed instructions above on how to train a model on the LibriSpeech dataset with PyTorch or TensorFlow.


© 2024 All Rights Reserved by Snark AI, inc dba Activeloop