Machine Learning Datasets

Kuzushiji-Kanji (KKanji) dataset

Visualization of the Kuzushiji Kanji Dataset in the Deep Lake UI

What is Kuzushiji Kanji (KKanji) Dataset?

The Kuzushiji Kanji (KKanji) dataset contains 140,426 images of Kanji characters written in Kuzushiji, a cursive Japanese writing style. It is a large and highly imbalanced dataset of 64×64 grayscale images. The number of examples per class ranges from 1,766 down to a single example.
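
Because the class sizes differ so widely, it can help to weight classes when training. The following is a minimal sketch, not part of Deep Lake or the original KKanji tooling, of one common heuristic: inverse-frequency class weights. The function name and the toy labels are hypothetical stand-ins for the real label column.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class label to total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}


# Toy labels standing in for KKanji labels:
# class 0 is common, class 2 has a single example.
toy_labels = [0, 0, 0, 0, 1, 1, 2]
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

Rare classes receive larger weights, so a weighted loss or a weighted sampler would see them more often than their raw frequency suggests. Whether this helps on KKanji is an empirical question.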

Download Kuzushiji Kanji (KKanji) Dataset in Python

Instead of downloading the KKanji dataset, you can load it in Python with just one line of code using our open-source Deep Lake package.

Load Kuzushiji Kanji (KKanji) Dataset Training Subset in Python

import deeplake
ds = deeplake.load("hub://activeloop/kuzushiji-kanji")

Kuzushiji Kanji (KKanji) Dataset Structure

Kuzushiji Kanji (KKanji) Data Fields
  • image: tensor containing the 64×64 grayscale image.
  • label: an integer between 0 and 3831 representing the Kanji character.
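
The two fields above imply simple invariants that can be checked before training. The following is a minimal sketch in plain Python, with the image modeled as a nested list rather than a Deep Lake tensor. The helper name `is_valid_sample` and the constants are hypothetical and not part of any library.

```python
IMAGE_SIZE = 64     # images are 64x64 according to the field description
NUM_CLASSES = 3832  # labels run from 0 to 3831 inclusive


def is_valid_sample(image, label):
    """Return True if image is a 64x64 grid and label is an in-range integer."""
    if not isinstance(label, int) or not 0 <= label < NUM_CLASSES:
        return False
    if len(image) != IMAGE_SIZE:
        return False
    return all(len(row) == IMAGE_SIZE for row in image)


blank_image = [[0] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
print(is_valid_sample(blank_image, 3831))  # True
print(is_valid_sample(blank_image, 3832))  # False
```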

How to use Kuzushiji Kanji (KKanji) Dataset with PyTorch and TensorFlow in Python

Train a model on Kuzushiji Kanji (KKanji) dataset with PyTorch in Python

Let’s use Deep Lake’s built-in one-line PyTorch dataloader to connect the data to the compute:

dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)
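
To illustrate what `batch_size=4` and `shuffle=False` mean in the call above, the following is a minimal pure-Python sketch of sequential batching. It does not use Deep Lake or PyTorch. The helper `sequential_batches` is hypothetical, and it assumes the usual PyTorch-style convention that the final batch may be smaller than the rest.

```python
def sequential_batches(samples, batch_size):
    """Yield consecutive, unshuffled batches; the last one may be shorter."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten toy samples stand in for dataset rows.
toy_samples = list(range(10))
batches = list(sequential_batches(toy_samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With shuffling disabled, samples arrive in dataset order, which is convenient for debugging but usually not what you want for training.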
				
			
Train a model on Kuzushiji Kanji (KKanji) dataset with TensorFlow in Python

dataloader = ds.tensorflow()

Kuzushiji Kanji (KKanji) Dataset Creation

Data Collection Information

Kuzushiji-Kanji is one of the three Kuzushiji-MNIST datasets (Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji) created by the National Institute of Japanese Literature (NIJL) and curated by the Center for Open Data in the Humanities (CODH). A bounding box was created for each character during the transcription process, but literature scholars did not consider the bounding boxes worth sharing. From a machine learning perspective, CODH suggested releasing a separate dataset of per-page bounding boxes, because it could serve as the basis for many machine learning challenges and for work towards automated transcription. This resulted in the full release of the Kuzushiji dataset in November 2016. The full dataset contains 3,999 character types and 403,242 characters.

Additional Information about Kuzushiji Kanji (KKanji) Dataset

Kuzushiji Kanji (KKanji) Dataset Description

  • Homepage: http://codh.rois.ac.jp/kmnist/index.html.en
  • Repository: https://github.com/rois-codh/kmnist
  • Paper: Deep Learning for Classical Japanese Literature. Tarin Clanuwat et al. arXiv:1812.01718
  • Point of Contact: http://codh.rois.ac.jp/feedback/

Kuzushiji Kanji (KKanji) Dataset Curators

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto and David Ha

Kuzushiji Kanji (KKanji) Dataset Licensing Information

CC BY-SA 4.0 License

Kuzushiji Kanji (KKanji) Dataset Citation Information

@online{clanuwat2018deep,
  author       = {Tarin Clanuwat and Mikel Bober-Irizar and Asanobu Kitamoto and Alex Lamb and Kazuaki Yamamoto and David Ha},
  title        = {Deep Learning for Classical Japanese Literature},
  date         = {2018-12-03},
  year         = {2018},
  eprintclass  = {cs.CV},
  eprinttype   = {arXiv},
  eprint       = {cs.CV/1812.01718},
}

Kuzushiji Kanji (KKanji) Dataset FAQs

What is the Kuzushiji Kanji (KKanji) dataset for Python?

The Kuzushiji-Kanji dataset is a machine learning dataset of Kanji characters. It contains 140,426 square 64×64 pixel grayscale images of handwritten Kanji characters, each labeled with an integer between 0 and 3831.

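What is the Kuzushiji Kanji (KKanji) dataset used for?

Kuzushiji-Kanji is a popular dataset of Kanji characters from the Japanese language. It is used to train and evaluate machine learning models that recognize cursive Kanji, for example as a step towards automated transcription.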

How to download the Kuzushiji Kanji (KKanji) dataset in Python?

You can quickly load the Kuzushiji-Kanji dataset with one line of code using the open-source package Activeloop Deep Lake in Python. See detailed instructions on how to load the Kuzushiji Kanji dataset in Python.

How can I use Kuzushiji Kanji (KKanji) dataset in PyTorch or TensorFlow?

You can stream the Kuzushiji-Kanji dataset while training a model in PyTorch or TensorFlow with one line of code using the open-source package Activeloop Deep Lake in Python. See detailed instructions on how to train a model on the Kuzushiji-Kanji dataset with PyTorch in Python or train a model on the Kuzushiji-Kanji dataset with TensorFlow in Python.

Should I work with Kuzushiji Kanji (KKanji) dataset in CSV?

No. CSV is not optimized for working with image data, especially for machine learning workflows. Instead of downloading the Kuzushiji-Kanji dataset as a CSV, you can easily load, version-control, query, and manipulate Kuzushiji-Kanji for machine learning purposes using Activeloop Deep Lake.

How to create an Image Dataset like Kuzushiji Kanji (KKanji) dataset?

With Activeloop Deep Lake, creating image datasets like the Kuzushiji-Kanji character dataset is simple. Simple datasets like Kuzushiji-Kanji can be created automatically by allowing Deep Lake to parse the legacy files into the Deep Lake dataset format. More complex datasets can be created manually.

Kuzushiji-Kanji vs Kuzushiji-MNIST. What is the difference between Kuzushiji-Kanji and Kuzushiji-MNIST?

Kuzushiji-Kanji and Kuzushiji-MNIST are two separate datasets. The Kuzushiji-MNIST dataset is meant to be a drop-in replacement for the MNIST dataset. It contains 70,000 28×28 grayscale images, similar to MNIST. MNIST has 10 classes, whereas the Japanese language has 48 Hiragana characters and one Hiragana iteration mark. Hence, one Hiragana character was chosen to represent each of the 10 rows of Hiragana.

Kuzushiji-Kanji, on the other hand, is a large and highly imbalanced dataset whose sole purpose is to provide a detailed dataset of Kanji characters. The class imbalance reflects how often each character appears in the real books from which the data was sourced, and it was kept that way to represent the real data distribution.
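
One practical consequence of this distribution is that some classes have too few examples to appear in both a training and a validation split. The following is a minimal sketch of a per-class split that always leaves at least one example of every class in training. The function name, the 20% validation fraction, and the toy labels are all hypothetical, and this is not a split defined by the dataset authors.

```python
import random
from collections import defaultdict


def split_keep_rare_in_train(labels, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices), splitting each class separately.

    Classes with a single example go entirely to training.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out a share of each class, but always leave
        # at least one example for training.
        n_val = min(int(len(indices) * val_fraction), len(indices) - 1)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# Toy labels: a common class, a smaller class, and a singleton class.
toy_labels = [0] * 10 + [1] * 5 + [2]
train_idx, val_idx = split_keep_rare_in_train(toy_labels)
print(len(train_idx), len(val_idx))
```

Validation metrics computed this way cannot cover the singleton classes, so they are worth reporting alongside the number of classes actually evaluated.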

What is the size of each image in the Kuzushiji-Kanji (KKanji) dataset?

The Kuzushiji-Kanji dataset image size is constant across all images of the dataset. Each Kuzushiji-Kanji dataset image is a fixed-size 64×64 pixel square image.

Deep Lake community member Uday Uppal has contributed to this dataset. You rock, Uday! 🙂


© 2022 All Rights Reserved by Snark AI, inc dba Activeloop