Kuzushiji-Kanji (KKanji) dataset

Visualization of the Kuzushiji Kanji Dataset on the Deep Lake UI

What is Kuzushiji Kanji (KKanji) Dataset?

The Kuzushiji Kanji (KKanji) dataset contains 140,426 images of Kanji characters (Kuzushiji is a cursive Japanese writing style). It is a large and highly imbalanced dataset of 64×64 grayscale images. Class sizes range from 1,766 examples down to only a single example.
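
To make the imbalance concrete, the sketch below summarizes a list of integer class labels: how many classes occur, the largest and smallest class sizes, and how many classes have a single example. The function name `summarize_class_balance` and the toy labels are illustrative assumptions, not part of the dataset or of Deep Lake; on the real data one would pass the full list of KKanji labels instead.

```python
from collections import Counter


def summarize_class_balance(labels):
    """Return basic imbalance statistics for an iterable of integer class labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    # Class sizes, largest first.
    sizes = sorted(counts.values(), reverse=True)
    return {
        "num_examples": sum(sizes),
        "num_classes": len(sizes),
        "largest_class": sizes[0],
        "smallest_class": sizes[-1],
        "singleton_classes": sum(1 for size in sizes if size == 1),
    }


# Toy labels standing in for the real KKanji labels (hypothetical data).
toy_labels = [0] * 6 + [1] * 3 + [2] + [3]
stats = summarize_class_balance(toy_labels)
print(stats)
```

A large gap between `largest_class` and `smallest_class`, together with a non-zero `singleton_classes`, is the pattern described above.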

Download Kuzushiji Kanji (KKanji) Dataset in Python

Rather than downloading the KKanji files manually, you can load the dataset in Python with one line of code using the open-source Deep Lake package.

Load Kuzushiji Kanji (KKanji) Dataset Training Subset in Python

				
import deeplake
ds = deeplake.load("hub://activeloop/kuzushiji-kanji")
				
			

Kuzushiji Kanji (KKanji) Dataset Structure

Kuzushiji Kanji (KKanji) Data Fields
  • image: tensor containing the 64×64 image.
  • label: an integer between 0 and 3831 representing the Kanji Character.
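
A quick sanity check of one sample against the fields above could look like the following. It is a minimal sketch: `check_sample` is a hypothetical helper, and it assumes the image shape is reported either as `(64, 64)` or with a trailing single channel as `(64, 64, 1)`, which this page does not specify.

```python
NUM_CLASSES = 3832  # labels run from 0 to 3831 inclusive
IMAGE_SIDE = 64


def check_sample(image_shape, label):
    """Return True if a sample matches the documented image size and label range."""
    shape = tuple(image_shape)
    # Assumption: a trailing channel axis of size 1 is also acceptable.
    if len(shape) == 3 and shape[-1] == 1:
        shape = shape[:2]
    return shape == (IMAGE_SIDE, IMAGE_SIDE) and 0 <= int(label) < NUM_CLASSES


print(check_sample((64, 64), 0))        # True
print(check_sample((64, 64, 1), 3831))  # True
print(check_sample((28, 28), 5))        # False: wrong image size
print(check_sample((64, 64), 3832))     # False: label out of range
```

With a loaded dataset, the shape and label would come from the `image` and `label` tensors of a single sample.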

How to use Kuzushiji Kanji (KKanji) Dataset with PyTorch and TensorFlow in Python

Train a model on Kuzushiji Kanji (KKanji) dataset with PyTorch in Python

Deep Lake’s built-in PyTorch dataloader connects the data to your compute in one line:

				
dataloader = ds.pytorch(num_workers=0, batch_size=4, shuffle=False)
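
The loader can then be consumed like any Python iterable. The sketch below assumes each batch behaves like a dictionary keyed by the field names `image` and `label` listed earlier; `peek_batches` is a hypothetical helper, and the stand-in batches only mimic that layout so the snippet runs without downloading anything.

```python
def peek_batches(dataloader, max_batches=2):
    """Return (num_images, num_labels) for the first few batches of a loader."""
    sizes = []
    for index, batch in enumerate(dataloader):
        if index >= max_batches:
            break
        # Assumption: batches are dict-like and keyed by field name.
        images, labels = batch["image"], batch["label"]
        sizes.append((len(images), len(labels)))
    return sizes


# Stand-in for the real loader: three batches of four blank 64x64 images.
blank_image = [[0] * 64 for _ in range(64)]
fake_loader = [{"image": [blank_image] * 4, "label": [0, 1, 2, 3]} for _ in range(3)]
print(peek_batches(fake_loader))  # [(4, 4), (4, 4)]
```

Passing the real `dataloader` in place of `fake_loader` should report batches of 4, matching `batch_size=4`, provided the assumption about the batch layout holds.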
				
			
Train a model on Kuzushiji Kanji (KKanji) dataset with TensorFlow in Python
				
dataloader = ds.tensorflow()
				
			

Kuzushiji Kanji (KKanji) Dataset Creation

Data Collection Information

Kuzushiji-Kanji is one of the three Kuzushiji-MNIST datasets (Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji) created by the National Institute of Japanese Literature (NIJL) and curated by the Center for Open Data in the Humanities (CODH). A bounding box was created for each character during the transcription process, but literature scholars did not consider the boxes worth sharing. From a machine learning perspective, CODH suggested releasing a separate dataset of the bounding boxes on each page, because it could serve as the basis for many machine learning challenges and for work towards automated transcription. This resulted in the full release of the Kuzushiji dataset in November 2016. The full Kuzushiji dataset contains 3,999 character types and 403,242 characters.

Additional Information about Kuzushiji Kanji (KKanji) Dataset

Kuzushiji Kanji (KKanji) Dataset Description

  • Homepage: http://codh.rois.ac.jp/kmnist/index.html.en
  • Repository: https://github.com/rois-codh/kmnist
  • Paper: Deep Learning for Classical Japanese Literature. Tarin Clanuwat et al. arXiv:1812.01718
  • Point of Contact: http://codh.rois.ac.jp/feedback/
Kuzushiji Kanji (KKanji) Dataset Curators

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto and David Ha

Kuzushiji Kanji (KKanji) Dataset Licensing Information

CC BY-SA 4.0 License

Kuzushiji Kanji (KKanji) Dataset Citation Information
				
					@online{clanuwat2018deep,
  author       = {Tarin Clanuwat and Mikel Bober-Irizar and Asanobu Kitamoto and Alex Lamb and Kazuaki Yamamoto and David Ha},
  title        = {Deep Learning for Classical Japanese Literature},
  date         = {2018-12-03},
  year         = {2018},
  eprintclass  = {cs.CV},
  eprinttype   = {arXiv},
  eprint       = {cs.CV/1812.01718},
}

Kuzushiji Kanji (KKanji) Dataset FAQs

What is the Kuzushiji Kanji (KKanji) dataset for Python?

The Kuzushiji-Kanji dataset is a machine learning dataset of Kanji characters. It contains 140,426 grayscale images of handwritten Kanji characters, each 64×64 pixels, with integer labels from 0 to 3831.

What is the Kuzushiji Kanji (KKanji) dataset used for?

Kuzushiji-Kanji is a popular dataset of the Kanji characters used in the Japanese language. It is typically used to train and evaluate models that classify cursive Kanji characters.

How to download the Kuzushiji Kanji (KKanji) dataset in Python?

You can load the Kuzushiji-Kanji dataset quickly with one line of code using the open-source Python package Activeloop Deep Lake. See the detailed instructions on how to load the Kuzushiji Kanji dataset in Python.

How can I use Kuzushiji Kanji (KKanji) dataset in PyTorch or TensorFlow?

You can stream the Kuzushiji-Kanji dataset while training a model in PyTorch or TensorFlow with one line of code using the open-source Python package Activeloop Deep Lake. See the detailed instructions on how to train a model on the Kuzushiji-Kanji dataset with PyTorch in Python or train a model on the Kuzushiji-Kanji dataset with TensorFlow in Python.

Should I work with Kuzushiji Kanji (KKanji) dataset in CSV?

No. CSV is not optimized for working with image data, especially in machine learning workflows. Instead of downloading the Kuzushiji-Kanji dataset as CSV, you can easily load, version-control, query, and manipulate Kuzushiji-Kanji for machine learning purposes using Activeloop Deep Lake.

How to create an Image Dataset like Kuzushiji Kanji (KKanji) dataset?

With Activeloop Deep Lake, creating image datasets like the Kuzushiji-Kanji character dataset is simple. Simple datasets like Kuzushiji-Kanji can be created automatically by allowing Deep Lake to parse the legacy files into Deep Lake dataset format. More complex datasets can be created manually.

Kuzushiji-Kanji vs Kuzushiji-MNIST. What is the difference between Kuzushiji-Kanji and Kuzushiji-MNIST?

Kuzushiji-Kanji and Kuzushiji-MNIST are two separate datasets. Kuzushiji-MNIST is meant to be a drop-in replacement for the MNIST dataset: like MNIST, it contains 70,000 grayscale images of 28×28 pixels in 10 classes. Japanese has 48 Hiragana characters and one Hiragana iteration mark, which is more than MNIST's 10 classes allow, so one Hiragana character was chosen to represent each of the 10 rows of Hiragana.

Kuzushiji-Kanji, on the other hand, is a large and highly imbalanced dataset intended to give detailed coverage of Kanji characters. Its strong class imbalance reflects how often each character appears in the real books from which the data was sourced, and it was kept that way to represent the real data distribution.
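
One common way to compensate for this kind of imbalance during training is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * class_count)`; it is an illustration rather than something the dataset authors prescribe, and `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): class 0 is common, class 2 appears only once.
weights = balanced_class_weights([0] * 6 + [1] * 2 + [2])
print(weights)  # {0: 0.5, 1: 1.5, 2: 3.0}
```

Rare classes receive larger weights, so a loss function that accepts per-class weights pays more attention to them. With classes that have a single example, the weights can become very large, so capping or smoothing them may be worth considering.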

What is the size of each image in the Kuzushiji-Kanji (KKanji) dataset?

The image size is constant across the Kuzushiji-Kanji dataset: every image is a 64×64 pixel square.

Deep Lake community member Uday Uppal has contributed to this dataset. You rock, Uday! 🙂
