Tabular Data Generation using Binary Diffusion
Abstract
Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size.
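To make the idea of a lossless, fixed-size binary representation concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the encoding, so the choices below (32 IEEE-754 bits per numerical value, a fixed-width binary index per categorical value) are assumptions, and the column names, the `WORKCLASSES` list, and all helper functions are hypothetical.

```python
import numpy as np

# Hypothetical category list for one categorical column (an assumption for this sketch).
WORKCLASSES = ["Private", "Self-employed", "Government"]

def encode_float(value):
    """Return the 32 IEEE-754 bits of a float32 value, most significant bit first."""
    raw = np.array([value], dtype=">f4").view(np.uint8)  # 4 big-endian bytes
    return np.unpackbits(raw)

def decode_float(bits):
    """Invert encode_float: pack 32 bits into 4 bytes and reinterpret as float32."""
    return float(np.packbits(bits).view(">f4")[0])

def category_width(categories):
    """Number of bits needed to index every category (at least 1)."""
    return max(1, (len(categories) - 1).bit_length())

def encode_category(value, categories):
    """Encode a category as the fixed-width binary form of its list index."""
    index = categories.index(value)
    width = category_width(categories)
    return np.array([(index >> s) & 1 for s in reversed(range(width))], dtype=np.uint8)

def decode_category(bits, categories):
    """Invert encode_category by rebuilding the index from its bits."""
    index = 0
    for bit in bits:
        index = (index << 1) | int(bit)
    return categories[index]

def encode_row(age, workclass):
    """Concatenate per-column bit vectors into one fixed-size binary row."""
    return np.concatenate([encode_float(age), encode_category(workclass, WORKCLASSES)])

def decode_row(bits):
    """Split the row at the known column widths and decode each part."""
    return decode_float(bits[:32]), decode_category(bits[32:], WORKCLASSES)

row_bits = encode_row(39.5, "Government")
print(row_bits.size)         # 32 bits for the float plus 2 bits for the category
print(decode_row(row_bits))  # the original row is recovered
```

Under these assumptions every row maps to the same number of bits regardless of its values, and decoding recovers the row exactly for values representable as float32, which is the sense in which such a transformation is lossless.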
Community
The paper introduces a method for generating synthetic tabular data with a new Binary Diffusion model. It transforms tabular data into fixed-size binary representations, uses XOR operations to add and remove noise, and trains with binary cross-entropy loss. This approach simplifies preprocessing, avoids large pretrained models, and outperforms existing state-of-the-art models on the Travel, Adult Income, and Diabetes benchmark datasets while keeping the model smaller.
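The two ingredients named in this summary, XOR noise and binary cross-entropy, can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions rather than the paper's training code: the flip probability, the array shapes, and the helper names `add_xor_noise` and `binary_cross_entropy` are hypothetical, and no denoising network is included.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_xor_noise(x0, flip_prob, rng):
    """Flip each bit of x0 independently with probability flip_prob via XOR."""
    noise = (rng.random(x0.shape) < flip_prob).astype(np.uint8)
    return np.bitwise_xor(x0, noise), noise

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and target bits."""
    probs = np.clip(probs, eps, 1.0 - eps)
    targets = targets.astype(np.float64)
    return float(-np.mean(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs)))

# A batch of 4 hypothetical binary rows, 34 bits each.
x0 = rng.integers(0, 2, size=(4, 34), dtype=np.uint8)
xt, noise = add_xor_noise(x0, flip_prob=0.3, rng=rng)

# XOR is its own inverse: applying the same noise mask again restores the clean bits.
recovered = np.bitwise_xor(xt, noise)
print(np.array_equal(recovered, x0))

# Loss for a perfect noise prediction versus an uninformed 0.5 guess.
perfect = binary_cross_entropy(noise.astype(np.float64), noise)
uninformed = binary_cross_entropy(np.full(noise.shape, 0.5), noise)
print(perfect < uninformed)
```

Because XOR with the same mask is self-inverse, a model that predicts the noise bits correctly can remove them exactly, and binary cross-entropy is a natural loss for such per-bit predictions.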
Code will be released soon
Cool work, congrats!
Let us know if you need any help publishing artifacts (models, datasets) on the Hub.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Mambular: A Sequential Model for Tabular Deep Learning (2024)
- Deep Feature Embedding for Tabular Data (2024)
- Data-Efficient Generation for Dataset Distillation (2024)
- HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection (2024)
- Generative Dataset Distillation Based on Diffusion Model (2024)
Models citing this paper 1