Model Card for DocuMint
This model is a fine-tuned version of the CodeGemma-2B base model that generates high-quality docstrings for Python code functions.
Model Details
Model Description
The DocuMint model is a fine-tuned variant of Google's CodeGemma-2B base model, which was originally trained to predict the next token on internet text without any instructions. DocuMint was produced by supervised instruction fine-tuning on a dataset of 100,000 Python functions and their respective docstrings, extracted from the free and open-source software (FOSS) ecosystem. The fine-tuning was performed using Low-Rank Adaptation (LoRA).
The goal of the DocuMint model is to generate docstrings that are concise (brief and to the point), complete (cover functionality, parameters, return values, and exceptions), and clear (use simple language and avoid ambiguity).
- Developed by: Bibek Poudel, Adam Cook, Sekou Traore, Shelah Ameli (University of Tennessee, Knoxville)
- Model type: Causal language model fine-tuned for code documentation generation
- Language(s) (NLP): English, Python
- License: MIT
- Finetuned from model: google/codegemma-2b
Model Sources
- Repository: GitHub
- Paper: DocuMint: Docstring Generation for Python using Small Language Models
Uses
Direct Use
The DocuMint model can be used directly to generate high-quality docstrings for Python functions. Given a Python function definition, the model will output a docstring in the format
"""<generated docstring>""".
Fine-tuning Details
Fine-tuning Data
The fine-tuning data consists of 100,000 Python functions and their docstrings extracted from popular open-source repositories in the FOSS ecosystem. Repositories were filtered based on metrics such as number of contributors (> 50), commits (> 5k), stars (> 35k), and forks (> 10k) to focus on well-established and actively maintained projects.
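The repository filter described above can be sketched as a simple predicate. Only the four thresholds come from this card; the `RepoStats` record, its field names, and the choice to require all four conditions at once are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical per-repository metadata record (field names are assumptions)."""
    name: str
    contributors: int
    commits: int
    stars: int
    forks: int


def is_well_established(repo: RepoStats) -> bool:
    """Apply the thresholds stated in this card; assumes all four must hold."""
    return (
        repo.contributors > 50
        and repo.commits > 5_000
        and repo.stars > 35_000
        and repo.forks > 10_000
    )


# Made-up numbers, purely for illustration.
candidates = [
    RepoStats("big-project", contributors=900, commits=42_000, stars=60_000, forks=15_000),
    RepoStats("small-project", contributors=12, commits=800, stars=2_000, forks=150),
]
selected = [repo.name for repo in candidates if is_well_established(repo)]
print(selected)
```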
Fine-tuning Hyperparameters
Hyperparameter | Value |
---|---|
Fine-tuning Method | LoRA |
Epochs | 4 |
Batch Size | 8 |
Gradient Accumulation Steps | 16 |
Initial Learning Rate | 2e-4 |
LoRA Parameters | 78,446,592 |
Training Tokens | 185,040,896 |
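As a rough reading of the table, the batch size and gradient accumulation steps combine into an effective batch size per optimizer update. The arithmetic below assumes a single GPU, that the batch size is per step, and that all 100,000 function-docstring pairs are used for training; none of these is stated explicitly in this card.

```python
import math

batch_size = 8            # from the table
grad_accum_steps = 16     # from the table
epochs = 4                # from the table
num_examples = 100_000    # dataset size; assumes every pair is used for training

# Gradients from 16 batches of 8 are accumulated before each optimizer update.
effective_batch_size = batch_size * grad_accum_steps
updates_per_epoch = math.ceil(num_examples / effective_batch_size)
total_updates = updates_per_epoch * epochs

print(effective_batch_size, updates_per_epoch, total_updates)  # 128 782 3128
```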
Evaluation
Metrics
Accuracy: Measures the coverage of the generated docstring on code elements like input/output variables. Calculated using cosine similarity between the generated and expert docstring embeddings.
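A minimal sketch of the cosine similarity step, assuming the two docstrings have already been converted to fixed-length embedding vectors. The embedding model is not named in this card, so the vectors below are made-up placeholders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings; real ones would come from a sentence-embedding model.
generated_embedding = [0.2, 0.7, 0.1]
expert_embedding = [0.25, 0.65, 0.05]
score = cosine_similarity(generated_embedding, expert_embedding)
print(round(score, 3))
```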
Conciseness: Measures the ability to convey information succinctly without verbosity. Calculated as a compression ratio between the compressed and original docstring sizes.
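One way to compute such a ratio is sketched below. The compressor is not named in this card; `zlib` is an assumption here, and the ratio is taken as compressed size divided by original size.

```python
import zlib


def compression_ratio(docstring: str) -> float:
    """Compressed size divided by original size, both in bytes (zlib is an assumption)."""
    raw = docstring.encode("utf-8")
    if not raw:
        raise ValueError("docstring must not be empty")
    return len(zlib.compress(raw)) / len(raw)


terse = "Return the sum of two integers."
verbose = "This function returns the sum. " * 8  # deliberately repetitive

print(compression_ratio(terse), compression_ratio(verbose))
```

Repetitive text compresses further and therefore yields a lower ratio than text with little redundancy. Very short strings can produce a ratio above 1 because of the compressor's fixed overhead.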
Clarity: Measures readability using simple, unambiguous language. Calculated using the Flesch-Kincaid readability score.
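The card does not say whether the Flesch Reading Ease or the Flesch-Kincaid Grade Level variant is used, so the sketch below shows both standard formulas. The syllable counter is a crude vowel-group heuristic chosen for brevity and is an assumption; a dedicated readability library would count syllables more accurately.

```python
import re
from typing import Tuple


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str) -> Tuple[int, int, int]:
    """Return (sentences, words, syllables) for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: higher means harder to read."""
    n_sent, n_words, n_syll = text_counts(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher means easier to read."""
    n_sent, n_words, n_syll = text_counts(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


simple = "Return the sum of two numbers."
dense = "Instantiate the asynchronous communication infrastructure."
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(dense))
```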
Model Inference
For running inference, PEFT must be used to load the fine-tuned model:
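from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
adapter_id = "documint/CodeGemma2B-fine-tuned"  # adapter repository ID
device = "cuda"  # use "cpu" if no GPU is available
config = PeftConfig.from_pretrained(adapter_id)  # adapter config; records the base model name
tokenizer = AutoTokenizer.from_pretrained("google/codegemma-2b")  # base model tokenizer (assumed unchanged by fine-tuning)
# Load the base model, then attach the LoRA adapter weights to it
model = AutoModelForCausalLM.from_pretrained("google/codegemma-2b", device_map=device)
fine_tuned_model = PeftModel.from_pretrained(model, adapter_id, device_map=device)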
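Once the adapter is loaded, generation can follow the usual `transformers` pattern. The sketch below is hedged: the prompt template used during fine-tuning is not given in this card, so `build_prompt` and its wording are hypothetical, and `generate_docstring` is an illustrative helper that expects the PEFT-wrapped model plus a tokenizer for the base model (for example from `AutoTokenizer.from_pretrained("google/codegemma-2b")`).

```python
def build_prompt(function_source: str) -> str:
    """Hypothetical instruction prompt; the real fine-tuning template is not stated in this card."""
    return (
        "Generate a docstring for the following Python function.\n\n"
        f"{function_source.strip()}\n\n"
        "Docstring:\n"
    )


def generate_docstring(model, tokenizer, function_source: str, max_new_tokens: int = 128) -> str:
    """Greedy-decode a docstring with a Hugging Face causal LM and its tokenizer."""
    prompt = build_prompt(function_source)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


example_function = '''
def add(a, b):
    return a + b
'''
prompt = build_prompt(example_function)
print(prompt)
```

With the model and tokenizer loaded, a call such as `generate_docstring(fine_tuned_model, tokenizer, example_function)` would return the generated text. Output quality is likely to depend on how closely the prompt matches the format used during fine-tuning.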
Hardware
Fine-tuning was performed on a machine with an Intel 12900K CPU, an NVIDIA RTX 3090 GPU, and 64 GB of RAM. Total fine-tuning time was 48 GPU hours.
Citation
BibTeX:
@article{poudel2024documint,
title={DocuMint: Docstring Generation for Python using Small Language Models},
author={Poudel, Bibek and Cook, Adam and Traore, Sekou and Ameli, Shelah},
journal={arXiv preprint arXiv:2405.10243},
year={2024}
}
Model Card Contact
- For questions or more information, please contact:
{bpoudel3,acook46,staore1,oameli}@vols.utk.edu