Model Card for pop-lyrics-generator-v1

Fine-tuned from openai-community/gpt2 on the smgriffin/modern-pop-lyrics dataset. Generates lyrics in the style of specific pop artists.

Model Description

It's pretty good at generating a song structure and lyrics styled after a given artist, but bad at rhyming. It sometimes repeats the same line over and over, but so do pop artists. It might be good for inspiration while writing lyrics. Some of the generated content can be really silly and potentially offensive, especially if you input Lil Wayne.

  • Developed by: Scott Griffin
  • Model type: Generative language model
  • Language(s) (NLP): English, Spanish
  • Finetuned from model: openai-community/gpt2

Check out the W&B run here: https://wandb.ai/scottgriffinm-scott-griffin-industrial-complex/pop-lyrics-generator-v1?nw=nwuserscottgriffinm

And my blog post on making it here: https://scottsblog.glitch.me#pop-lyrics-generator-v1

Uses

This model is not for commercial use. The lyrics it was fine-tuned on are the property of the individual artists. It is for research purposes only.

How to Use

Use the code below to generate lyrics:

from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# load model
model_name = "smgriffin/pop-lyrics-generator-v1"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# create text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prompt for justin bieber lyrics
artist_name = "Justin Bieber"
prompt = f"Artist: {artist_name}\nLyrics:"

# generate and print
generated_texts = text_generator(
    prompt, 
    max_length=150,
    num_return_sequences=1,  
    temperature=0.9,  # less than .9 results in a lot of repeated lyrics
    top_k=50,
    top_p=0.95,
    do_sample=True, 
)

print("Generated Lyrics:")
print(generated_texts[0]["generated_text"])
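
The pipeline returns the prompt followed by the generated lyrics, and the output can loop on one line. Below is a minimal post-processing sketch for both. The helper names (`build_prompt`, `extract_lyrics`, `collapse_repeats`) and the sample string are hypothetical illustrations, not part of the model or of transformers.

```python
def build_prompt(artist_name: str) -> str:
    """Format the conditioning prompt the same way as the example above."""
    return f"Artist: {artist_name}\nLyrics:"


def extract_lyrics(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt (if present) and surrounding whitespace."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


def collapse_repeats(lyrics: str, max_repeats: int = 2) -> str:
    """Keep at most `max_repeats` consecutive copies of the same line."""
    kept = []
    previous = None
    run = 0
    for line in lyrics.splitlines():
        key = line.strip().lower()
        # Blank lines never count as repeats, so stanza breaks survive.
        run = run + 1 if (key and key == previous) else 1
        previous = key
        if run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)


# Made-up text standing in for generated_texts[0]["generated_text"].
prompt = build_prompt("Justin Bieber")
sample_output = prompt + " Baby\nBaby\nbaby\nOh\n"

lyrics = collapse_repeats(extract_lyrics(sample_output, prompt))
print(lyrics)
```

With a real run, pass `generated_texts[0]["generated_text"]` in place of `sample_output`. transformers also has generation options such as `no_repeat_ngram_size` and `repetition_penalty` that target repetition at sampling time; they are not part of the settings shown above.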

How to Fine-Tune Your Own Lyric Generation Model

Use the code below to fine-tune your own GPT-2 model (for example, on the smgriffin/modern-pop-lyrics dataset):

import os
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling


# output directory
output_dir = "/your/output/directory"
os.makedirs(output_dir, exist_ok=True)

# load dataset
dataset = load_dataset("smgriffin/modern-pop-lyrics")

# preprocess dataset
def preprocess_function(example):
    # Combine artist name with lyrics for conditioning
    combined = [f"Artist: {artist}\nLyrics: {lyrics}\n\n" for artist, lyrics in zip(example['artist'], example['lyrics'])]
    return {"text": combined}

processed_dataset = dataset.map(preprocess_function, batched=True)

# split to train and test sets
train_test_split = processed_dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
val_dataset = train_test_split["test"]

# load tokenizer, model
model_name = "gpt2"  # Base GPT-2 model for fine-tuning
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# use eos_token as pad_token (gpt2 doesn't have a padding token)
tokenizer.pad_token = tokenizer.eos_token

# tokenize dataset
def tokenize_function(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
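    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
        # DataCollatorForLanguageModeling(mlm=False) rebuilds labels from
        # input_ids and sets padding positions to -100, so this is a placeholder
        "labels": tokenized["input_ids"],
    }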

train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])
val_dataset = val_dataset.map(tokenize_function, batched=True, remove_columns=["artist", "lyrics", "text"])

# data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# load GPT-2
model = GPT2LMHeadModel.from_pretrained(model_name)

# training arguments
training_args = TrainingArguments(
    output_dir=output_dir, 
    overwrite_output_dir=True,
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8, 
    per_device_eval_batch_size=8,
    num_train_epochs=10,  
    save_steps=1000,
    save_total_limit=1,  
    logging_dir=f"{output_dir}/logs", 
    logging_steps=50,
    gradient_accumulation_steps=2,  
    fp16=True,  # mixed precision; needs a GPU, set to False when training on CPU
    max_grad_norm=1.0,
    push_to_hub=False,
)

# init trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
    data_collator=data_collator,
)

# start fine-tuning
trainer.train()

# save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
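
The artist conditioning works because every training example starts with the same "Artist: ...\nLyrics:" prefix that is used as the prompt at generation time. Here is a small sketch that shows the training text format on a toy batch. The artist names, lyrics, and the `format_examples` / `inference_prompt` helpers are made up for illustration; `format_examples` just mirrors `preprocess_function` above.

```python
def format_examples(batch):
    """Mirror of preprocess_function: one "Artist/Lyrics" string per row."""
    texts = [
        f"Artist: {artist}\nLyrics: {lyrics}\n\n"
        for artist, lyrics in zip(batch["artist"], batch["lyrics"])
    ]
    return {"text": texts}


def inference_prompt(artist_name):
    """Prompt format used at generation time."""
    return f"Artist: {artist_name}\nLyrics:"


# Toy batch in the column layout that dataset.map(..., batched=True) passes in.
toy_batch = {
    "artist": ["Artist A", "Artist B"],
    "lyrics": ["la la la", "na na na"],
}

formatted = format_examples(toy_batch)
for text in formatted["text"]:
    print(repr(text))

# Each training string begins with the matching generation prompt.
print(formatted["text"][1].startswith(inference_prompt("Artist B")))
```

If you change the format in `preprocess_function`, change the generation prompt to match, otherwise the model sees a prefix it was never trained on.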