Ayushman72 committed
Commit 578a261
1 Parent(s): 2f9fa75

Revert "Update README.md"


This reverts commit 7657166733c3abeb525aa0258af4270329374542.

Files changed (1)
  1. README.md +54 -5
README.md CHANGED
@@ -1,10 +1,59 @@
  ---
  pipeline_tag: text-to-image
  tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
+ - image captioning
+ - vit
+ - gpt
+ - gpt2
+ - torch
  ---
+ # Image Captioning using ViT and GPT2 architecture
 
- This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: https://huggingface.co/ayushman72/ImageCaptioning
- - Docs: [More Information Needed]
+ This is my attempt to make a transformer model which takes an image as input and generates a caption for it.
+
+ ## Model Architecture
+ It comprises 12 ViT encoder blocks and 12 GPT2 decoder blocks.
+
+ ![Model Architecture](images/model.png)
+
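The repo's models.py is not included in this diff, so the following is only a rough, hypothetical sketch of how a 12-block ViT encoder feeding a 12-block GPT2-style decoder can be wired up in plain PyTorch; every name and hyperparameter below is an assumption, not the author's actual implementation.

```python
# Hypothetical sketch of a ViT-encoder / GPT2-style-decoder captioner.
# The layer counts follow the description above; everything else is assumed.
import torch
import torch.nn as nn

class ViTGPT2Captioner(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, n_layers=12,
                 n_heads=12, img_size=224, patch_size=16, max_len=64):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # ViT side: cut the image into patches and embed each one.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, d_model))
        enc = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                         batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=n_layers)
        # GPT2 side: token embeddings plus decoder blocks whose
        # cross-attention lets caption tokens look at the image patches.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.tok_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        dec = nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model,
                                         batch_first=True, norm_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224) -> patch tokens (B, n_patches, d_model)
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed
        memory = self.encoder(x)
        # captions: (B, T) token ids; a causal mask keeps decoding left-to-right.
        T = captions.size(1)
        tgt = self.tok_embed(captions) + self.tok_pos[:, :T]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # (B, T, vocab_size) next-token logits
```

The cross-attention in each decoder block is what connects the two halves: the caption tokens attend to the encoded image patches when predicting each next token.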
+ ## Training
+ The model was trained on the Flickr30k dataset, which contains 30k images with 5 captions each.
+ Training ran for 8 epochs, which took about 10 hours on Kaggle's P100 GPU.
+
+ ## Results
+ The model achieved a BLEU-4 score of 0.2115, a CIDEr score of 0.4, a METEOR score of 0.25, and a SPICE score of 0.19 on the Flickr8k dataset.
+
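For context on the headline metric, a corpus-level BLEU-4 like the one quoted above can be computed with NLTK roughly as follows; the captions here are toy placeholders, not Flickr8k data.

```python
# Toy example of corpus BLEU-4 with NLTK; a real evaluation would use the
# model's generated captions and the dataset's reference captions.
from nltk.translate.bleu_score import corpus_bleu

# Each image gets a list of tokenized reference captions.
references = [
    [["a", "dog", "runs", "on", "the", "grass"],
     ["a", "dog", "running", "outside"]],
]
# One generated caption per image.
hypotheses = [["a", "dog", "runs", "outside"]]

# The default weights (0.25, 0.25, 0.25, 0.25) give BLEU-4.
print(f"BLEU-4: {corpus_bleu(references, hypotheses):.4f}")
```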
+ These are the loss curves:
+
+ ![Loss graph](images/loss.png)
+ ![Perplexity graph](images/perplexity.png)
+
+ ## Predictions
+ To caption your own images, download models.py, predict.py, and requirements.txt, then run the following commands:
+
+ `pip install -r requirements.txt`
+
+ `python predict.py`
+
+ *The first prediction will take a while, as the model weights (1GB) have to be downloaded.*
+
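predict.py itself is not part of this diff; as a rough illustration only, a greedy decoding loop for a model like the sketch above could look like the following, where the model class, preprocessing, and tokenizer choice are all assumptions.

```python
# Hypothetical greedy caption decoding; not the repo's actual predict.py.
import torch
from PIL import Image
from torchvision import transforms
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@torch.no_grad()
def caption(model, image_path, max_len=64):
    model.eval()
    pixels = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    # Start from GPT2's end-of-text token and greedily append the argmax token.
    ids = torch.tensor([[tokenizer.eos_token_id]])
    for _ in range(max_len - 1):
        logits = model(pixels, ids)              # (1, T, vocab_size)
        next_id = logits[0, -1].argmax().view(1, 1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0, 1:])
```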
+ Here are a few examples of predictions made on the validation dataset:
+
+ ![Test 1](images/test1.png)
+ ![Test 2](images/test2.png)
+ ![Test 3](images/test3.png)
+ ![Test 4](images/test4.png)
+ ![Test 5](images/test5.png)
+ ![Test 6](images/test6.png)
+ ![Test 7](images/test7.png)
+ ![Test 8](images/test8.png)
+ ![Test 9](images/test9.png)
+
+ As we can see, these are not the most amazing predictions. Performance could be improved by training further and by using an even bigger dataset such as MS COCO (over 500k captions).
+
+ ## FAQ
+
+ Check the [full notebook](./imagecaptioning.ipynb) or [Kaggle](https://www.kaggle.com/code/ayushman72/imagecaptioning)
+
+ Download the model [weights](https://drive.google.com/file/d/1X51wAI7Bsnrhd2Pa4WUoHIXvvhIcRH7Y/view?usp=drive_link)