---
pipeline_tag: image-classification
tags:
- arxiv:2010.07611
- arxiv:2104.00298
license: cc-by-nc-4.0
---

To be clear, this model is tailored to my image and video classification tasks, not to ImageNet.
I built EfficientNetV2.5-S to outperform the existing EfficientNet B0 to B4, the pruned EfficientNet B1 to B4 (I pruned B4 myself), and the EfficientNetV2-T to -L models, whether trained with TensorFlow or PyTorch, in top-1 accuracy, efficiency, and robustness on my datasets and GVNS benchmarks.

## Model Details
- **Model tasks:** Image classification / video classification / feature backbone
- **Model stats:**
  - Params: 16.64 M
  - Multiply-Add Operations: 5.32 G
  - Image size: train = 299x299 / 304x304, test = 304x304 (see the preprocessing sketch after this list)
  - Classification layer: included, and defaults to 1,000 classes
- **Papers:**
  - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298
  - Layer-adaptive sparsity for the Magnitude-based Pruning: https://arxiv.org/abs/2010.07611
- **Dataset:** ImageNet-1k
- **Pretrained:** Yes, but the model benefits from further pretraining
- **Original:** This model architecture is original
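
Below is a minimal evaluation-time preprocessing sketch matching the 304x304 test size listed above. The ImageNet mean/std normalization constants are an assumption, not values confirmed by this card, so treat this as a starting point rather than the exact pipeline.

```python
# Minimal eval-time preprocessing sketch for the 304x304 test size.
# The ImageNet mean/std constants below are assumed defaults, not values
# confirmed by this card.
from PIL import Image
from torchvision import transforms

eval_transform = transforms.Compose([
    transforms.Resize(304),
    transforms.CenterCrop(304),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet defaults
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")
x = eval_transform(img).unsqueeze(0)  # shape: (1, 3, 304, 304)
```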

<br>

### Prepare Model for Training
To change the number of classes, replace the linear classification layer. 
Here's an example of how to convert the architecture into a trainable model.
```bash
pip install ptflops
```
```python
from ptflops import get_model_complexity_info
import torch
import urllib.request

nclass = 3                  # number of classes in your dataset
input_size = (3, 304, 304)  # recommended image input size
print_layer_stats = True    # prints the statistics for each layer of the model
verbose = True              # prints additional info about the MAC calculation

# Download the model. Skip this step if already downloaded
base_model = "efficientnetv2.5_base_in1k"
url = f"https://huggingface.co/FredZhang7/efficientnetv2.5_rw_s/resolve/main/{base_model}.pth"
file_name = f"./{base_model}.pth"
urllib.request.urlretrieve(url, file_name)

# The checkpoint is a pickled nn.Module, so weights_only must be False on PyTorch >= 2.6
model = torch.load(file_name, weights_only=False)

# Swap the 1,000-class ImageNet head for one sized to your dataset
model.classifier = torch.nn.Linear(in_features=1984, out_features=nclass, bias=True)
macs, nparams = get_model_complexity_info(model, input_size, as_strings=False, print_per_layer_stat=print_layer_stats, verbose=verbose)

# Trace the model with a dummy batch so it can be saved as TorchScript
example_inputs = torch.randn(1, *input_size)
traced_model = torch.jit.trace(model, example_inputs)

model_name = f"{base_model}_{nparams / 1e6:.2f}M_{macs / 1e9:.2f}G.pth"
traced_model.save(model_name)

# Load the trainable model (TorchScript archives require torch.jit.load)
model = torch.jit.load(model_name)
```
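
With the classifier replaced and the model reloaded, fine-tuning proceeds as with any PyTorch module. The sketch below assumes a hypothetical `train_loader` that yields (image, label) batches at 304x304; the optimizer and learning rate are illustrative, not the recipe used for this model.

```python
# Minimal fine-tuning sketch. `train_loader` is a hypothetical DataLoader
# yielding (image, label) batches at 304x304; AdamW and lr=1e-4 are
# illustrative choices, not this model's training recipe.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.jit.load(model_name).to(device)
model.train()

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```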

### Top-1 Accuracy Comparisons
I fine-tuned the existing models at 299x299, 304x304, 320x320, or 384x384 resolution, depending on the input size used during pretraining and on VRAM usage.

`efficientnet_b3_pruned` achieved the second-highest top-1 accuracy, as well as the highest epoch-1 training accuracy, on my task out of all the previous EfficientNet models that my 24 GB RTX 3090 could handle.

I will publish the detailed report, including a link to the GVNS benchmarks, in another model repository.
This repository contains only the base model pretrained on ImageNet, not the model fine-tuned for my task.

### Carbon Emissions
Comparing all models and testing my new architectures cost roughly 504 GPU hours over a span of 27 days.