init
Browse files- .gitattributes +1 -0
- .gitignore +4 -0
- README.md +106 -0
- assets/LipCoordNet_model_architecture.png +0 -0
- assets/training_graphs.png +0 -0
- cvtransforms.py +13 -0
- data/coords_train.txt +0 -0
- data/coords_val.txt +0 -0
- dataset.py +190 -0
- inference.py +350 -0
- lip_coordinate_extraction/lips_coords_extractor.py +203 -0
- lip_coordinate_extraction/shape_predictor_68_face_landmarks_GTX.dat +3 -0
- model.py +113 -0
- options.py +21 -0
- pretrain/LipCoordNet_coords_loss_0.025581153109669685_wer_0.01746208431890914_cer_0.006488426950253695.pt +3 -0
- requirements.txt +0 -0
- runs/Dec06_14-41-13_coords/events.out.tfevents.1701844873.coords +3 -0
- train.py +237 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
36 |
+
*.dat filter=lfs diff=lfs merge=lfs -text
|
.gitignore
ADDED
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
1 |
+
venv
|
2 |
+
__pycache__
|
3 |
+
samples
|
4 |
+
output_videos
|
README.md
CHANGED
@@ -1,3 +1,109 @@
|
|
1 |
---
|
2 |
license: mit
|
3 |
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
license: mit
|
3 |
---
|
4 |
+
|
5 |
+
# LipCoordNet: Enhanced Lip Reading with Landmark Coordinates
|
6 |
+
|
7 |
+
## Introduction
|
8 |
+
|
9 |
+
LipCoordNet is an advanced neural network model designed for accurate lip reading by incorporating lip landmark coordinates as a supplementary input to the traditional image sequence input. This enhancement to the original LipNet architecture aims to improve the precision of sentence predictions by providing additional geometric context to the model.
|
10 |
+
|
11 |
+
## Features
|
12 |
+
|
13 |
+
- **Dual Input System**: Utilizes both raw image sequences and corresponding lip landmark coordinates for improved context.
|
14 |
+
- **Enhanced Spatial Resolution**: Improved spatial analysis of lip movements through detailed landmark tracking.
|
15 |
+
- **State-of-the-Art Performance**: Outperforms the original LipNet, as well as VIPL's [PyTorch implementation of LipNet](https://github.com/VIPL-Audio-Visual-Speech-Understanding/LipNet-PyTorch).
|
16 |
+
| Scenario | Image Size (W x H) | CER | WER |
|
17 |
+
| :-------------------------------: | :----------------: | :--: | :---: |
|
18 |
+
| Unseen speakers (Original) | 100 x 50 | 6.7% | 13.6% |
|
19 |
+
| Overlapped speakers (Original) | 100 x 50 | 2.0% | 5.6% |
|
20 |
+
| Unseen speakers (VIPL LipNet) | 128 x 64 | 6.7% | 13.3% |
|
21 |
+
| Overlapped speakers (VIPL LipNet) | 128 x 64 | 1.9% | 4.6% |
|
22 |
+
| Overlapped speakers (LipCoordNet) | 128 x 64 | 0.6% | 1.7% |
|
23 |
+
|
24 |
+
## Getting Started
|
25 |
+
|
26 |
+
### Prerequisites
|
27 |
+
|
28 |
+
- Python 3.10 or later
|
29 |
+
- Pytorch 2.0 or later
|
30 |
+
- OpenCV
|
31 |
+
- NumPy
|
32 |
+
- dlib (for landmark detection)
|
33 |
+
- The detailed list of dependencies can be found in `requirements.txt`.
|
34 |
+
|
35 |
+
### Installation
|
36 |
+
|
37 |
+
1. Clone the repository:
|
38 |
+
|
39 |
+
```bash
|
40 |
+
git clone https://huggingface.co/SilentSpeak/LipCoordNet
|
41 |
+
```
|
42 |
+
|
43 |
+
2. Navigate to the project directory:
|
44 |
+
```bash
|
45 |
+
cd LipCoordNet
|
46 |
+
```
|
47 |
+
3. Install the required dependencies:
|
48 |
+
```bash
|
49 |
+
pip install -r requirements.txt
|
50 |
+
```
|
51 |
+
|
52 |
+
### Usage
|
53 |
+
|
54 |
+
To train the LipCoordNet model with your dataset, first update the options.py file with the appropriate paths to your dataset and pretrained weights (comment out the weights if you want to start from scratch). Then, run the following command:
|
55 |
+
|
56 |
+
```bash
|
57 |
+
python train.py
|
58 |
+
```
|
59 |
+
|
60 |
+
To perform sentence prediction using the pre-trained model:
|
61 |
+
|
62 |
+
```bash
|
63 |
+
python inference.py --input_video <path_to_video>
|
64 |
+
```
|
65 |
+
|
66 |
+
note: ffmpeg is required to convert video to image sequence and run the inference script.
|
67 |
+
|
68 |
+
## Model Architecture
|
69 |
+
|
70 |
+
![LipCoordNet model architecture](./assets/LipCoordNet_model_architecture.png)
|
71 |
+
|
72 |
+
## Training
|
73 |
+
|
74 |
+
This model is built on top of the [LipNet-Pytorch](https://github.com/VIPL-Audio-Visual-Speech-Understanding/LipNet-PyTorch) project on GitHub. The training process if similar to the original LipNet model, with the addition of landmark coordinates as a supplementary input. We used the pretrained weights from the original LipNet model as a starting point for training our model, froze the weights for the original LipNet layers, and trained the new layers for the landmark coordinates.
|
75 |
+
|
76 |
+
The dataset used to train this model is the [EGCLLC dataset](https://huggingface.co/datasets/SilentSpeak/EGCLLC). The dataset is not included in this repository, but can be downloaded from the link above.
|
77 |
+
|
78 |
+
Total training time: 2 days
|
79 |
+
Total epochs: 51
|
80 |
+
Training hardware: NVIDIA GeForce RTX 3080 12GB
|
81 |
+
|
82 |
+
![LipCoordNet training curves](./assets/training_graphs.png)
|
83 |
+
|
84 |
+
For an interactive view of the training curves, please refer to the tensorboard logs in the `runs` directory.
|
85 |
+
Use this command to view the logs:
|
86 |
+
|
87 |
+
```bash
|
88 |
+
tensorboard --logdir runs
|
89 |
+
```
|
90 |
+
|
91 |
+
## Evaluation
|
92 |
+
|
93 |
+
We achieved a lowest WER of 1.7%, CER of 0.6% and a loss of 0.0256 on the validation dataset.
|
94 |
+
|
95 |
+
## License
|
96 |
+
|
97 |
+
This project is licensed under the MIT License.
|
98 |
+
|
99 |
+
## Acknowledgments
|
100 |
+
|
101 |
+
This model, LipCoordNet, has been developed with reference to the LipNet-PyTorch implementation available at [VIPL-Audio-Visual-Speech-Understanding](https://github.com/VIPL-Audio-Visual-Speech-Understanding/LipNet-PyTorch). We extend our gratitude to the contributors of this repository for providing a solid foundation and insightful examples that greatly facilitated the development of our enhanced lip reading model. Their work has been instrumental in advancing the field of audio-visual speech understanding and has provided the community with valuable resources to build upon.
|
102 |
+
|
103 |
+
Alvarez Casado, C., Bordallo Lopez, M. Real-time face alignment: evaluation methods, training strategies and implementation optimization. Springer Journal of Real-time image processing, 2021
|
104 |
+
|
105 |
+
Assael, Y., Shillingford, B., Whiteson, S., & Freitas, N. (2017). LipNet: End-to-End Sentence-level Lipreading. GPU Technology Conference.
|
106 |
+
|
107 |
+
## Contact
|
108 |
+
|
109 |
+
Project Link: https://github.com/ffeew/LipCoordNet
|
assets/LipCoordNet_model_architecture.png
ADDED
assets/training_graphs.png
ADDED
cvtransforms.py
ADDED
@@ -0,0 +1,13 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import random
|
2 |
+
|
3 |
+
|
4 |
+
def HorizontalFlip(batch_img, p=0.5):
|
5 |
+
# (T, H, W, C)
|
6 |
+
if random.random() > p:
|
7 |
+
batch_img = batch_img[:, :, ::-1, ...]
|
8 |
+
return batch_img
|
9 |
+
|
10 |
+
|
11 |
+
def ColorNormalize(batch_img):
|
12 |
+
batch_img = batch_img / 255.0
|
13 |
+
return batch_img
|
data/coords_train.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
data/coords_val.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
dataset.py
ADDED
@@ -0,0 +1,190 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import numpy as np
|
2 |
+
import cv2
|
3 |
+
import os
|
4 |
+
from torch.utils.data import Dataset
|
5 |
+
from cvtransforms import *
|
6 |
+
import torch
|
7 |
+
import editdistance
|
8 |
+
import json
|
9 |
+
|
10 |
+
|
11 |
+
class MyDataset(Dataset):
|
12 |
+
letters = [
|
13 |
+
" ",
|
14 |
+
"A",
|
15 |
+
"B",
|
16 |
+
"C",
|
17 |
+
"D",
|
18 |
+
"E",
|
19 |
+
"F",
|
20 |
+
"G",
|
21 |
+
"H",
|
22 |
+
"I",
|
23 |
+
"J",
|
24 |
+
"K",
|
25 |
+
"L",
|
26 |
+
"M",
|
27 |
+
"N",
|
28 |
+
"O",
|
29 |
+
"P",
|
30 |
+
"Q",
|
31 |
+
"R",
|
32 |
+
"S",
|
33 |
+
"T",
|
34 |
+
"U",
|
35 |
+
"V",
|
36 |
+
"W",
|
37 |
+
"X",
|
38 |
+
"Y",
|
39 |
+
"Z",
|
40 |
+
]
|
41 |
+
|
42 |
+
def __init__(
|
43 |
+
self,
|
44 |
+
video_path,
|
45 |
+
anno_path,
|
46 |
+
coords_path,
|
47 |
+
file_list,
|
48 |
+
vid_pad,
|
49 |
+
txt_pad,
|
50 |
+
phase,
|
51 |
+
):
|
52 |
+
self.anno_path = anno_path
|
53 |
+
self.coords_path = coords_path
|
54 |
+
self.vid_pad = vid_pad
|
55 |
+
self.txt_pad = txt_pad
|
56 |
+
self.phase = phase
|
57 |
+
|
58 |
+
with open(file_list, "r") as f:
|
59 |
+
self.videos = [
|
60 |
+
os.path.join(video_path, line.strip()) for line in f.readlines()
|
61 |
+
]
|
62 |
+
|
63 |
+
self.data = []
|
64 |
+
for vid in self.videos:
|
65 |
+
items = vid.split("/")
|
66 |
+
self.data.append((vid, items[-4], items[-1]))
|
67 |
+
|
68 |
+
def __getitem__(self, idx):
|
69 |
+
(vid, spk, name) = self.data[idx]
|
70 |
+
vid = self._load_vid(vid)
|
71 |
+
anno = self._load_anno(
|
72 |
+
os.path.join(self.anno_path, spk, "align", name + ".align")
|
73 |
+
)
|
74 |
+
coord = self._load_coords(os.path.join(self.coords_path, spk, name + ".json"))
|
75 |
+
|
76 |
+
if self.phase == "train":
|
77 |
+
vid = HorizontalFlip(vid)
|
78 |
+
|
79 |
+
vid = ColorNormalize(vid)
|
80 |
+
|
81 |
+
vid_len = vid.shape[0]
|
82 |
+
anno_len = anno.shape[0]
|
83 |
+
vid = self._padding(vid, self.vid_pad)
|
84 |
+
anno = self._padding(anno, self.txt_pad)
|
85 |
+
coord = self._padding(coord, self.vid_pad)
|
86 |
+
|
87 |
+
return {
|
88 |
+
"vid": torch.FloatTensor(vid.transpose(3, 0, 1, 2)),
|
89 |
+
"txt": torch.LongTensor(anno),
|
90 |
+
"coord": torch.FloatTensor(coord),
|
91 |
+
"txt_len": anno_len,
|
92 |
+
"vid_len": vid_len,
|
93 |
+
}
|
94 |
+
|
95 |
+
def __len__(self):
|
96 |
+
return len(self.data)
|
97 |
+
|
98 |
+
def _load_vid(self, p):
|
99 |
+
files = os.listdir(p)
|
100 |
+
files = list(filter(lambda file: file.find(".jpg") != -1, files))
|
101 |
+
files = sorted(files, key=lambda file: int(os.path.splitext(file)[0]))
|
102 |
+
array = [cv2.imread(os.path.join(p, file)) for file in files]
|
103 |
+
array = list(filter(lambda im: not im is None, array))
|
104 |
+
array = [
|
105 |
+
cv2.resize(im, (128, 64), interpolation=cv2.INTER_LANCZOS4) for im in array
|
106 |
+
]
|
107 |
+
array = np.stack(array, axis=0).astype(np.float32)
|
108 |
+
|
109 |
+
return array
|
110 |
+
|
111 |
+
def _load_anno(self, name):
|
112 |
+
with open(name, "r") as f:
|
113 |
+
lines = [line.strip().split(" ") for line in f.readlines()]
|
114 |
+
txt = [line[2] for line in lines]
|
115 |
+
txt = list(filter(lambda s: not s.upper() in ["SIL", "SP"], txt))
|
116 |
+
return MyDataset.txt2arr(" ".join(txt).upper(), 1)
|
117 |
+
|
118 |
+
def _load_coords(self, name):
|
119 |
+
# obtained from the resized image in the lip coordinate extraction
|
120 |
+
img_width = 600
|
121 |
+
img_height = 500
|
122 |
+
with open(name, "r") as f:
|
123 |
+
coords_data = json.load(f)
|
124 |
+
|
125 |
+
coords = []
|
126 |
+
for frame in sorted(coords_data.keys(), key=int):
|
127 |
+
frame_coords = coords_data[frame]
|
128 |
+
|
129 |
+
# Normalize the coordinates
|
130 |
+
normalized_coords = []
|
131 |
+
for x, y in zip(frame_coords[0], frame_coords[1]):
|
132 |
+
normalized_x = x / img_width
|
133 |
+
normalized_y = y / img_height
|
134 |
+
normalized_coords.append((normalized_x, normalized_y))
|
135 |
+
|
136 |
+
coords.append(normalized_coords)
|
137 |
+
coords_array = np.array(coords, dtype=np.float32)
|
138 |
+
return coords_array
|
139 |
+
|
140 |
+
def _padding(self, array, length):
|
141 |
+
array = [array[_] for _ in range(array.shape[0])]
|
142 |
+
size = array[0].shape
|
143 |
+
for i in range(length - len(array)):
|
144 |
+
array.append(np.zeros(size))
|
145 |
+
return np.stack(array, axis=0)
|
146 |
+
|
147 |
+
@staticmethod
|
148 |
+
def txt2arr(txt, start):
|
149 |
+
arr = []
|
150 |
+
for c in list(txt):
|
151 |
+
arr.append(MyDataset.letters.index(c) + start)
|
152 |
+
return np.array(arr)
|
153 |
+
|
154 |
+
@staticmethod
|
155 |
+
def arr2txt(arr, start):
|
156 |
+
txt = []
|
157 |
+
for n in arr:
|
158 |
+
if n >= start:
|
159 |
+
txt.append(MyDataset.letters[n - start])
|
160 |
+
return "".join(txt).strip()
|
161 |
+
|
162 |
+
@staticmethod
|
163 |
+
def ctc_arr2txt(arr, start):
|
164 |
+
pre = -1
|
165 |
+
txt = []
|
166 |
+
for n in arr:
|
167 |
+
if pre != n and n >= start:
|
168 |
+
if (
|
169 |
+
len(txt) > 0
|
170 |
+
and txt[-1] == " "
|
171 |
+
and MyDataset.letters[n - start] == " "
|
172 |
+
):
|
173 |
+
pass
|
174 |
+
else:
|
175 |
+
txt.append(MyDataset.letters[n - start])
|
176 |
+
pre = n
|
177 |
+
return "".join(txt).strip()
|
178 |
+
|
179 |
+
@staticmethod
|
180 |
+
def wer(predict, truth):
|
181 |
+
word_pairs = [(p[0].split(" "), p[1].split(" ")) for p in zip(predict, truth)]
|
182 |
+
wer = [1.0 * editdistance.eval(p[0], p[1]) / len(p[1]) for p in word_pairs]
|
183 |
+
return wer
|
184 |
+
|
185 |
+
@staticmethod
|
186 |
+
def cer(predict, truth):
|
187 |
+
cer = [
|
188 |
+
1.0 * editdistance.eval(p[0], p[1]) / len(p[1]) for p in zip(predict, truth)
|
189 |
+
]
|
190 |
+
return cer
|
inference.py
ADDED
@@ -0,0 +1,350 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import argparse
|
2 |
+
import os
|
3 |
+
from model import LipCoordNet
|
4 |
+
from dataset import MyDataset
|
5 |
+
import torch
|
6 |
+
import cv2
|
7 |
+
import face_alignment
|
8 |
+
import numpy as np
|
9 |
+
import dlib
|
10 |
+
import glob
|
11 |
+
|
12 |
+
|
13 |
+
def get_position(size, padding=0.25):
|
14 |
+
x = [
|
15 |
+
0.000213256,
|
16 |
+
0.0752622,
|
17 |
+
0.18113,
|
18 |
+
0.29077,
|
19 |
+
0.393397,
|
20 |
+
0.586856,
|
21 |
+
0.689483,
|
22 |
+
0.799124,
|
23 |
+
0.904991,
|
24 |
+
0.98004,
|
25 |
+
0.490127,
|
26 |
+
0.490127,
|
27 |
+
0.490127,
|
28 |
+
0.490127,
|
29 |
+
0.36688,
|
30 |
+
0.426036,
|
31 |
+
0.490127,
|
32 |
+
0.554217,
|
33 |
+
0.613373,
|
34 |
+
0.121737,
|
35 |
+
0.187122,
|
36 |
+
0.265825,
|
37 |
+
0.334606,
|
38 |
+
0.260918,
|
39 |
+
0.182743,
|
40 |
+
0.645647,
|
41 |
+
0.714428,
|
42 |
+
0.793132,
|
43 |
+
0.858516,
|
44 |
+
0.79751,
|
45 |
+
0.719335,
|
46 |
+
0.254149,
|
47 |
+
0.340985,
|
48 |
+
0.428858,
|
49 |
+
0.490127,
|
50 |
+
0.551395,
|
51 |
+
0.639268,
|
52 |
+
0.726104,
|
53 |
+
0.642159,
|
54 |
+
0.556721,
|
55 |
+
0.490127,
|
56 |
+
0.423532,
|
57 |
+
0.338094,
|
58 |
+
0.290379,
|
59 |
+
0.428096,
|
60 |
+
0.490127,
|
61 |
+
0.552157,
|
62 |
+
0.689874,
|
63 |
+
0.553364,
|
64 |
+
0.490127,
|
65 |
+
0.42689,
|
66 |
+
]
|
67 |
+
|
68 |
+
y = [
|
69 |
+
0.106454,
|
70 |
+
0.038915,
|
71 |
+
0.0187482,
|
72 |
+
0.0344891,
|
73 |
+
0.0773906,
|
74 |
+
0.0773906,
|
75 |
+
0.0344891,
|
76 |
+
0.0187482,
|
77 |
+
0.038915,
|
78 |
+
0.106454,
|
79 |
+
0.203352,
|
80 |
+
0.307009,
|
81 |
+
0.409805,
|
82 |
+
0.515625,
|
83 |
+
0.587326,
|
84 |
+
0.609345,
|
85 |
+
0.628106,
|
86 |
+
0.609345,
|
87 |
+
0.587326,
|
88 |
+
0.216423,
|
89 |
+
0.178758,
|
90 |
+
0.179852,
|
91 |
+
0.231733,
|
92 |
+
0.245099,
|
93 |
+
0.244077,
|
94 |
+
0.231733,
|
95 |
+
0.179852,
|
96 |
+
0.178758,
|
97 |
+
0.216423,
|
98 |
+
0.244077,
|
99 |
+
0.245099,
|
100 |
+
0.780233,
|
101 |
+
0.745405,
|
102 |
+
0.727388,
|
103 |
+
0.742578,
|
104 |
+
0.727388,
|
105 |
+
0.745405,
|
106 |
+
0.780233,
|
107 |
+
0.864805,
|
108 |
+
0.902192,
|
109 |
+
0.909281,
|
110 |
+
0.902192,
|
111 |
+
0.864805,
|
112 |
+
0.784792,
|
113 |
+
0.778746,
|
114 |
+
0.785343,
|
115 |
+
0.778746,
|
116 |
+
0.784792,
|
117 |
+
0.824182,
|
118 |
+
0.831803,
|
119 |
+
0.824182,
|
120 |
+
]
|
121 |
+
|
122 |
+
x, y = np.array(x), np.array(y)
|
123 |
+
|
124 |
+
x = (x + padding) / (2 * padding + 1)
|
125 |
+
y = (y + padding) / (2 * padding + 1)
|
126 |
+
x = x * size
|
127 |
+
y = y * size
|
128 |
+
return np.array(list(zip(x, y)))
|
129 |
+
|
130 |
+
|
131 |
+
def transformation_from_points(points1, points2):
|
132 |
+
points1 = points1.astype(np.float64)
|
133 |
+
points2 = points2.astype(np.float64)
|
134 |
+
|
135 |
+
c1 = np.mean(points1, axis=0)
|
136 |
+
c2 = np.mean(points2, axis=0)
|
137 |
+
points1 -= c1
|
138 |
+
points2 -= c2
|
139 |
+
s1 = np.std(points1)
|
140 |
+
s2 = np.std(points2)
|
141 |
+
points1 /= s1
|
142 |
+
points2 /= s2
|
143 |
+
|
144 |
+
U, S, Vt = np.linalg.svd(points1.T * points2)
|
145 |
+
R = (U * Vt).T
|
146 |
+
return np.vstack(
|
147 |
+
[
|
148 |
+
np.hstack(((s2 / s1) * R, c2.T - (s2 / s1) * R * c1.T)),
|
149 |
+
np.matrix([0.0, 0.0, 1.0]),
|
150 |
+
]
|
151 |
+
)
|
152 |
+
|
153 |
+
|
154 |
+
def load_video(file, device: str):
|
155 |
+
# create the samples directory if it doesn't exist
|
156 |
+
if not os.path.exists("samples"):
|
157 |
+
os.makedirs("samples")
|
158 |
+
|
159 |
+
p = os.path.join("samples")
|
160 |
+
output = os.path.join("samples", "%04d.jpg")
|
161 |
+
cmd = "ffmpeg -hide_banner -loglevel error -i {} -qscale:v 2 -r 25 {}".format(
|
162 |
+
file, output
|
163 |
+
)
|
164 |
+
os.system(cmd)
|
165 |
+
|
166 |
+
files = os.listdir(p)
|
167 |
+
files = sorted(files, key=lambda x: int(os.path.splitext(x)[0]))
|
168 |
+
|
169 |
+
array = [cv2.imread(os.path.join(p, file)) for file in files]
|
170 |
+
|
171 |
+
array = list(filter(lambda im: not im is None, array))
|
172 |
+
|
173 |
+
fa = face_alignment.FaceAlignment(
|
174 |
+
face_alignment.LandmarksType._2D, flip_input=False, device=device
|
175 |
+
)
|
176 |
+
points = [fa.get_landmarks(I) for I in array]
|
177 |
+
|
178 |
+
front256 = get_position(256)
|
179 |
+
video = []
|
180 |
+
for point, scene in zip(points, array):
|
181 |
+
if point is not None:
|
182 |
+
shape = np.array(point[0])
|
183 |
+
shape = shape[17:]
|
184 |
+
M = transformation_from_points(np.matrix(shape), np.matrix(front256))
|
185 |
+
|
186 |
+
img = cv2.warpAffine(scene, M[:2], (256, 256))
|
187 |
+
(x, y) = front256[-20:].mean(0).astype(np.int32)
|
188 |
+
w = 160 // 2
|
189 |
+
img = img[y - w // 2 : y + w // 2, x - w : x + w, ...]
|
190 |
+
img = cv2.resize(img, (128, 64))
|
191 |
+
video.append(img)
|
192 |
+
|
193 |
+
video = np.stack(video, axis=0).astype(np.float32)
|
194 |
+
video = torch.FloatTensor(video.transpose(3, 0, 1, 2)) / 255.0
|
195 |
+
|
196 |
+
return video
|
197 |
+
|
198 |
+
|
199 |
+
def extract_lip_coordinates(detector, predictor, img_path):
|
200 |
+
image = cv2.imread(img_path)
|
201 |
+
image = cv2.resize(image, (600, 500))
|
202 |
+
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
|
203 |
+
|
204 |
+
rects = detector(gray)
|
205 |
+
retries = 3
|
206 |
+
while retries > 0:
|
207 |
+
try:
|
208 |
+
assert len(rects) == 1
|
209 |
+
break
|
210 |
+
except AssertionError as e:
|
211 |
+
retries -= 1
|
212 |
+
|
213 |
+
for rect in rects:
|
214 |
+
# apply the shape predictor to the face ROI
|
215 |
+
shape = predictor(gray, rect)
|
216 |
+
x = []
|
217 |
+
y = []
|
218 |
+
for n in range(48, 68):
|
219 |
+
x.append(shape.part(n).x)
|
220 |
+
y.append(shape.part(n).y)
|
221 |
+
return [x, y]
|
222 |
+
|
223 |
+
|
224 |
+
def generate_lip_coordinates(frame_images_directory, detector, predictor):
|
225 |
+
frames = glob.glob(frame_images_directory + "/*.jpg")
|
226 |
+
frames.sort()
|
227 |
+
|
228 |
+
img = cv2.imread(frames[0])
|
229 |
+
height, width, layers = img.shape
|
230 |
+
|
231 |
+
coords = []
|
232 |
+
for frame in frames:
|
233 |
+
x_coords, y_coords = extract_lip_coordinates(detector, predictor, frame)
|
234 |
+
normalized_coords = []
|
235 |
+
for x, y in zip(x_coords, y_coords):
|
236 |
+
normalized_x = x / width
|
237 |
+
normalized_y = y / height
|
238 |
+
normalized_coords.append((normalized_x, normalized_y))
|
239 |
+
coords.append(normalized_coords)
|
240 |
+
coords_array = np.array(coords, dtype=np.float32)
|
241 |
+
coords_array = torch.from_numpy(coords_array)
|
242 |
+
return coords_array
|
243 |
+
|
244 |
+
|
245 |
+
def ctc_decode(y):
|
246 |
+
y = y.argmax(-1)
|
247 |
+
t = y.size(0)
|
248 |
+
result = []
|
249 |
+
for i in range(t + 1):
|
250 |
+
result.append(MyDataset.ctc_arr2txt(y[:i], start=1))
|
251 |
+
return result
|
252 |
+
|
253 |
+
|
254 |
+
def output_video(p, txt, output_path):
|
255 |
+
files = os.listdir(p)
|
256 |
+
files = sorted(files, key=lambda x: int(os.path.splitext(x)[0]))
|
257 |
+
|
258 |
+
font = cv2.FONT_HERSHEY_SIMPLEX
|
259 |
+
|
260 |
+
for file, line in zip(files, txt):
|
261 |
+
img = cv2.imread(os.path.join(p, file))
|
262 |
+
h, w, _ = img.shape
|
263 |
+
img = cv2.putText(
|
264 |
+
img, line, (w // 8, 11 * h // 12), font, 1.2, (0, 0, 0), 3, cv2.LINE_AA
|
265 |
+
)
|
266 |
+
img = cv2.putText(
|
267 |
+
img,
|
268 |
+
line,
|
269 |
+
(w // 8, 11 * h // 12),
|
270 |
+
font,
|
271 |
+
1.2,
|
272 |
+
(255, 255, 255),
|
273 |
+
0,
|
274 |
+
cv2.LINE_AA,
|
275 |
+
)
|
276 |
+
h = h // 2
|
277 |
+
w = w // 2
|
278 |
+
img = cv2.resize(img, (w, h))
|
279 |
+
cv2.imwrite(os.path.join(p, file), img)
|
280 |
+
|
281 |
+
# create the output_videos directory if it doesn't exist
|
282 |
+
if not os.path.exists(output_path):
|
283 |
+
os.makedirs(output_path)
|
284 |
+
|
285 |
+
output = os.path.join(output_path, "output.mp4")
|
286 |
+
cmd = "ffmpeg -hide_banner -loglevel error -y -i {}/%04d.jpg -r 25 {}".format(
|
287 |
+
p, output
|
288 |
+
)
|
289 |
+
os.system(cmd)
|
290 |
+
|
291 |
+
|
292 |
+
def main():
|
293 |
+
parser = argparse.ArgumentParser()
|
294 |
+
parser.add_argument(
|
295 |
+
"--weights",
|
296 |
+
type=str,
|
297 |
+
default="pretrain/LipCoordNet_coords_loss_0.025581153109669685_wer_0.01746208431890914_cer_0.006488426950253695.pt",
|
298 |
+
help="path to the weights file",
|
299 |
+
)
|
300 |
+
parser.add_argument(
|
301 |
+
"--input_video",
|
302 |
+
type=str,
|
303 |
+
help="path to the input video frames",
|
304 |
+
)
|
305 |
+
parser.add_argument(
|
306 |
+
"--device",
|
307 |
+
type=str,
|
308 |
+
default="cuda",
|
309 |
+
help="device to run the model on",
|
310 |
+
)
|
311 |
+
|
312 |
+
parser.add_argument(
|
313 |
+
"--output_path",
|
314 |
+
type=str,
|
315 |
+
default="output_videos",
|
316 |
+
help="directory to save the output video",
|
317 |
+
)
|
318 |
+
|
319 |
+
args = parser.parse_args()
|
320 |
+
|
321 |
+
# validate if device is valid
|
322 |
+
if args.device not in ("cuda", "cpu"):
|
323 |
+
raise ValueError("Invalid device, must be either cuda or cpu")
|
324 |
+
|
325 |
+
device = args.device
|
326 |
+
|
327 |
+
# load model
|
328 |
+
model = LipCoordNet()
|
329 |
+
model.load_state_dict(torch.load(args.weights))
|
330 |
+
model = model.to(device)
|
331 |
+
model.eval()
|
332 |
+
detector = dlib.get_frontal_face_detector()
|
333 |
+
predictor = dlib.shape_predictor(
|
334 |
+
"lip_coordinate_extraction/shape_predictor_68_face_landmarks_GTX.dat"
|
335 |
+
)
|
336 |
+
|
337 |
+
# load video
|
338 |
+
video = load_video(args.input_video, device)
|
339 |
+
|
340 |
+
# generate lip coordinates
|
341 |
+
coords = generate_lip_coordinates("samples", detector, predictor)
|
342 |
+
|
343 |
+
pred = model(video[None, ...].to(device), coords[None, ...].to(device))
|
344 |
+
output = ctc_decode(pred[0])
|
345 |
+
print(output[-1])
|
346 |
+
output_video("samples", output, args.output_path)
|
347 |
+
|
348 |
+
|
349 |
+
if __name__ == "__main__":
|
350 |
+
main()
|
lip_coordinate_extraction/lips_coords_extractor.py
ADDED
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import cv2
|
2 |
+
import dlib
|
3 |
+
import json
|
4 |
+
import glob
|
5 |
+
import os
|
6 |
+
from multiprocessing import Pool
|
7 |
+
|
8 |
+
LIP_COORDINATES_DIRECTORY = "lip_coordinates"
|
9 |
+
ERROR_DIRECTORY = "error_videos"
|
10 |
+
|
11 |
+
# path to the original GRID dataset whose videos are converted to frames
|
12 |
+
GRID_IMAGES_DIRECTORY = "lip/GRID_imgs"
|
13 |
+
train_unseen_list = "data/unseen_val.txt"
|
14 |
+
train_overlap_list = "data/overlap_train.txt"
|
15 |
+
test_unseen_list = "data/unseen_val.txt"
|
16 |
+
test_overlap_list = "data/overlap_val.txt"
|
17 |
+
|
18 |
+
|
19 |
+
def load_data_list(data_path, dictionary):
|
20 |
+
with open(data_path, "r") as f:
|
21 |
+
for line in f.readlines():
|
22 |
+
line = line.strip()
|
23 |
+
speaker = line.split("/")[-4]
|
24 |
+
vid = line.split("/")[-1]
|
25 |
+
dictionary[f"{speaker}/{vid}"] = 1
|
26 |
+
return dictionary
|
27 |
+
|
28 |
+
|
29 |
+
def extract_lip_coordinates(detector, predictor, img_path):
|
30 |
+
# used to preprocess the original image frames in the GRID dataset to extract the lip coordinates
|
31 |
+
image = cv2.imread(img_path)
|
32 |
+
image = cv2.resize(image, (600, 500))
|
33 |
+
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
|
34 |
+
|
35 |
+
rects = detector(gray)
|
36 |
+
assert len(rects) == 1
|
37 |
+
for rect in rects:
|
38 |
+
# extract the coordinates of the bounding box
|
39 |
+
x1 = rect.left()
|
40 |
+
y1 = rect.top()
|
41 |
+
x2 = rect.right()
|
42 |
+
y2 = rect.bottom()
|
43 |
+
|
44 |
+
# apply the shape predictor to the face ROI
|
45 |
+
shape = predictor(gray, rect)
|
46 |
+
x = []
|
47 |
+
y = []
|
48 |
+
for n in range(48, 68):
|
49 |
+
x.append(shape.part(n).x)
|
50 |
+
y.append(shape.part(n).y)
|
51 |
+
return [x, y]
|
52 |
+
|
53 |
+
|
54 |
+
def log_error_video(video_path):
|
55 |
+
print("Error: ", video_path)
|
56 |
+
with open(ERROR_DIRECTORY + "/error_videos.txt", "a") as f:
|
57 |
+
f.write(video_path + "\n")
|
58 |
+
|
59 |
+
|
60 |
+
data_dict = {}
|
61 |
+
data_dict = load_data_list(train_unseen_list, data_dict)
|
62 |
+
data_dict = load_data_list(train_overlap_list, data_dict)
|
63 |
+
data_dict = load_data_list(test_unseen_list, data_dict)
|
64 |
+
data_dict = load_data_list(test_overlap_list, data_dict)
|
65 |
+
|
66 |
+
|
67 |
+
speakers = glob.glob(GRID_IMAGES_DIRECTORY + "/*")
|
68 |
+
print(speakers[0])
|
69 |
+
|
70 |
+
|
71 |
+
def generate_lip_coordinates(speakers):
|
72 |
+
file_path_sep = "\\"
|
73 |
+
detector = dlib.get_frontal_face_detector()
|
74 |
+
predictor = dlib.shape_predictor(
|
75 |
+
"lip_coordinate_extraction/shape_predictor_68_face_landmarks_GTX.dat"
|
76 |
+
)
|
77 |
+
for speaker in speakers:
|
78 |
+
print(speaker)
|
79 |
+
videos = glob.glob(speaker + "/*")
|
80 |
+
for video in videos:
|
81 |
+
print(video)
|
82 |
+
frames = glob.glob(video + "/*.jpg")
|
83 |
+
if len(frames) < 50: # filter out bad videos
|
84 |
+
continue
|
85 |
+
vid = {}
|
86 |
+
try:
|
87 |
+
frames = sorted(
|
88 |
+
frames,
|
89 |
+
key=lambda x: int(x.split(file_path_sep)[-1].split(".")[0]),
|
90 |
+
)
|
91 |
+
for frame in frames:
|
92 |
+
retry = 3
|
93 |
+
while retry > 0:
|
94 |
+
try:
|
95 |
+
coords = extract_lip_coordinates(detector, predictor, frame)
|
96 |
+
break
|
97 |
+
except Exception as e:
|
98 |
+
retry -= 1
|
99 |
+
print("Error: ", video)
|
100 |
+
print(e)
|
101 |
+
print("retrying...")
|
102 |
+
|
103 |
+
vid[frame.split(file_path_sep)[-1].split(".")[0]] = coords
|
104 |
+
vid_path = video.split(file_path_sep)
|
105 |
+
save_path = (
|
106 |
+
LIP_COORDINATES_DIRECTORY
|
107 |
+
+ "/"
|
108 |
+
+ vid_path[-2]
|
109 |
+
+ "/"
|
110 |
+
+ vid_path[-1]
|
111 |
+
+ ".json"
|
112 |
+
)
|
113 |
+
|
114 |
+
if not os.path.exists(LIP_COORDINATES_DIRECTORY + "/" + vid_path[-2]):
|
115 |
+
os.makedirs(LIP_COORDINATES_DIRECTORY + "/" + vid_path[-2])
|
116 |
+
|
117 |
+
with open(
|
118 |
+
save_path,
|
119 |
+
"w",
|
120 |
+
) as f:
|
121 |
+
json.dump(vid, f)
|
122 |
+
except Exception as e:
|
123 |
+
print(e)
|
124 |
+
log_error_video(video)
|
125 |
+
|
126 |
+
|
127 |
+
def generate_lip_coordinates(speakers):
|
128 |
+
file_path_sep = "\\"
|
129 |
+
detector = dlib.get_frontal_face_detector()
|
130 |
+
predictor = dlib.shape_predictor(
|
131 |
+
"lip_coordinate_extraction/shape_predictor_68_face_landmarks_GTX.dat"
|
132 |
+
)
|
133 |
+
for speaker in speakers:
|
134 |
+
print(speaker)
|
135 |
+
videos = glob.glob(speaker + "/*")
|
136 |
+
for video in videos:
|
137 |
+
# if (
|
138 |
+
# video.split(file_path_sep)[-2] + "/" + video.split(file_path_sep)[-1]
|
139 |
+
# not in data_dict
|
140 |
+
# ):
|
141 |
+
# continue
|
142 |
+
print(video)
|
143 |
+
frames = glob.glob(video + "/*.jpg")
|
144 |
+
if len(frames) < 50: # filter out bad videos
|
145 |
+
continue
|
146 |
+
vid = {}
|
147 |
+
try:
|
148 |
+
frames = sorted(
|
149 |
+
frames,
|
150 |
+
key=lambda x: int(x.split(file_path_sep)[-1].split(".")[0]),
|
151 |
+
)
|
152 |
+
for frame in frames:
|
153 |
+
retry = 3
|
154 |
+
while retry > 0:
|
155 |
+
try:
|
156 |
+
coords = extract_lip_coordinates(detector, predictor, frame)
|
157 |
+
break
|
158 |
+
except Exception as e:
|
159 |
+
retry -= 1
|
160 |
+
print("Error: ", video)
|
161 |
+
print(e)
|
162 |
+
print("retrying...")
|
163 |
+
|
164 |
+
vid[frame.split(file_path_sep)[-1].split(".")[0]] = coords
|
165 |
+
vid_path = video.split(file_path_sep)
|
166 |
+
save_path = (
|
167 |
+
LIP_COORDINATES_DIRECTORY
|
168 |
+
+ "/"
|
169 |
+
+ vid_path[-2]
|
170 |
+
+ "/"
|
171 |
+
+ vid_path[-1]
|
172 |
+
+ ".json"
|
173 |
+
)
|
174 |
+
|
175 |
+
if not os.path.exists(LIP_COORDINATES_DIRECTORY + "/" + vid_path[-2]):
|
176 |
+
os.makedirs(LIP_COORDINATES_DIRECTORY + "/" + vid_path[-2])
|
177 |
+
|
178 |
+
with open(
|
179 |
+
save_path,
|
180 |
+
"w",
|
181 |
+
) as f:
|
182 |
+
json.dump(vid, f)
|
183 |
+
except Exception as e:
|
184 |
+
print(e)
|
185 |
+
log_error_video(video)
|
186 |
+
|
187 |
+
|
188 |
+
num_processes = 8
|
189 |
+
|
190 |
+
speaker_groups = []
|
191 |
+
speaker_interval = len(speakers) // num_processes
|
192 |
+
for i in range(num_processes):
|
193 |
+
if i == 4:
|
194 |
+
speaker_groups.append(speakers[i * speaker_interval :])
|
195 |
+
else:
|
196 |
+
speaker_groups.append(
|
197 |
+
speakers[i * speaker_interval : (i + 1) * speaker_interval]
|
198 |
+
)
|
199 |
+
|
200 |
+
|
201 |
+
if __name__ == "__main__":
|
202 |
+
with Pool(num_processes) as p:
|
203 |
+
p.map(generate_lip_coordinates, speaker_groups)
|
lip_coordinate_extraction/shape_predictor_68_face_landmarks_GTX.dat
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:249a69a1d5f2d7c714a92934d35367d46eb52dc308d46717e82d49e8386b3b80
|
3 |
+
size 66435981
|
model.py
ADDED
@@ -0,0 +1,113 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
import torch.nn.init as init
|
4 |
+
import math
|
5 |
+
|
6 |
+
|
7 |
+
class LipCoordNet(torch.nn.Module):
|
8 |
+
def __init__(self, dropout_p=0.5, coord_input_dim=40, coord_hidden_dim=128):
|
9 |
+
super(LipCoordNet, self).__init__()
|
10 |
+
self.conv1 = nn.Conv3d(3, 32, (3, 5, 5), (1, 2, 2), (1, 2, 2))
|
11 |
+
self.pool1 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))
|
12 |
+
|
13 |
+
self.conv2 = nn.Conv3d(32, 64, (3, 5, 5), (1, 1, 1), (1, 2, 2))
|
14 |
+
self.pool2 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))
|
15 |
+
|
16 |
+
self.conv3 = nn.Conv3d(64, 96, (3, 3, 3), (1, 1, 1), (1, 1, 1))
|
17 |
+
self.pool3 = nn.MaxPool3d((1, 2, 2), (1, 2, 2))
|
18 |
+
|
19 |
+
self.gru1 = nn.GRU(96 * 4 * 8, 256, 1, bidirectional=True)
|
20 |
+
self.gru2 = nn.GRU(512, 256, 1, bidirectional=True)
|
21 |
+
|
22 |
+
self.FC = nn.Linear(512 + 2 * coord_hidden_dim, 27 + 1)
|
23 |
+
self.dropout_p = dropout_p
|
24 |
+
|
25 |
+
self.relu = nn.ReLU(inplace=True)
|
26 |
+
self.dropout = nn.Dropout(self.dropout_p)
|
27 |
+
self.dropout3d = nn.Dropout3d(self.dropout_p)
|
28 |
+
|
29 |
+
# New GRU layer for lip coordinates
|
30 |
+
self.coord_gru = nn.GRU(
|
31 |
+
coord_input_dim, coord_hidden_dim, 1, bidirectional=True
|
32 |
+
)
|
33 |
+
|
34 |
+
self._init()
|
35 |
+
|
36 |
+
def _init(self):
|
37 |
+
init.kaiming_normal_(self.conv1.weight, nonlinearity="relu")
|
38 |
+
init.constant_(self.conv1.bias, 0)
|
39 |
+
|
40 |
+
init.kaiming_normal_(self.conv2.weight, nonlinearity="relu")
|
41 |
+
init.constant_(self.conv2.bias, 0)
|
42 |
+
|
43 |
+
init.kaiming_normal_(self.conv3.weight, nonlinearity="relu")
|
44 |
+
init.constant_(self.conv3.bias, 0)
|
45 |
+
|
46 |
+
init.kaiming_normal_(self.FC.weight, nonlinearity="sigmoid")
|
47 |
+
init.constant_(self.FC.bias, 0)
|
48 |
+
|
49 |
+
for m in (self.gru1, self.gru2):
|
50 |
+
stdv = math.sqrt(2 / (96 * 3 * 6 + 256))
|
51 |
+
for i in range(0, 256 * 3, 256):
|
52 |
+
init.uniform_(
|
53 |
+
m.weight_ih_l0[i : i + 256],
|
54 |
+
-math.sqrt(3) * stdv,
|
55 |
+
math.sqrt(3) * stdv,
|
56 |
+
)
|
57 |
+
init.orthogonal_(m.weight_hh_l0[i : i + 256])
|
58 |
+
init.constant_(m.bias_ih_l0[i : i + 256], 0)
|
59 |
+
init.uniform_(
|
60 |
+
m.weight_ih_l0_reverse[i : i + 256],
|
61 |
+
-math.sqrt(3) * stdv,
|
62 |
+
math.sqrt(3) * stdv,
|
63 |
+
)
|
64 |
+
init.orthogonal_(m.weight_hh_l0_reverse[i : i + 256])
|
65 |
+
init.constant_(m.bias_ih_l0_reverse[i : i + 256], 0)
|
66 |
+
|
67 |
+
def forward(self, x, coords):
|
68 |
+
# branch 1
|
69 |
+
x = self.conv1(x)
|
70 |
+
x = self.relu(x)
|
71 |
+
x = self.dropout3d(x)
|
72 |
+
x = self.pool1(x)
|
73 |
+
|
74 |
+
x = self.conv2(x)
|
75 |
+
x = self.relu(x)
|
76 |
+
x = self.dropout3d(x)
|
77 |
+
x = self.pool2(x)
|
78 |
+
|
79 |
+
x = self.conv3(x)
|
80 |
+
x = self.relu(x)
|
81 |
+
x = self.dropout3d(x)
|
82 |
+
x = self.pool3(x)
|
83 |
+
|
84 |
+
# (B, C, T, H, W)->(T, B, C, H, W)
|
85 |
+
x = x.permute(2, 0, 1, 3, 4).contiguous()
|
86 |
+
# (B, C, T, H, W)->(T, B, C*H*W)
|
87 |
+
x = x.view(x.size(0), x.size(1), -1)
|
88 |
+
|
89 |
+
self.gru1.flatten_parameters()
|
90 |
+
self.gru2.flatten_parameters()
|
91 |
+
|
92 |
+
x, h = self.gru1(x)
|
93 |
+
x = self.dropout(x)
|
94 |
+
x, h = self.gru2(x)
|
95 |
+
x = self.dropout(x)
|
96 |
+
|
97 |
+
# branch 2
|
98 |
+
# Process lip coordinates through GRU
|
99 |
+
self.coord_gru.flatten_parameters()
|
100 |
+
|
101 |
+
# (B, T, N, C)->(T, B, C, N, C)
|
102 |
+
coords = coords.permute(1, 0, 2, 3).contiguous()
|
103 |
+
# (T, B, C, N, C)->(T, B, C, N*C)
|
104 |
+
coords = coords.view(coords.size(0), coords.size(1), -1)
|
105 |
+
coords, _ = self.coord_gru(coords)
|
106 |
+
coords = self.dropout(coords)
|
107 |
+
|
108 |
+
# combine the two branches
|
109 |
+
combined = torch.cat((x, coords), dim=2)
|
110 |
+
|
111 |
+
x = self.FC(combined)
|
112 |
+
x = x.permute(1, 0, 2).contiguous()
|
113 |
+
return x
|
options.py
ADDED
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
gpu = "0"
|
2 |
+
random_seed = 0
|
3 |
+
data_type = "coords"
|
4 |
+
video_path = "lip_images/"
|
5 |
+
train_list = f"data/{data_type}_train.txt"
|
6 |
+
val_list = f"data/{data_type}_val.txt"
|
7 |
+
anno_path = "GRID_alignments"
|
8 |
+
coords_path = "lip_coordinates"
|
9 |
+
vid_padding = 75
|
10 |
+
txt_padding = 200
|
11 |
+
batch_size = 40
|
12 |
+
base_lr = 2e-5
|
13 |
+
num_workers = 16
|
14 |
+
max_epoch = 10000
|
15 |
+
display = 50
|
16 |
+
test_step = 1000
|
17 |
+
save_prefix = f"weights/LipCoordNet_{data_type}"
|
18 |
+
is_optimize = True
|
19 |
+
pin_memory = True
|
20 |
+
|
21 |
+
weights = "pretrain/LipCoordNet_coords_loss_0.025581153109669685_wer_0.01746208431890914_cer_0.006488426950253695.pt"
|
pretrain/LipCoordNet_coords_loss_0.025581153109669685_wer_0.01746208431890914_cer_0.006488426950253695.pt
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:403ed48e2d410dcbd161f9572d0dc7778bf4fa5552daa5a903daf9bca43f56bc
|
3 |
+
size 27114996
|
requirements.txt
ADDED
Binary file (1.17 kB). View file
|
|
runs/Dec06_14-41-13_coords/events.out.tfevents.1701844873.coords
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:1f08e38ed31ca7cb4c8017e67adb11ac19204924b5089660961d1906f3365d2c
|
3 |
+
size 51273
|
train.py
ADDED
@@ -0,0 +1,237 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
from torch.utils.data import DataLoader
|
4 |
+
import os
|
5 |
+
from dataset import MyDataset
|
6 |
+
import numpy as np
|
7 |
+
import time
|
8 |
+
from model import LipCoordNet
|
9 |
+
import torch.optim as optim
|
10 |
+
from tensorboardX import SummaryWriter
|
11 |
+
import options as opt
|
12 |
+
from tqdm import tqdm
|
13 |
+
|
14 |
+
|
15 |
+
def dataset2dataloader(dataset, num_workers=opt.num_workers, shuffle=True):
|
16 |
+
return DataLoader(
|
17 |
+
dataset,
|
18 |
+
batch_size=opt.batch_size,
|
19 |
+
shuffle=shuffle,
|
20 |
+
num_workers=num_workers,
|
21 |
+
drop_last=False,
|
22 |
+
pin_memory=opt.pin_memory,
|
23 |
+
)
|
24 |
+
|
25 |
+
|
26 |
+
def show_lr(optimizer):
|
27 |
+
lr = []
|
28 |
+
for param_group in optimizer.param_groups:
|
29 |
+
lr += [param_group["lr"]]
|
30 |
+
return np.array(lr).mean()
|
31 |
+
|
32 |
+
|
33 |
+
def ctc_decode(y):
|
34 |
+
y = y.argmax(-1)
|
35 |
+
return [MyDataset.ctc_arr2txt(y[_], start=1) for _ in range(y.size(0))]
|
36 |
+
|
37 |
+
|
38 |
+
def test(model, net):
|
39 |
+
with torch.no_grad():
|
40 |
+
dataset = MyDataset(
|
41 |
+
opt.video_path,
|
42 |
+
opt.anno_path,
|
43 |
+
opt.coords_path,
|
44 |
+
opt.val_list,
|
45 |
+
opt.vid_padding,
|
46 |
+
opt.txt_padding,
|
47 |
+
"test",
|
48 |
+
)
|
49 |
+
|
50 |
+
print("num_test_data:{}".format(len(dataset.data)))
|
51 |
+
model.eval()
|
52 |
+
loader = dataset2dataloader(dataset, shuffle=False)
|
53 |
+
loss_list = []
|
54 |
+
wer = []
|
55 |
+
cer = []
|
56 |
+
crit = nn.CTCLoss()
|
57 |
+
tic = time.time()
|
58 |
+
print("RUNNING VALIDATION")
|
59 |
+
pbar = tqdm(loader)
|
60 |
+
for i_iter, input in enumerate(pbar):
|
61 |
+
vid = input.get("vid").cuda(non_blocking=opt.pin_memory)
|
62 |
+
txt = input.get("txt").cuda(non_blocking=opt.pin_memory)
|
63 |
+
vid_len = input.get("vid_len").cuda(non_blocking=opt.pin_memory)
|
64 |
+
txt_len = input.get("txt_len").cuda(non_blocking=opt.pin_memory)
|
65 |
+
coord = input.get("coord").cuda(non_blocking=opt.pin_memory)
|
66 |
+
|
67 |
+
y = net(vid, coord)
|
68 |
+
|
69 |
+
loss = (
|
70 |
+
crit(
|
71 |
+
y.transpose(0, 1).log_softmax(-1),
|
72 |
+
txt,
|
73 |
+
vid_len.view(-1),
|
74 |
+
txt_len.view(-1),
|
75 |
+
)
|
76 |
+
.detach()
|
77 |
+
.cpu()
|
78 |
+
.numpy()
|
79 |
+
)
|
80 |
+
loss_list.append(loss)
|
81 |
+
pred_txt = ctc_decode(y)
|
82 |
+
|
83 |
+
truth_txt = [MyDataset.arr2txt(txt[_], start=1) for _ in range(txt.size(0))]
|
84 |
+
wer.extend(MyDataset.wer(pred_txt, truth_txt))
|
85 |
+
cer.extend(MyDataset.cer(pred_txt, truth_txt))
|
86 |
+
if i_iter % opt.display == 0:
|
87 |
+
v = 1.0 * (time.time() - tic) / (i_iter + 1)
|
88 |
+
eta = v * (len(loader) - i_iter) / 3600.0
|
89 |
+
|
90 |
+
print("".join(101 * "-"))
|
91 |
+
print("{:<50}|{:>50}".format("predict", "truth"))
|
92 |
+
print("".join(101 * "-"))
|
93 |
+
for predict, truth in list(zip(pred_txt, truth_txt))[:10]:
|
94 |
+
print("{:<50}|{:>50}".format(predict, truth))
|
95 |
+
print("".join(101 * "-"))
|
96 |
+
print(
|
97 |
+
"test_iter={},eta={},wer={},cer={}".format(
|
98 |
+
i_iter, eta, np.array(wer).mean(), np.array(cer).mean()
|
99 |
+
)
|
100 |
+
)
|
101 |
+
print("".join(101 * "-"))
|
102 |
+
|
103 |
+
return (np.array(loss_list).mean(), np.array(wer).mean(), np.array(cer).mean())
|
104 |
+
|
105 |
+
|
106 |
+
def train(model, net):
|
107 |
+
dataset = MyDataset(
|
108 |
+
opt.video_path,
|
109 |
+
opt.anno_path,
|
110 |
+
opt.coords_path,
|
111 |
+
opt.train_list,
|
112 |
+
opt.vid_padding,
|
113 |
+
opt.txt_padding,
|
114 |
+
"train",
|
115 |
+
)
|
116 |
+
|
117 |
+
loader = dataset2dataloader(dataset)
|
118 |
+
optimizer = optim.Adam(
|
119 |
+
model.parameters(), lr=opt.base_lr, weight_decay=0.0, amsgrad=True
|
120 |
+
)
|
121 |
+
|
122 |
+
print("num_train_data:{}".format(len(dataset.data)))
|
123 |
+
crit = nn.CTCLoss()
|
124 |
+
tic = time.time()
|
125 |
+
|
126 |
+
train_wer = []
|
127 |
+
for epoch in range(opt.max_epoch):
|
128 |
+
print(f"RUNNING EPOCH {epoch}")
|
129 |
+
pbar = tqdm(loader)
|
130 |
+
|
131 |
+
for i_iter, input in enumerate(pbar):
|
132 |
+
model.train()
|
133 |
+
vid = input.get("vid").cuda(non_blocking=opt.pin_memory)
|
134 |
+
txt = input.get("txt").cuda(non_blocking=opt.pin_memory)
|
135 |
+
vid_len = input.get("vid_len").cuda(non_blocking=opt.pin_memory)
|
136 |
+
txt_len = input.get("txt_len").cuda(non_blocking=opt.pin_memory)
|
137 |
+
coord = input.get("coord").cuda(non_blocking=opt.pin_memory)
|
138 |
+
|
139 |
+
optimizer.zero_grad()
|
140 |
+
y = net(vid, coord)
|
141 |
+
loss = crit(
|
142 |
+
y.transpose(0, 1).log_softmax(-1),
|
143 |
+
txt,
|
144 |
+
vid_len.view(-1),
|
145 |
+
txt_len.view(-1),
|
146 |
+
)
|
147 |
+
loss.backward()
|
148 |
+
|
149 |
+
if opt.is_optimize:
|
150 |
+
optimizer.step()
|
151 |
+
|
152 |
+
tot_iter = i_iter + epoch * len(loader)
|
153 |
+
|
154 |
+
pred_txt = ctc_decode(y)
|
155 |
+
|
156 |
+
truth_txt = [MyDataset.arr2txt(txt[_], start=1) for _ in range(txt.size(0))]
|
157 |
+
train_wer.extend(MyDataset.wer(pred_txt, truth_txt))
|
158 |
+
|
159 |
+
if tot_iter % opt.display == 0:
|
160 |
+
v = 1.0 * (time.time() - tic) / (tot_iter + 1)
|
161 |
+
eta = (len(loader) - i_iter) * v / 3600.0
|
162 |
+
|
163 |
+
writer.add_scalar("train loss", loss, tot_iter)
|
164 |
+
writer.add_scalar("train wer", np.array(train_wer).mean(), tot_iter)
|
165 |
+
print("".join(101 * "-"))
|
166 |
+
print("{:<50}|{:>50}".format("predict", "truth"))
|
167 |
+
print("".join(101 * "-"))
|
168 |
+
|
169 |
+
for predict, truth in list(zip(pred_txt, truth_txt))[:3]:
|
170 |
+
print("{:<50}|{:>50}".format(predict, truth))
|
171 |
+
print("".join(101 * "-"))
|
172 |
+
print(
|
173 |
+
"epoch={},tot_iter={},eta={},loss={},train_wer={}".format(
|
174 |
+
epoch, tot_iter, eta, loss, np.array(train_wer).mean()
|
175 |
+
)
|
176 |
+
)
|
177 |
+
print("".join(101 * "-"))
|
178 |
+
|
179 |
+
if tot_iter % opt.test_step == 0:
|
180 |
+
(loss, wer, cer) = test(model, net)
|
181 |
+
print(
|
182 |
+
"i_iter={},lr={},loss={},wer={},cer={}".format(
|
183 |
+
tot_iter, show_lr(optimizer), loss, wer, cer
|
184 |
+
)
|
185 |
+
)
|
186 |
+
writer.add_scalar("val loss", loss, tot_iter)
|
187 |
+
writer.add_scalar("wer", wer, tot_iter)
|
188 |
+
writer.add_scalar("cer", cer, tot_iter)
|
189 |
+
savename = "{}_loss_{}_wer_{}_cer_{}.pt".format(
|
190 |
+
opt.save_prefix, loss, wer, cer
|
191 |
+
)
|
192 |
+
(path, name) = os.path.split(savename)
|
193 |
+
if not os.path.exists(path):
|
194 |
+
os.makedirs(path)
|
195 |
+
torch.save(model.state_dict(), savename)
|
196 |
+
if not opt.is_optimize:
|
197 |
+
exit()
|
198 |
+
|
199 |
+
|
200 |
+
if __name__ == "__main__":
|
201 |
+
print("Loading options...")
|
202 |
+
os.environ["CUDA_VISIBLE_DEVICES"] = opt.gpu
|
203 |
+
writer = SummaryWriter()
|
204 |
+
model = LipCoordNet()
|
205 |
+
model = model.cuda()
|
206 |
+
net = nn.DataParallel(model).cuda()
|
207 |
+
|
208 |
+
if hasattr(opt, "weights"):
|
209 |
+
pretrained_dict = torch.load(opt.weights)
|
210 |
+
model_dict = model.state_dict()
|
211 |
+
pretrained_dict = {
|
212 |
+
k: v
|
213 |
+
for k, v in pretrained_dict.items()
|
214 |
+
if k in model_dict.keys() and v.size() == model_dict[k].size()
|
215 |
+
}
|
216 |
+
|
217 |
+
# freeze the pretrained layers
|
218 |
+
for k, param in pretrained_dict.items():
|
219 |
+
param.requires_grad = False
|
220 |
+
|
221 |
+
missed_params = [
|
222 |
+
k for k, v in model_dict.items() if not k in pretrained_dict.keys()
|
223 |
+
]
|
224 |
+
print(
|
225 |
+
"loaded params/tot params:{}/{}".format(
|
226 |
+
len(pretrained_dict), len(model_dict)
|
227 |
+
)
|
228 |
+
)
|
229 |
+
print("miss matched params:{}".format(missed_params))
|
230 |
+
model_dict.update(pretrained_dict)
|
231 |
+
model.load_state_dict(model_dict)
|
232 |
+
|
233 |
+
torch.manual_seed(opt.random_seed)
|
234 |
+
torch.cuda.manual_seed_all(opt.random_seed)
|
235 |
+
torch.backends.cudnn.benchmark = True
|
236 |
+
|
237 |
+
train(model, net)
|