As the title says, I was wondering whether this implements the same gradient checkpointing and flash memory system that Vicuna uses for 4x the context?
Do you mean flash attention?