Update usage with infinity
#15 opened by michaelfeil
Ready for review.
docker run --gpus all -p "7997":"7997" michaelf34/infinity:0.0.70 v2 --model-id TencentBAC/Conan-embedding-v1 --dtype float16 --batch-size 32 --engine torch --port 7997
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-26 17:54:25,299 infinity_emb INFO: infinity_server.py:92
Creating 1engines:
engines=['TencentBAC/Conan-embedding-v1']
INFO 2024-11-26 17:54:25,302 infinity_emb INFO: Anonymized telemetry.py:30
telemetry can be disabled via environment variable
`DO_NOT_TRACK=1`.
INFO 2024-11-26 17:54:25,308 infinity_emb INFO: select_model.py:64
model=`TencentBAC/Conan-embedding-v1` selected,
using engine=`torch` and device=`None`
INFO 2024-11-26 17:54:25,644 SentenceTransformer.py:216
sentence_transformers.SentenceTransformer
INFO: Load pretrained SentenceTransformer:
TencentBAC/Conan-embedding-v1
INFO 2024-11-26 17:54:47,041 infinity_emb INFO: Adding acceleration.py:56
optimizations via Huggingface optimum.
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO 2024-11-26 17:54:47,509 infinity_emb INFO: Getting select_model.py:97
timings for batch_size=32 and avg tokens per
sentence=2
2.15 ms tokenization
17.79 ms inference
0.38 ms post-processing
20.32 ms total
embeddings/sec: 1575.01
INFO 2024-11-26 17:54:48,376 infinity_emb INFO: Getting select_model.py:103
timings for batch_size=32 and avg tokens per
sentence=512
14.81 ms tokenization
398.44 ms inference
0.52 ms post-processing
413.77 ms total
embeddings/sec: 77.34
INFO 2024-11-26 17:54:48,381 infinity_emb INFO: model select_model.py:104
warmed up, between 77.34-1575.01 embeddings/sec at
batch_size=32
INFO 2024-11-26 17:54:48,383 infinity_emb INFO: batch_handler.py:443
creating batching engine
INFO 2024-11-26 17:54:48,385 infinity_emb INFO: ready batch_handler.py:512
to batch requests.
INFO 2024-11-26 17:54:48,388 infinity_emb INFO: infinity_server.py:106
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.70
Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs
Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models
Visit the docs for more information:
https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
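
Once the log shows that Uvicorn is running, the server can be queried from a client. The following is a minimal Python sketch, not the project's official client. It assumes the container exposes an OpenAI-style `POST /embeddings` route at the address from the log and that the response carries a `data` list with `index` and `embedding` fields; neither appears in the output above, so check the Swagger UI at `/docs` before relying on them. The helper names (`build_embedding_request`, `parse_embeddings`, `embed`) are hypothetical.

```python
import json
import urllib.request

# Address and model id are taken from the startup log above.
# On some systems a client must use "localhost" instead of "0.0.0.0".
BASE_URL = "http://0.0.0.0:7997"
MODEL_ID = "TencentBAC/Conan-embedding-v1"


def build_embedding_request(texts, model=MODEL_ID, base_url=BASE_URL):
    """Build a POST request for the assumed OpenAI-style /embeddings route."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Return the vectors from an assumed OpenAI-style response, ordered by index."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, timeout=30.0):
    """Send the texts to a running server and return one vector per text."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embeddings(json.load(response))
```

With the container from the command above running, `embed(["first sentence", "second sentence"])` would be expected to return two vectors. Request building and response parsing are kept separate so both can be checked without a live server.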