Downloading files
huggingface_hub.hf_hub_download
( repo_id: str, filename: str, subfolder: typing.Optional[str] = None, repo_type: typing.Optional[str] = None, revision: typing.Optional[str] = None, library_name: typing.Optional[str] = None, library_version: typing.Optional[str] = None, cache_dir: typing.Union[str, pathlib.Path, NoneType] = None, user_agent: typing.Union[typing.Dict, str, NoneType] = None, force_download: typing.Optional[bool] = False, force_filename: typing.Optional[str] = None, proxies: typing.Optional[typing.Dict] = None, etag_timeout: typing.Optional[float] = 10, resume_download: typing.Optional[bool] = False, use_auth_token: typing.Union[bool, str, NoneType] = None, local_files_only: typing.Optional[bool] = False, legacy_cache_layout: typing.Optional[bool] = False )
Parameters
- repo_id (str) -- A user or an organization name and a repo name, separated by a /.
- filename (str) -- The name of the file in the repo.
- subfolder (str, optional) -- An optional value corresponding to a folder inside the model repo.
- repo_type (str, optional) -- Set to "dataset" or "space" if downloading from a dataset or space, None or "model" if downloading from a model. Default is None.
- revision (str, optional) -- An optional Git revision id which can be a branch name, a tag, or a commit hash.
- library_name (str, optional) -- The name of the library to which the object corresponds.
- library_version (str, optional) -- The version of the library.
- cache_dir (str, Path, optional) -- Path to the folder where cached files are stored.
- user_agent (dict, str, optional) -- The user-agent info in the form of a dictionary or a string.
- force_download (bool, optional, defaults to False) -- Whether the file should be downloaded even if it already exists in the local cache.
- proxies (dict, optional) -- Dictionary mapping protocol to the URL of the proxy, passed to requests.request.
- etag_timeout (float, optional, defaults to 10) -- When fetching the ETag, how many seconds to wait for the server to send data before giving up. This value is passed to requests.request.
- resume_download (bool, optional, defaults to False) -- If True, resume a previously interrupted download.
- use_auth_token (str, bool, optional) -- A token to be used for the download.
  - If True, the token is read from the HuggingFace config folder.
  - If a string, it is used as the authentication token.
- local_files_only (bool, optional, defaults to False) -- If True, avoid downloading the file and return the path to the local cached file if it exists.
- legacy_cache_layout (bool, optional, defaults to False) -- If True, uses the legacy file cache layout, i.e. just calls hf_hub_url() then cached_download. This is deprecated, as the new cache layout is more powerful.
Download a given file if it's not already present in the local cache.
The new cache file layout looks like this:
- The cache directory contains one subfolder per repo_id (namespaced by repo type).
- Inside each repo folder:
  - refs is a list of the latest known revision => commit_hash pairs.
  - blobs contains the actual file blobs (identified by their git-sha or sha256, depending on whether they're LFS files or not).
  - snapshots contains one subfolder per commit. Each "commit" contains the subset of the files that have been resolved at that particular commit. Each filename is a symlink to the blob at that particular commit.
[  96]  .
└── [ 160]  models--julien-c--EsperBERTo-small
    ├── [ 160]  blobs
    │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
    ├── [  96]  refs
    │   └── [  40]  main
    └── [ 128]  snapshots
        ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
        │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
        │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
            ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
            └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
Raises the following errors:
- EnvironmentError if use_auth_token=True and the token cannot be found.
- OSError if ETag cannot be determined.
- ValueError if some parameter value is invalid.
- RepositoryNotFoundError if the repository to download from cannot be found. This may be because it doesn't exist, or because it is set to private and you do not have access.
- RevisionNotFoundError if the revision to download from cannot be found.
- EntryNotFoundError if the file to download cannot be found.
- LocalEntryNotFoundError if network is disabled or unavailable and the file is not found in cache.
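To illustrate how local_files_only can be combined with a regular download, the sketch below tries the local cache first and only falls back to a network download on a cache miss. The helper name fetch_cached_first and its download parameter are hypothetical: they are introduced here so the logic can be exercised offline. Which exception is raised on a cache miss may depend on the installed version, so the sketch assumes it is a subclass of OSError or ValueError.

```python
from typing import Callable, Optional


def fetch_cached_first(
    repo_id: str,
    filename: str,
    revision: Optional[str] = None,
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path for `filename`, preferring the local cache.

    `download` is a hypothetical injection point. It defaults to
    `huggingface_hub.hf_hub_download` and exists only so that the
    sketch can be tested without network access.
    """
    if download is None:
        # Imported lazily so this module loads even without the library.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download

    try:
        # First attempt: never touch the network.
        return download(
            repo_id=repo_id,
            filename=filename,
            revision=revision,
            local_files_only=True,
        )
    except (OSError, ValueError):
        # Assumption: a cache miss surfaces as OSError or ValueError.
        return download(repo_id=repo_id, filename=filename, revision=revision)
```

Calling fetch_cached_first("julien-c/EsperBERTo-small", "pytorch_model.bin") would then skip the network round trip whenever the file is already cached.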
huggingface_hub.snapshot_download
( repo_id: str, revision: typing.Optional[str] = None, repo_type: typing.Optional[str] = None, cache_dir: typing.Union[str, pathlib.Path, NoneType] = None, library_name: typing.Optional[str] = None, library_version: typing.Optional[str] = None, user_agent: typing.Union[typing.Dict, str, NoneType] = None, proxies: typing.Optional[typing.Dict] = None, etag_timeout: typing.Optional[float] = 10, resume_download: typing.Optional[bool] = False, use_auth_token: typing.Union[bool, str, NoneType] = None, local_files_only: typing.Optional[bool] = False, allow_regex: typing.Union[typing.List[str], str, NoneType] = None, ignore_regex: typing.Union[typing.List[str], str, NoneType] = None, allow_patterns: typing.Union[typing.List[str], str, NoneType] = None, ignore_patterns: typing.Union[typing.List[str], str, NoneType] = None )
Parameters
- repo_id (str) -- A user or an organization name and a repo name, separated by a /.
- revision (str, optional) -- An optional Git revision id which can be a branch name, a tag, or a commit hash.
- repo_type (str, optional) -- Set to "dataset" or "space" if downloading from a dataset or space, None or "model" if downloading from a model. Default is None.
- cache_dir (str, Path, optional) -- Path to the folder where cached files are stored.
- library_name (str, optional) -- The name of the library to which the object corresponds.
- library_version (str, optional) -- The version of the library.
- user_agent (str, dict, optional) -- The user-agent info in the form of a dictionary or a string.
- proxies (dict, optional) -- Dictionary mapping protocol to the URL of the proxy, passed to requests.request.
- etag_timeout (float, optional, defaults to 10) -- When fetching the ETag, how many seconds to wait for the server to send data before giving up. This value is passed to requests.request.
- resume_download (bool, optional, defaults to False) -- If True, resume a previously interrupted download.
- use_auth_token (str, bool, optional) -- A token to be used for the download.
  - If True, the token is read from the HuggingFace config folder.
  - If a string, it is used as the authentication token.
- local_files_only (bool, optional, defaults to False) -- If True, avoid downloading the file and return the path to the local cached file if it exists.
- allow_patterns (List[str] or str, optional) -- If provided, only files matching at least one pattern are downloaded.
- ignore_patterns (List[str] or str, optional) -- If provided, files matching any of the patterns are not downloaded.
Download all files of a repo.
Downloads a whole snapshot of a repo's files at the specified revision. This is useful when you want all files from a repo, because you don't know which ones you will need a priori. All files are nested inside a folder in order to keep their actual filename relative to that folder.
An alternative would be to just clone a repo but this would require that the user always has git and git-lfs installed, and properly configured.
Raises the following errors:
- EnvironmentError if use_auth_token=True and the token cannot be found.
- OSError if ETag cannot be determined.
- ValueError if some parameter value is invalid.
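The allow_patterns and ignore_patterns parameters described above can be pictured as a two-stage filter over the list of files in the repo. The following is a minimal sketch, assuming shell-style wildcards as implemented by Python's fnmatch module; the matching rules actually used by the library may differ. The function name filter_repo_files is hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional, Union

Patterns = Optional[Union[List[str], str]]


def _as_list(patterns: Patterns) -> Optional[List[str]]:
    """Normalize a single pattern string into a one-element list."""
    if isinstance(patterns, str):
        return [patterns]
    return patterns


def filter_repo_files(
    files: Iterable[str],
    allow_patterns: Patterns = None,
    ignore_patterns: Patterns = None,
) -> List[str]:
    """Keep files matching any allow pattern and no ignore pattern."""
    allow = _as_list(allow_patterns)
    ignore = _as_list(ignore_patterns)

    kept = []
    for path in files:
        # Skip files that match none of the allow patterns (if any given).
        if allow is not None and not any(fnmatch(path, p) for p in allow):
            continue
        # Skip files that match at least one ignore pattern (if any given).
        if ignore is not None and any(fnmatch(path, p) for p in ignore):
            continue
        kept.append(path)
    return kept


repo_files = ["README.md", "config.json", "pytorch_model.bin", "tf_model.h5"]
only_json = filter_repo_files(repo_files, allow_patterns="*.json")
no_weights = filter_repo_files(repo_files, ignore_patterns=["*.bin", "*.h5"])
```

With the file list above, only_json keeps just config.json, while no_weights keeps README.md and config.json.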
huggingface_hub.hf_hub_url
( repo_id: str, filename: str, subfolder: typing.Optional[str] = None, repo_type: typing.Optional[str] = None, revision: typing.Optional[str] = None )
Parameters
- repo_id (str) -- A namespace (user or an organization) name and a repo name, separated by a /.
- filename (str) -- The name of the file in the repo.
- subfolder (str, optional) -- An optional value corresponding to a folder inside the repo.
- repo_type (str, optional) -- Set to "dataset" or "space" if the file belongs to a dataset or space, None or "model" if it belongs to a model. Default is None.
- revision (str, optional) -- An optional Git revision id which can be a branch name, a tag, or a commit hash.
Construct the URL of a file from the given information.
The resolved address can either be a huggingface.co-hosted URL, or a link to Cloudfront (a Content Delivery Network, or CDN) for large files, i.e. files larger than a few MB.
Example:
>>> from huggingface_hub import hf_hub_url
>>> hf_hub_url(
... repo_id="julien-c/EsperBERTo-small", filename="pytorch_model.bin"
... )
'https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin'
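The shape of the URL returned in the example can be reproduced with plain string formatting. The sketch below is an illustration only, not the library's implementation: the function name build_file_url is hypothetical, and the "datasets/" and "spaces/" prefixes as well as the default revision "main" are assumptions based on how repositories are addressed on huggingface.co.

```python
from typing import Optional
from urllib.parse import quote

ENDPOINT = "https://huggingface.co"
# Assumption: dataset and space repos live under a URL prefix, models do not.
URL_PREFIXES = {"dataset": "datasets/", "space": "spaces/"}


def build_file_url(
    repo_id: str,
    filename: str,
    subfolder: Optional[str] = None,
    repo_type: Optional[str] = None,
    revision: Optional[str] = None,
) -> str:
    """Assemble a "resolve" URL for a file in a repo."""
    if repo_type not in (None, "model", "dataset", "space"):
        raise ValueError(f"Invalid repo type: {repo_type!r}")
    if subfolder:
        filename = f"{subfolder}/{filename}"

    prefix = URL_PREFIXES.get(repo_type, "")
    # Assumption: the default branch is "main". The revision is
    # percent-encoded so that names containing "/" stay one path segment.
    revision = quote(revision or "main", safe="")
    return f"{ENDPOINT}/{prefix}{repo_id}/resolve/{revision}/{filename}"


url = build_file_url("julien-c/EsperBERTo-small", "pytorch_model.bin")
```

Under these assumptions, url is identical to the value shown in the example above.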
Notes:
Cloudfront is replicated over the globe so downloads are way faster for the end user (and it also lowers our bandwidth costs).
Cloudfront aggressively caches files by default (default TTL is 24 hours), however this is not an issue here because we implement a git-based versioning system on huggingface.co, which means that we store the files on S3/Cloudfront in a content-addressable way (i.e., the file name is its hash). Using content-addressable filenames means cache can't ever be stale.
In terms of client-side caching from this library, we base our caching on the objects' entity tag (ETag), which is an identifier of a specific version of a resource [1]. An object's ETag is: its git-sha1 if stored in git, or its sha256 if stored in git-lfs.
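The two kinds of identifiers mentioned above can be computed locally with hashlib. This is a minimal sketch: the git-sha1 follows Git's standard blob hashing (a "blob <size>" header, a NUL byte, then the content), and the sha256 is taken over the raw file content, as Git LFS does. The function names are hypothetical, and how the server actually formats the ETag header is not covered here.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash content the way Git hashes a blob object (40 hex chars)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def lfs_sha256(content: bytes) -> str:
    """Hash content the way Git LFS identifies an object (64 hex chars)."""
    return hashlib.sha256(content).hexdigest()


readme = b"hello\n"
print(git_blob_sha1(readme))
print(lfs_sha256(readme))
```

The differing digest lengths explain why a cache can contain both 40-character names (regular git files) and 64-character names (LFS files).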
References:
Caching
The methods displayed above are designed to work with a caching system that prevents re-downloading files. The caching system was updated in v0.8.0 to allow directory structure and file sharing across libraries that depend on the hub.
The caching system is designed as follows:
<CACHE_DIR>
├─ <MODELS>
├─ <DATASETS>
├─ <SPACES>
The <CACHE_DIR> is usually located in your user's home directory. However, it is customizable with the cache_dir argument on all methods, or by specifying the HF_HOME environment variable.
Models, datasets and spaces share a common root. Each of these repositories contains the namespace (organization, username) if it exists, alongside the repository name:
<CACHE_DIR>
├─ models--julien-c--EsperBERTo-small
├─ models--lysandrejik--arxiv-nlp
├─ models--bert-base-cased
├─ datasets--glue
├─ datasets--huggingface--DataMeasurementsFiles
├─ spaces--dalle-mini--dalle-mini
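The folder names listed above follow a regular pattern: the pluralized repo type, then the optional namespace, then the repo name, joined by a double dash. A minimal sketch of that naming rule, inferred from the listing (the function name repo_folder_name is hypothetical):

```python
def repo_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Derive the cache folder name for a repo.

    `repo_id` is either "name" or "namespace/name", and `repo_type`
    is one of "model", "dataset" or "space".
    """
    # "model" -> "models", "dataset" -> "datasets", "space" -> "spaces"
    parts = [f"{repo_type}s", *repo_id.split("/")]
    return "--".join(parts)


print(repo_folder_name("julien-c/EsperBERTo-small"))
print(repo_folder_name("glue", repo_type="dataset"))
```

The two calls print models--julien-c--EsperBERTo-small and datasets--glue, matching the entries in the listing.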
It is within these folders that all files will now be downloaded from the hub. Caching ensures that a file isn't downloaded twice if it already exists and wasn't updated; but if it was updated, and you're asking for the latest file, then it will download the latest file (while keeping the previous file intact in case you need it again).
In order to achieve this, all folders contain the same skeleton:
<CACHE_DIR>
├─ datasets--glue
│  ├─ refs
│  ├─ blobs
│  ├─ snapshots
...
Each folder is designed to contain the following:
Refs
The refs folder contains files which indicate the latest revision of the given reference. For example, if we have previously fetched a file from the main branch of a repository, the refs folder will contain a file named main, which will itself contain the commit identifier of the current head.
If the latest commit of main has aaaaaa as identifier, then it will contain aaaaaa.
If that same branch gets updated with a new commit that has bbbbbb as an identifier, then redownloading a file from that reference will update the refs/main file to contain bbbbbb.
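The behavior of the refs folder can be mimicked with two small helpers that read and overwrite a reference file. This is a sketch of the idea only, run against a temporary directory rather than a real cache; the helper names read_ref and update_ref are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_ref(repo_dir: Path, ref: str) -> Optional[str]:
    """Return the commit identifier stored in refs/<ref>, if any."""
    ref_file = repo_dir / "refs" / ref
    if not ref_file.is_file():
        return None
    return ref_file.read_text().strip()


def update_ref(repo_dir: Path, ref: str, commit_hash: str) -> None:
    """Point refs/<ref> at a new commit identifier."""
    ref_file = repo_dir / "refs" / ref
    ref_file.parent.mkdir(parents=True, exist_ok=True)
    ref_file.write_text(commit_hash)


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "datasets--glue"
    update_ref(repo_dir, "main", "aaaaaa")  # first fetch from main
    first = read_ref(repo_dir, "main")
    update_ref(repo_dir, "main", "bbbbbb")  # main moved to a new commit
    second = read_ref(repo_dir, "main")

print(first, second)  # prints: aaaaaa bbbbbb
```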
Blobs
The blobs folder contains the actual files that we have downloaded. Each file is named after its hash.
Snapshots
The snapshots folder contains symlinks to the blobs mentioned above. It is itself made up of several folders: one per known revision!
In the explanation above, we had initially fetched a file from the aaaaaa revision, before fetching a file from the bbbbbb revision. In this situation, we would now have two folders in the snapshots folder: aaaaaa and bbbbbb.
In each of these folders live symlinks that have the names of the files that we have downloaded. For example, if we had downloaded the README.md file at revision aaaaaa, we would have the following path:
<CACHE_DIR>/<REPO_NAME>/snapshots/aaaaaa/README.md
That README.md file is actually a symlink linking to the blob that has the hash of the file.
Creating the skeleton this way opens up the mechanism to file sharing: if the same file was fetched in revision bbbbbb, it would have the same hash and the file would not need to be redownloaded.
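The file sharing described above can be reproduced in a temporary directory. The sketch below stores content once under blobs and exposes it through a relative symlink in each snapshot folder. It is an illustration under simplifying assumptions: every blob is named by its sha256 (the real cache also uses git-sha1 names), the helper name store_file is hypothetical, and creating symlinks may require extra privileges on Windows.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_file(repo_dir: Path, revision: str, filename: str, content: bytes) -> Path:
    """Write `content` as a blob and link it into snapshots/<revision>/."""
    # Assumption: sha256 names for all blobs, to keep the sketch short.
    blob_path = repo_dir / "blobs" / hashlib.sha256(content).hexdigest()
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    if not blob_path.exists():
        # Only "download" (here: write) the content if the blob is missing.
        blob_path.write_bytes(content)

    link_path = repo_dir / "snapshots" / revision / filename
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if not link_path.is_symlink():
        # Relative target, e.g. ../../blobs/<hash>, so the cache can be moved.
        os.symlink(os.path.relpath(blob_path, link_path.parent), link_path)
    return link_path


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "models--julien-c--EsperBERTo-small"
    link_a = store_file(repo_dir, "aaaaaa", "README.md", b"hello")
    link_b = store_file(repo_dir, "bbbbbb", "README.md", b"hello")

    blob_count = len(list((repo_dir / "blobs").iterdir()))
    same_target = link_a.resolve() == link_b.resolve()
    link_target = os.readlink(link_a)
```

Both revisions expose a README.md, yet only one blob exists on disk, because the two symlinks resolve to the same content-addressed file.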
In practice
In practice, it should look like the following tree in your cache:
[  96]  .
└── [ 160]  models--julien-c--EsperBERTo-small
    ├── [ 160]  blobs
    │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
    ├── [  96]  refs
    │   └── [  40]  main
    └── [ 128]  snapshots
        ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
        │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
        │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
            ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
            └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd