{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automatic curation of the Hugging Face Hub using Collections and the `huggingface_hub` library\n", "\n", "In this short tutorial, we will see how to create a Hugging Face Collection automatically using the `huggingface_hub` library. We'll focus on creating a collection that curates the top 10% most used instruction tuning datasets on the Hub. \n", "\n", "If you are already familiar with Collections and the `huggingface_hub` library, you can skip to the next section.\n", "\n", "## What is a Hugging Face Collection?\n", "\n", "Collections are a recently added feature on the Hugging Face Hub that unlock powerful new ways of curating what is on the Hub. With the Hub becoming the de facto platform for open-source machine learning models, it is important to be able to curate the content on the Hub. Collections allow you to do just that.\n", "\n", "Collections can be used to organize models, datasets, Spaces, and papers on the Hub in various ways. You could, for example, create collections around a particular use case, topic, or model architecture. You could also create collections that combine these things. In this tutorial, we will create a collection that curates the top 10% most used instruction tuning datasets on the Hub. We will do this using the `huggingface_hub` library.\n", "\n", "## So what is the `huggingface_hub` library?\n", "\n", "The `huggingface_hub` library is a Python library that allows you to interact with the Hugging Face Hub. It allows you to do things like upload and download models, datasets, and Spaces. Recently the library added support for creating and managing collections. This ability to programmatically create and manage collections unlocks a bunch of exciting new use cases. 
In this tutorial we'll show a few possibilities of what you can do with the `huggingface_hub` library and Collections but we're excited to see what you will do with it! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install packages\n", "\n", "For this tutorial, the only package we'll need outside of the Python standard library is the `huggingface_hub` library." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HKyybBVZ1hBh", "outputId": "ceaa1d1a-85e6-4015-e4e1-04bbe88d05cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting git+https://github.com/huggingface/huggingface_hub\n", " Cloning https://github.com/huggingface/huggingface_hub to /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-req-build-hs4ssvjo\n", " Running command git clone --filter=blob:none --quiet https://github.com/huggingface/huggingface_hub /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-req-build-hs4ssvjo\n", " Resolved https://github.com/huggingface/huggingface_hub to commit c32d4b31b679c9e91b906709631901f6aa85324d\n", " Installing build dependencies ... \u001b[?25ldone\n", "\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n", "\u001b[?25h Preparing metadata (pyproject.toml) ... 
\u001b[?25ldone\n", "\u001b[?25hRequirement already satisfied: filelock in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (3.12.4)\n", "Requirement already satisfied: fsspec>=2023.5.0 in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (2023.9.2)\n", "Requirement already satisfied: requests in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (2.31.0)\n", "Requirement already satisfied: tqdm>=4.42.1 in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (4.66.1)\n", "Requirement already satisfied: pyyaml>=5.1 in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (6.0.1)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (4.8.0)\n", "Requirement already satisfied: packaging>=20.9 in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (23.1)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in ./.venv/lib/python3.11/site-packages (from requests->huggingface-hub==0.18.0.dev0) (3.2.0)\n", "Requirement already satisfied: idna<4,>=2.5 in ./.venv/lib/python3.11/site-packages (from requests->huggingface-hub==0.18.0.dev0) (3.4)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in ./.venv/lib/python3.11/site-packages (from requests->huggingface-hub==0.18.0.dev0) (2.0.5)\n", "Requirement already satisfied: certifi>=2017.4.17 in ./.venv/lib/python3.11/site-packages (from requests->huggingface-hub==0.18.0.dev0) (2023.7.22)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install git+https://github.com/huggingface/huggingface_hub --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Authenticate\n", "\n", "In order to create and manage collections, you need to be authenticated. You can do this via the `huggingface_hub` library using the `login` function. 
This function will detect where you are running your code and suggest the best way to authenticate." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "Qn9p5Bsz2NN5" }, "outputs": [], "source": [ "from huggingface_hub import login" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 145, "referenced_widgets": [ "428d3687eb4342e59d23318099afe34f", "18f533e671114b6385428a534364f10a", "e4c0e23001254742a94898203a222c6c", "9d970a88c8c04bc586473251393aaec7", "9f8288bb8cae4796a067580ff7afce69", "077012b6f63e4148848c9b9e8726fb18", "14f46443f97c4b4fb46c5967aec1178f", "5aaff54fecb84936a8dc9fee4393494d", "aee4b5a2e361451dae879f37222245f3", "9dfa5a7ee7794a5d8396674db2c0b683", "3634abd523b7477082a0a8135f1fa770", "023691d310634e6e83da20b9575759a2", "dded08e463404a53abb86ac605968626", "9d99d6e39a424145b017abff9021d9a0", "38e81f2aed79485498035e9c418165b4", "37807c4db5834365b86bc92c36835220", "847b9e4085814f958a147286be4f56eb", "6efac326ad7946e3a9ecc22b50568633", "9d043d68e16440899a6fc9b740f5970d", "88242ee08b884098ad743c1738b7dc97", "862cc50e401845fa98054e6bd015a074", "4fe39f9b54474fb4966f71c7df0cf93e", "d97e9b98cade45eaa2b9b526b1a4bb98", "b7f608ef35d84fd7a736260236025429", "09ca4ed6420340e0b76d64949301bed4", "b2261f5044db4af0bed02f76115d08f9", "055c0da20e264ec896963f9edf372a7d", "d0f55244ec614704a571f920ffa27bfd", "3b77fe48c5e44c879998de497be7a381", "15032b9578124624bcc42771cb5d5ad8", "eca43f65d6c1407bbb16cd26f60d5b7f", "934d899f6c604fa1bf4a8108aa09b190" ] }, "id": "Sv15J3mW2Ous", "outputId": "e537a566-8cdd-4316-bb69-72f5353345da" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b3c0b966dfeb400c953e3e13689d5a0d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='