title: DockerTester
emoji: 📚
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
RedPajama Dataset API
A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset
Overview
This application exposes a small HTTP API for exploring the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve chunks of data, search entries by keyword, and view a dataset summary. It is aimed at researchers and developers who work with large-scale language model datasets.
Features
1. Retrieve Dataset Chunks
Fetch smaller, manageable subsets of the dataset to explore or preprocess.
2. Search Data
Search for specific keywords in the dataset and retrieve the matching entries.
3. Dataset Summary
Get an overview of the dataset’s structure, including available splits.
Endpoints
| Endpoint | Method | Parameters | Description |
|---|---|---|---|
| / | GET | None | Displays a welcome message. |
| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
| /data_summary/ | GET | None | Displays a summary of the dataset. |
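The behavior behind these endpoints can be sketched with plain Python functions. This is only an illustration, not the application's actual code: the function names, the `text` field, the case-insensitive substring match, and the `SAMPLE` records are all assumptions made for the example.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]

# Hypothetical stand-in records; the real app reads RedPajama-Data-1T instead.
SAMPLE: List[Record] = [
    {"text": "An example sentence about llamas."},
    {"text": "Another record without the keyword."},
    {"text": "EXAMPLE in upper case."},
]


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return at most `chunk_size` records without consuming the rest."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return list(islice(records, chunk_size))


def search_records(
    records: Iterable[Record],
    keyword: str,
    max_results: int = 10,
    text_field: str = "text",  # assumed field name
) -> List[Record]:
    """Return up to `max_results` records whose text contains `keyword`."""
    if not keyword:
        raise ValueError("keyword is required")
    if max_results < 1:
        raise ValueError("max_results must be a positive integer")
    needle = keyword.lower()  # assumption: matching ignores case
    matches = (r for r in records if needle in str(r.get(text_field, "")).lower())
    return list(islice(matches, max_results))


def summarize(splits: Dict[str, Iterable[Record]]) -> Dict[str, List[str]]:
    """Report the names of the available splits in sorted order."""
    return {"splits": sorted(splits)}
```

Both `take_chunk` and `search_records` use `islice`, so they stop reading as soon as enough records are collected. That matters for a dataset of this size, where loading everything into memory before slicing is not practical.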
Getting Started
Prerequisites
• Python 3.8+
• pip for dependency management
Setup
1. Clone the repository:
git clone https://huggingface.co/spaces/Canstralian/DockerTester
cd DockerTester
2. Install dependencies:
pip install -r requirements.txt
3. Run the application:
uvicorn app:app --host 0.0.0.0 --port 8000
4. Access the API in your browser, or with a tool such as Postman, at:
http://127.0.0.1:8000
Example Usage
1. Retrieve a Small Chunk of Data
Fetch 5 examples from the dataset:
curl "http://127.0.0.1:8000/get_data/?chunk_size=5"
2. Search the Dataset
Search for the keyword "example" and return up to 3 results:
curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"
3. View Dataset Summary
Get an overview of available splits:
curl "http://127.0.0.1:8000/data_summary/"
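The same requests can be made from Python using only the standard library. The sketch below is a hypothetical client, not part of the project: `build_url` and `fetch_json` are names invented for this example, and it assumes the endpoints return JSON, which the README does not state explicitly.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # local address from the setup steps


def build_url(path: str, base_url: str = BASE_URL, **params: Any) -> str:
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    # Parameters set to None are skipped so the server default applies.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return f"{url}?{query}" if query else url


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """GET `url` and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_json(build_url("/search_data/", keyword="example", max_results=3))` issues the same request as the second curl command. Building the query string with `urlencode` also handles keywords that contain spaces or other reserved characters.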
Technologies Used
• FastAPI: For building the API.
• Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
• Uvicorn: For running the ASGI server.
• Python: Backend language.
Future Enhancements
• Add support for advanced filtering (e.g., by metadata or specific fields).
• Implement user authentication for restricted dataset access.
• Add visualization endpoints for dataset insights.
License
This project uses the Apache 2.0 License. Refer to the LICENSE file for more details.
Feel free to reach out with questions, feature requests, or contributions!