# Mozilla Data Collective

> All API requests should be made to the following base URL:

---

# Source: https://datacollective.mozillafoundation.org/api-reference/docs

#### Base URL

All API requests should be made to the following base URL:

```
https://datacollective.mozillafoundation.org/api
```

#### Authentication

All authenticated endpoints require an API key in the Authorization header:

```
Authorization: Bearer YOUR_API_KEY
```

You can create and manage your API keys in your [profile settings](/profile/credentials).

### API Endpoints

[GET] `/datasets/:datasetId`

##### Get Dataset Details

Retrieves the details of a specific dataset.

###### Authentication

[Required] Bearer token in Authorization header

###### Path Parameters

`datasetId` string [Required] The ID of the dataset

###### Success Response (200 OK)

```
, "license": "CC0-1.0", "task": "ASR", "format": "MP3", "datasetUrl": "https://datacollective.mozillafoundation.org/datasets/dataset-1" }
```

###### Error Responses

- [404] Dataset not found
- [403] Access denied. Private dataset requires organization membership

[POST] `/datasets/:datasetId/download`

##### Create Download Session

Creates a download session and returns a download token. The user must have previously agreed to the dataset's terms of use through the web interface.

###### Authentication

[Required] Bearer token in Authorization header

###### Path Parameters

`datasetId` string [Required] The ID of the dataset

###### Success Response (200 OK)

```
```

###### Error Responses

- [403] You must agree to the terms of use before downloading this dataset
- [404] Dataset not found
- [401] Authentication required
- [429] Rate limit exceeded

[GET] `/datasets/:datasetId/download/:downloadToken`

##### Download Dataset File

Downloads the actual dataset file.

###### Authentication

[Required] Bearer token in Authorization header

###### Request Headers

`Range` string [Optional] Byte range for partial downloads, e.g. `Range: bytes=0-100`

###### Path Parameters

- `datasetId` string [Required] The ID of the dataset
- `downloadToken` string [Required] The temporary download token

###### Success Response (200 OK)

Response Headers:

- `Content-Length: 268435456000`
- `Content-Type: application/zip`
- `Content-Disposition: attachment; filename="common-voice-corpus-22.zip"`

```
Binary file data
```

###### Success Response (206 Partial Content)

Response Headers:

- `Content-Length: 101`
- `Content-Type: application/zip`
- `Content-Range: bytes 0-100/268435456000`
- `Content-Disposition: attachment; filename="common-voice-corpus-22.zip"`

```
Partial binary file data
```

###### Error Responses

- [401] Invalid or expired download token
- [404] Dataset or download session not found
- [416] Requested Range Not Satisfiable
- [429] Bandwidth limit exceeded

### Rate Limiting

The API employs organization-level rate limiting to ensure fair usage and stability. Rate limits apply to both API requests and bandwidth consumption.

##### Request Rate Limiting

When request limits are exceeded, the API responds with status code 429 and includes these headers:

- `X-RateLimit-Limit`: Total requests allowed in current window
- `X-RateLimit-Remaining`: Requests remaining in current window
- `Retry-After`: Seconds until next request allowed

##### Bandwidth Rate Limiting

Download endpoints enforce bandwidth limits at the organization level. When exceeded, connections are terminated with a 429 error.

```
}
```

### Implementation Notes

###### Single Use Downloads

Each download token can only be used for one complete download session. Once a file is fully downloaded, the token is invalidated.

###### Proxied Downloads

All downloads are proxied through the API server for real-time rate limiting, access control, and analytics tracking.

###### Terms Agreement Required

Users must agree to dataset terms through the web interface before downloading. API-only terms agreement is not supported.
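The download flow described above (create a download session, fetch the file with the returned token, back off on a 429) can be sketched with Python's standard library. This is an illustration under stated assumptions, not an official client: the helper names `session_request`, `file_request`, and `retry_delay` are hypothetical, the open-ended range form `bytes=N-` is standard HTTP but not confirmed by this reference, and the one-second fallback delay is arbitrary. The response field that carries the download token is not shown in this reference, so reading it is left to the caller.

```python
import urllib.request

BASE_URL = "https://datacollective.mozillafoundation.org/api"


def session_request(api_key: str, dataset_id: str) -> urllib.request.Request:
    """Build POST /datasets/:datasetId/download (creates a download session)."""
    return urllib.request.Request(
        f"{BASE_URL}/datasets/{dataset_id}/download",
        headers={"Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def file_request(api_key: str, dataset_id: str, download_token: str,
                 resume_from: int = 0) -> urllib.request.Request:
    """Build GET /datasets/:datasetId/download/:downloadToken."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if resume_from > 0:
        # Assumption: the server accepts an open-ended byte range,
        # i.e. "everything from this offset to the end of the file".
        headers["Range"] = f"bytes={resume_from}-"
    return urllib.request.Request(
        f"{BASE_URL}/datasets/{dataset_id}/download/{download_token}",
        headers=headers,
        method="GET",
    )


def retry_delay(status: int, headers, default: float = 1.0):
    """Seconds to wait after a 429, read from Retry-After; None otherwise."""
    if status != 429:
        return None
    try:
        return max(float(headers.get("Retry-After")), 0.0)
    except (TypeError, ValueError):
        # Header missing or not a plain number of seconds.
        return default
```

The request objects can be passed to `urllib.request.urlopen`. A 429 surfaces there as `urllib.error.HTTPError`, whose `code` and `headers` attributes can be handed to `retry_delay`.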
### Error Handling

##### Common Error Responses

[400]

###### Bad Request

Malformed request or invalid parameters

```
}
```

[401]

###### Unauthorized

Missing or invalid authentication

```
```

[429]

###### Too Many Requests

Rate limit exceeded

```
```

---

# Source: https://datacollective.mozillafoundation.org/api-reference

## Harness community-driven datasets with our API

**Version:** Beta

The Mozilla Data Collective API gives developers access to community-created datasets while empowering contributors to maintain control over their data.

### Mozilla Data Collective API at a glance

###### Create access credentials

Manage your API credentials by going to Profile > API

###### Secure your key

Store your access credentials in a secret key

###### Authentication

Provide your API key in your request header to authenticate

###### Select your dataset

Choose from over 300 global datasets to use

###### Agree to dataset terms

You will only be able to download datasets after accepting terms

###### Download

Use our REST endpoint or the MDC python library to get started

[Create API credentials](/profile/credentials)

### API Overview

Power your projects with diverse, ethically-created datasets that are just one REST call away.
```python
# Test code for connecting to Mozilla Data API
from datacollective import DataCollective

client = DataCollective()
client.get_dataset("mdc-dataset-id")
```

### Give it a try

Get up and running with datacollective-python, a Python library for authenticating and interacting with the MDC API.

[Python Library](https://pypi.org/project/datacollective/)

### Links & Docs

- [Get API Access](/profile/credentials)
- [Browse API Docs](/api-reference/docs)
- [Python Library](https://pypi.org/project/datacollective/)
- [Python Library Source](https://github.com/Mozilla-Data-Collective/datacollective-python)

[Privacy Policy](/privacy) | [Terms](/terms) | [Cookies](https://www.mozilla.org/en-US/privacy/websites/#cookies) | [FAQs](https://community.mozilladatacollective.com/tag/faq/) | [Participation Guidelines](https://www.mozilla.org/en-US/about/governance/policies/participation/) | [mozilladatacollective@mozillafoundation.org](mailto:mozilladatacollective@mozillafoundation.org)

Brought to you by [Mozilla Foundation](https://www.mozillafoundation.org)

---

# Source: https://datacollective.mozillafoundation.org/datasets

# Datasets

## Explore Datasets

### Featured Datasets

- **Common Voice Kinyarwanda**
  - High-quality speech data for machine learning applications
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjk758i00cfumk070r7nwve7)
- **Common Voice Chinese**
  - High-quality speech data for machine learning applications
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjhe0xap09gamb078g9loi3q)
- **Common Voice Spanish**
  - High-quality speech data for machine learning applications
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjepxo6t08nmmk07iauvua6v)
- **Common Voice Catalan**
  - High-quality speech data for machine learning applications
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjcc6g9z06c7mk07yolcdyjr)

## Datasets

- **Bamun-French Parallel Corpus**
  - This dataset is a parallel corpus of Bamun (Shupamem) to French texts. Texts were obtained by transcribing raw audio files.
    Translations were added to enrich the original corpus. The Bamun and French texts were aligned while creating this dataset.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjk758i00cfumk070r7nwve7)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmjk758i00cfumk070r7nwve7#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmjk758i00cfumk070r7nwve7#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmjk758i00cfumk070r7nwve7#size)
- **Surmiran Newspaper Corpus**
  - 2.9 million tokens in the Surmiran variety of Romansh from the daily newspaper “La Quotidiana”.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjhe0xap09gamb078g9loi3q)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmjhe0xap09gamb078g9loi3q#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmjhe0xap09gamb078g9loi3q#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmjhe0xap09gamb078g9loi3q#size)
- **DhoNam: Dholuo Speech dataset**
  - DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages.
  - This dataset contains native-speaker audio recordings collected through a platform where users read aloud a displayed sentence. The dataset includes the audio recordings and the corresponding prompt/sentence that was read.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjepxo6t08nmmk07iauvua6v)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmjepxo6t08nmmk07iauvua6v#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmjepxo6t08nmmk07iauvua6v#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmjepxo6t08nmmk07iauvua6v#size)
- **Archivo de la Comisionada María de los Ángeles Guzmán García (COTAI Nuevo León / InfoNL)**
  - This archive preserves the institutional and academic record of Dr. María de los Ángeles Guzmán García's tenure as Commissioner of the Comisión de Transparencia y Acceso a la Información del Estado de Nuevo León (COTAI / INFONL) from 2018 to 2025.
  - The dataset consolidates the documentary legacy of one of the most technical and academic figures in the Sistema Nacional de Transparencia. A Doctor of Constitutional Law from the Universidad Complutense de Madrid, Commissioner Guzmán García
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmjcc6g9z06c7mk07yolcdyjr)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmjcc6g9z06c7mk07yolcdyjr#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmjcc6g9z06c7mk07yolcdyjr#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmjcc6g9z06c7mk07yolcdyjr#size)
- **Common Voice Spontaneous Speech 2.0 - Kenyah**
  - A collection of spontaneous spoken phrases in Kenyah.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48hr006pnxzp3s43beqr)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48hr006pnxzp3s43beqr#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48hr006pnxzp3s43beqr#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48hr006pnxzp3s43beqr#size)
- **Common Voice Spontaneous Speech 2.0 - Ushojo**
  - A collection of spontaneous spoken phrases in Ushojo.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48hj006lnxzpnj14uhpz)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48hj006lnxzpnj14uhpz#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48hj006lnxzpnj14uhpz#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48hj006lnxzpnj14uhpz#size)
- **Common Voice Spontaneous Speech 2.0 - Kuku**
  - A collection of spontaneous spoken phrases in Kuku.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48hc006hnxzprn4k1cxx)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48hc006hnxzprn4k1cxx#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48hc006hnxzprn4k1cxx#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48hc006hnxzprn4k1cxx#size)
- **Common Voice Spontaneous Speech 2.0 - Rutoro**
  - A collection of spontaneous spoken phrases in Rutoro.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48h7006dnxzp3y4uqb69)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48h7006dnxzp3y4uqb69#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48h7006dnxzp3y4uqb69#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48h7006dnxzp3y4uqb69#size)
- **Common Voice Spontaneous Speech 2.0 - Turkish**
  - A collection of spontaneous spoken phrases in Turkish.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48h10069nxzpo6tghopr)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48h10069nxzpo6tghopr#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48h10069nxzpo6tghopr#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48h10069nxzpo6tghopr#size)
- **Common Voice Spontaneous Speech 2.0 - Amba**
  - A collection of spontaneous spoken phrases in Amba.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48gq0061nxzpb3y7vi7v)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48gq0061nxzpb3y7vi7v#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48gq0061nxzpb3y7vi7v#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48gq0061nxzpb3y7vi7v#size)
- **Common Voice Spontaneous Speech 2.0 - Ruuli**
  - A collection of spontaneous spoken phrases in Ruuli.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48fu005hnxzp78hiv9ll)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48fu005hnxzp78hiv9ll#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48fu005hnxzp78hiv9ll#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48fu005hnxzp78hiv9ll#size)
- **Common Voice Spontaneous Speech 2.0 - Russian**
  - A collection of spontaneous spoken phrases in Russian.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48ey004xnxzpphv4udzz)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48ey004xnxzpphv4udzz#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48ey004xnxzpphv4udzz#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48ey004xnxzpphv4udzz#size)
- **Common Voice Spontaneous Speech 2.0 - Puno Quechua**
  - A collection of spontaneous spoken phrases in Puno Quechua.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48et004tnxzps28psruc)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48et004tnxzps28psruc#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48et004tnxzps28psruc#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48et004tnxzps28psruc#size)
- **Common Voice Spontaneous Speech 2.0 - Western Penan**
  - A collection of spontaneous spoken phrases in Western Penan.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48eo004pnxzp991piql1)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48eo004pnxzp991piql1#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48eo004pnxzp991piql1#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48eo004pnxzp991piql1#size)
- **Common Voice Spontaneous Speech 2.0 - Sabah Malay**
  - A collection of spontaneous spoken phrases in Sabah Malay.
  - [Download](https://datacollective.mozillafoundation.org/datasets/cmj8u48ej004lnxzp8sdt5z8c)
  - [License](https://datacollective.mozillafoundation.org/datasets/cmj8u48ej004lnxzp8sdt5z8c#license)
  - [Format](https://datacollective.mozillafoundation.org/datasets/cmj8u48ej004lnxzp8sdt5z8c#format)
  - [Size](https://datacollective.mozillafoundation.org/datasets/cmj8u48ej004lnxzp8sdt5z8c#size)

---

# Source: https://raw.githubusercontent.com/Mozilla-Data-Collective/datacollective-python/main/README.md


# Mozilla Data Collective Python API Library

Python library for interfacing with the [Mozilla Data Collective](https://datacollective.mozillafoundation.org/) REST API.

## Installation

```bash
pip install datacollective
```

## Quick Start

1. **Get your API key** from the Mozilla Data Collective [dashboard](https://datacollective.mozillafoundation.org/api-reference)

2. **Set the API key in an environment variable (or create a `.env` file and add it there)**:

   ```
   export MDC_API_KEY=your-api-key-here
   ```

3. **Get your dataset ID from the last section of the dataset URL on the MDC website**. For example, in the URL `https://datacollective.mozillafoundation.org/datasets/cmflnuzw43exbql8uukllvnqg`, the dataset ID is `cmflnuzw43exbql8uukllvnqg`.

4. **Save a dataset locally**:

   ```
   from datacollective import save_dataset_to_disk

   dataset = save_dataset_to_disk("your-dataset-id")
   ```

5. **Get information & metadata about a dataset**:

   ```
   from datacollective import get_dataset_details

   details = get_dataset_details("your-dataset-id")
   ```

6. **Load the dataset into a pandas DataFrame _(Only Common Voice datasets are supported right now)_**:

   ```
   from datacollective import load_dataset

   dataset = load_dataset("your-dataset-id")
   ```

## For more details, visit [our docs](https://Mozilla-Data-Collective.github.io/datacollective-python/)

## License

This project is released under [MPL (Mozilla Public License) 2.0](./LICENSE).
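Step 3 of the Quick Start above takes the dataset ID from the last path segment of the dataset URL. A minimal sketch of that rule, using only the standard library, might look like this. The helper name `dataset_id_from_url` is hypothetical and is not part of the `datacollective` library.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url: str) -> str:
    """Return the dataset ID from an MDC dataset page URL.

    Hypothetical helper for illustration only; not part of datacollective.
    """
    # urlparse separates the path from any query string or #fragment.
    segments = [part for part in urlparse(url).path.split("/") if part]
    # A dataset URL is expected to end in .../datasets/<dataset-id>.
    if len(segments) < 2 or segments[-2] != "datasets":
        raise ValueError(f"not a dataset URL: {url!r}")
    return segments[-1]


example = "https://datacollective.mozillafoundation.org/datasets/cmflnuzw43exbql8uukllvnqg"
print(dataset_id_from_url(example))  # cmflnuzw43exbql8uukllvnqg
```

Because the fragment is parsed separately, links such as the `#license` or `#format` anchors on a dataset page resolve to the same ID.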

---

# Source: https://raw.githubusercontent.com/Mozilla-Data-Collective/datacollective-python/main/docs/api.md

# API Reference

::: datacollective.datasets

::: datacollective.api_utils

::: datacollective.dataset_loading_scripts.registry

::: datacollective.dataset_loading_scripts.common_voice

---

# Source: https://raw.githubusercontent.com/Mozilla-Data-Collective/datacollective-python/main/docs/index.md

# Mozilla Data Collective Python SDK Library

Welcome to the documentation for the `datacollective` Python client for the [Mozilla Data Collective](https://datacollective.mozillafoundation.org/) REST API.

This library helps you:

- Authenticate with the Mozilla Data Collective.
- Download datasets to local storage.
- Load supported datasets into AI-friendly formats, such as pandas DataFrames.

## Installation

Install from PyPI:

```bash
pip install datacollective
```

You can also use uv or other Python tooling, as long as the `datacollective` package is installed in your environment.

## Getting an API Key

To use the Mozilla Data Collective API, you need an API key:

1. Sign in to the Mozilla Data Collective dashboard.
2. Create or retrieve an API key from your account/settings page.
3. Keep your key secret and do not commit it to version control.

## Configuration

The client reads configuration from environment variables and `.env` files.

### Environment variables

Required:

- `MDC_API_KEY` - Your Mozilla Data Collective API key.

Optional:

- `MDC_API_URL` - API endpoint (defaults to the production URL).
- `MDC_DOWNLOAD_PATH` - Local directory where datasets will be downloaded (defaults to `~/.mozdata/datasets`).

Example using environment variables directly:

```bash
export MDC_API_KEY=your-api-key-here
export MDC_API_URL=https://datacollective.mozillafoundation.org/api
export MDC_DOWNLOAD_PATH=~/.mozdata/datasets
```

### `.env` file

The client will automatically load configuration from a `.env` file in your project root or current working directory.
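
To make the required and optional variables concrete, here is a minimal sketch of how such settings could be resolved from a mapping of variables. It is illustrative only and is not the library's internal code: `resolve_mdc_config` is a hypothetical helper, and the default API URL is an assumption based on the example values shown in this document.

```python
import os
from pathlib import Path
from typing import Mapping

# Assumed defaults, copied from the example values in this document.
DEFAULT_API_URL = "https://datacollective.mozillafoundation.org/api"
DEFAULT_DOWNLOAD_PATH = "~/.mozdata/datasets"


def resolve_mdc_config(env: Mapping[str, str]) -> dict:
    """Hypothetical helper: resolve MDC settings from a mapping of variables."""
    api_key = env.get("MDC_API_KEY")
    if not api_key:
        # MDC_API_KEY is the only required variable.
        raise ValueError("MDC_API_KEY is required but was not set")
    download_path = Path(env.get("MDC_DOWNLOAD_PATH", DEFAULT_DOWNLOAD_PATH))
    return {
        "api_key": api_key,
        "api_url": env.get("MDC_API_URL", DEFAULT_API_URL),
        # Expand "~" so the result is a usable filesystem path.
        "download_path": download_path.expanduser(),
    }
```

Calling `resolve_mdc_config(os.environ)` would apply this lookup to the current process environment; passing a plain dictionary instead makes the behaviour easy to check in isolation.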

Create a file named `.env`:

```bash
# MDC API Configuration
MDC_API_KEY=your-api-key-here
MDC_API_URL=https://datacollective.mozillafoundation.org/api
MDC_DOWNLOAD_PATH=~/.mozdata/datasets
```

> **Security note:** do not commit `.env` files to version control, as they
> contain secrets.

## Basic Usage

### Download a dataset

Use `save_dataset_to_disk` to download a dataset to the configured download path:

```python
from datacollective import save_dataset_to_disk

dataset = save_dataset_to_disk("your-dataset-id")

# Depending on the implementation, `dataset` may contain metadata
# about the downloaded files or a higher-level dataset object.
```

The files will be stored under `MDC_DOWNLOAD_PATH` (default `~/.mozdata/datasets`).

## Loading and Querying Datasets

> **Note:** in-memory dataset loading is currently supported only for certain datasets.

You can load supported datasets into memory and convert them to a `pandas` `DataFrame` for analysis:

```python
from datacollective import load_dataset

dataset = load_dataset("your-dataset-id")

# Convert to pandas
df = dataset.to_pandas()

# Inspect available splits (e.g., train, dev, test)
print(dataset.splits)
```

Once loaded into a `DataFrame`, you can use standard `pandas` operations to filter, aggregate, and analyze the data.

## Get dataset details

You can retrieve information from a dataset's datasheet without downloading the dataset:

```python
from datacollective import get_dataset_details

info = get_dataset_details("your-dataset-id")
print(info)
```

## API Reference

For a detailed API reference, see the [API Reference](api.md) section of the documentation.

## Release Workflow

> [!NOTE]
> This section is intended for maintainers of the `datacollective` library.

Check out the [Release Workflow](release.md) document for details on how to publish new versions of the library to PyPI using GitHub Actions.

---

# Source: https://raw.githubusercontent.com/Mozilla-Data-Collective/datacollective-python/main/docs/release.md

## Release Workflow

This repository uses GitHub Actions and branch-specific workflows for publishing releases.

### Branches

- `main` - primary development branch. When a pull request is merged into `main`, the repository workflow:
  - Runs the full check suite.
  - Bumps the version.
  - Opens a `release/vX.Y.Z` pull request back onto `main`. Auto-merge is enabled on that PR, so once required checks pass the version commit lands on `main` automatically.
- `test-pypi` - receives releases from `main` to deploy to TestPyPI.
- `pypi` - receives releases from `main` to deploy to the production PyPI index.

### Automated steps

1. **Prepare release on `main`**

   When a pull request is merged into `main`, the release workflow runs the full checks, performs the version bump, and opens the `release/vX.Y.Z` pull request onto `main`. That PR is configured to auto-merge once required checks complete, so the version commit is applied to `main` without manual intervention.

2. **Deploy to TestPyPI**

   Merge the updated `main` into `test-pypi` to deploy that version to TestPyPI. The following command runs automatically in the workflow:

   ```bash
   uv run python scripts/dev.py publish-test
   ```

3. **Deploy to PyPI**

   After validating the package on TestPyPI, merge `main` into `pypi` to deploy to production. The following command runs automatically in the workflow:

   ```bash
   uv run python scripts/dev.py publish
   ```

### Recommended local workflow

Before opening release-related pull requests:

1. Run the full checks without modifying files:

   ```bash
   uv run python scripts/dev.py all
   ```

2. Optionally rehearse the version bump locally:

   ```bash
   uv run python scripts/dev.py prepare-release
   ```

   The repository workflow performs the same steps when `main` changes.

3. Follow the branch merge order so TestPyPI receives the version before production:
   - `main` -> `test-pypi`
   - `main` -> `pypi`

### Required GitHub Actions secrets

- `TEST_PYPI_API_TOKEN` - token for publishing to TestPyPI (username `__token__`).
- `PYPI_API_TOKEN` - token for publishing to PyPI (username `__token__`).

---

# Source: https://raw.githubusercontent.com/Mozilla-Data-Collective/datacollective-python/main/src/datacollective/dataset_loading_scripts/README.md

`load_dataset()` needs dataset-specific logic to parse the downloaded files correctly into a pandas DataFrame. This directory contains the loading scripts that provide that logic for datasets hosted on Mozilla Data Collective.