Common Data Sources for Data Professionals
1. Public Datasets & Open Data Portals
- Kaggle – Machine learning & analytics competitions with free datasets.
- Google Dataset Search – Search engine for publicly available datasets.
- UCI Machine Learning Repository – Classic datasets for research & practice.
- Data.gov – U.S. government open data.
- EU Open Data Portal (now data.europa.eu) – European Union datasets.
- World Bank Data – Global economic, financial, and development statistics.
- UN Data – United Nations statistics and global indicators.
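
Most of the portals above offer CSV downloads, so a first step is often loading the file and checking for missing values. The following is a minimal sketch using only the Python standard library. The inline sample and its column names (country, year, indicator_value) are hypothetical placeholders, not real statistics from any portal.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded file; the column names
# and values are placeholders, not real statistics.
SAMPLE_CSV = """country,year,indicator_value
Country A,2020,12.5
Country B,2020,
Country A,2021,13.1
"""


def load_rows(csv_text):
    """Parse CSV text into dicts, converting types and marking blanks as None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        value = raw["indicator_value"].strip()
        rows.append(
            {
                "country": raw["country"],
                "year": int(raw["year"]),
                "indicator_value": float(value) if value else None,
            }
        )
    return rows


rows = load_rows(SAMPLE_CSV)
# Keep only rows where the indicator was actually reported.
complete = [r for r in rows if r["indicator_value"] is not None]
print(f"{len(rows)} rows loaded, {len(complete)} with a value")
```

For a real download, the same function could be fed the contents of the file read from disk. Larger files are usually easier to handle with a dataframe library.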
2. Specialized Data Sources
- ChEMBL / PubChem – Chemical compounds, bioactivity data, drug targets.
- IMDb Datasets – Movies, TV shows, ratings, crew, and cast info.
- GitHub Public Datasets – Code repositories, commit history, developer activity.
- FAOSTAT – Global agriculture, livestock, and food production statistics.
- ClinicalTrials.gov – Information on registered clinical trials worldwide.
- GISAID – Global virus genome data (e.g., COVID-19, influenza).
- Human Genome Project / Ensembl – Genomic sequences and annotations.
- OECD Data – Economic, social, and environmental indicators.
- NASA Earthdata – Satellite imagery, climate, and geospatial datasets.
- WHO Global Health Observatory (GHO) – Public health statistics and disease data.
- OpenStreetMap (OSM) – Free geographic and mapping data.
- WorldClim – Global climate data for modeling and GIS analysis.
- UNESCO Institute for Statistics (UIS) – Education, science, and cultural statistics.
- Global Biodiversity Information Facility (GBIF) – Biodiversity occurrence records.
- Trade Map (ITC) – International trade statistics and market access data.
3. Company & Internal Sources
- CRM Systems – Customer Relationship Management data (e.g., Salesforce, HubSpot).
- Transaction Databases – Sales, purchases, invoices, etc.
- Log Files – Website, server, or application logs.
- ERP Systems – Enterprise Resource Planning data (e.g., SAP, Oracle).
- Customer Feedback Forms – Surveys, reviews, NPS scores.
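
Of the internal sources above, log files usually need the most parsing before they can be analyzed. The sketch below assumes lines in the Common Log Format. The sample lines are made up (they use a documentation-only IP range), and real logs often carry extra fields, so the pattern is a starting point and not a general parser.

```python
import re
from collections import Counter

# Pattern for the Common Log Format used by many web servers. Real logs
# often carry extra fields, so treat this as a starting point.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample lines (documentation IP range 192.0.2.0/24), not real traffic.
SAMPLE_LOG = """\
192.0.2.10 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.11 - - [10/Oct/2024:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153
192.0.2.10 - - [10/Oct/2024:13:56:02 +0000] "POST /api/orders HTTP/1.1" 200 512
this line is malformed and should be skipped
"""


def parse_log(text):
    """Return one dict per line that matches the pattern; skip the rest."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            records.append(record)
    return records


records = parse_log(SAMPLE_LOG)
# Count responses per HTTP status code as a quick health check.
status_counts = Counter(r["status"] for r in records)
print(status_counts)
```

Skipping lines that do not match keeps the parser tolerant of truncated or corrupted entries. In practice it is worth counting the skipped lines too.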
4. APIs & Web Services
- Twitter API / X API – Social media posts & engagement data.
- Google Analytics API – Website traffic & user behavior.
- OpenWeatherMap API – Weather data.
- Spotify API – Music catalog metadata (tracks, artists, playlists).
- Financial APIs – Yahoo Finance (unofficial), Alpha Vantage, and Nasdaq Data Link (formerly Quandl) for stock market data.
5. Web Scraping
- BeautifulSoup / Scrapy – Python tools to extract data from websites.
- Import.io / Octoparse – No-code scraping tools.
- LinkedIn (official API where available) – Company & career data; any collection must follow LinkedIn's terms of service.
6. Cloud Data Warehouses & Data Lakes
- Google BigQuery – Serverless data warehouse with hosted public datasets.
- Amazon S3 / AWS Data Exchange – Object storage and a marketplace of hosted datasets for analysis.
- Microsoft Azure Data Lake – Storage for structured & unstructured cloud data.
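
For the API-based sources listed earlier (OpenWeatherMap, for example), the usual pattern is to build a parameterized request URL and flatten the JSON response into a record. The sketch below makes no network call. The endpoint and field names are assumptions based on OpenWeatherMap's commonly documented current-weather API, YOUR_API_KEY is a placeholder, and the sample payload is abbreviated and made up.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: OpenWeatherMap's current-weather URL as commonly documented.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_request_url(city, api_key, units="metric"):
    """Return the full request URL with URL-encoded query parameters."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{BASE_URL}?{query}"


# Abbreviated, made-up payload shaped like a current-weather response.
SAMPLE_RESPONSE = """
{
  "name": "Example City",
  "main": {"temp": 18.4, "humidity": 62},
  "weather": [{"main": "Clouds", "description": "scattered clouds"}]
}
"""


def summarize(payload_text):
    """Flatten the nested JSON into a single analysis-friendly record."""
    data = json.loads(payload_text)
    conditions = data.get("weather") or [{}]
    return {
        "city": data.get("name"),
        "temp": data.get("main", {}).get("temp"),
        "humidity": data.get("main", {}).get("humidity"),
        "description": conditions[0].get("description"),
    }


url = build_request_url("New York", "YOUR_API_KEY")
record = summarize(SAMPLE_RESPONSE)
print(url)
print(record)
```

A real call would send the URL with an HTTP client and pass the response body to summarize. Check the provider's current documentation for rate limits and authentication before relying on any endpoint.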