# README: Public Dataset Access

## Overview

This README file provides details about accessing and utilizing the public dataset available at the following link:

[Google Cloud Storage - Public Dataset](https://console.cloud.google.com/storage/browser/dx-scin-public-data/dataset?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22)))

This dataset is hosted on Google Cloud Storage and is publicly accessible for research, analysis, and educational purposes. The dataset contains various files structured to support users in their work with data science, analytics, and related fields.

---

## Accessing the Dataset

### Prerequisites

- A Google account (optional for browsing but required for advanced actions like copying data to your Google Cloud Storage bucket).
- Internet access and a compatible browser.

### Instructions

1. Open the provided [dataset link](https://console.cloud.google.com/storage/browser/dx-scin-public-data/dataset?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))).
2. Navigate through the folder structure to locate the desired files.
3. Use the download button for individual files or use the `gsutil` command-line tool for bulk downloads (details below).

---

## Downloading Data

### Using the Browser

- Navigate to the desired file or folder.
- Click the download icon next to the file name.

### Using `gsutil`

Google's `gsutil` tool can be used for more robust and automated access. To install `gsutil`, refer to the [installation guide](https://cloud.google.com/storage/docs/gsutil_install).

**Example commands:**

```bash
# Copy a single file
gsutil cp gs://dx-scin-public-data/dataset/filename.ext ./

# Copy an entire folder
gsutil -m cp -r gs://dx-scin-public-data/dataset/ ./
```

---

## Dataset Structure

The dataset is organized into folders and files.
Below is an example structure:

```
/dataset
├── subfolder1
│   ├── file1.csv
│   └── file2.json
├── subfolder2
│   └── file3.txt
└── README.txt
```

### File Formats

- **CSV:** Structured tabular data, compatible with data analysis tools like Python (pandas), R, or Excel.
- **JSON:** Lightweight data-interchange format, suitable for APIs or programmatic parsing.
- **TXT:** Plain text files, often containing documentation or metadata.

---

## Usage Guidelines

- This dataset is publicly available and may be used for non-commercial purposes.
- Please acknowledge the source if used in publications or projects.
- Always check individual file README or metadata files for specific details on data provenance and licensing.

### Citation Format

If using this dataset, cite as:

```
Source: dx-scin-public-data Dataset, Google Cloud Storage
URL: https://console.cloud.google.com/storage/browser/dx-scin-public-data/dataset
```

---

## Support and Issues

For issues accessing the dataset or technical questions, please contact the administrator or check the following resources:

- [Google Cloud Storage Documentation](https://cloud.google.com/storage/docs/)
- Support forum for the specific project hosting this dataset.

---

## Updates

This dataset is periodically updated. Check the `CHANGELOG.md` (if available) or monitor the folder for new files or versions.

---

## Disclaimer

The dataset is provided "as is" without warranty of any kind. The hosting provider or data contributors are not responsible for any inaccuracies or misuse.

# SCIN Dataset Notebook

This notebook demonstrates the usage of the SCIN dataset, a comprehensive collection of skin condition data provided in CSV format from Google Cloud Storage. The notebook covers the loading, exploration, and analysis of the dataset with Python tools.

## Features

- **Data Loading**: Load metadata and labels from CSV files stored in Google Cloud Storage.
- **Data Exploration**: Explore distributions of skin conditions, symptoms, and associated metadata.
- **Visualization**: Display images associated with cases and plot distributions of categorical data.
- **Customization**: Modify configurations like dataset path and columns of interest.

## Prerequisites

Before running the notebook, ensure you have the following:

1. Python environment with required packages:
   - `matplotlib`
   - `google-cloud-storage`
   - `pandas`
   - `numpy`
   - `Pillow`
2. Access to Google Cloud Storage (GCS) for authenticating and downloading dataset files.

## Instructions

### Running in Google Colab

This notebook is optimized for Google Colab. Follow these steps:

1. Open the notebook in Colab using the button below:

   [![Run in Colab](https://www.tensorflow.org/images/colab_logo_32px.png)](https://colab.research.google.com/github/google-research-datasets/scin/blob/main/scin_demo.ipynb)
2. Authenticate to access GCS by running the authentication cell.
3. Ensure the required libraries are installed by executing the setup cells.
4. Follow the provided code cells to load, explore, and analyze the data.

### Running Locally

To run the notebook locally:

1. Clone the repository:

   ```bash
   git clone https://github.com/google-research-datasets/scin
   cd scin
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Authenticate with GCS using the `gcloud` CLI. The `google-cloud-storage` client library reads Application Default Credentials, so set them up with:

   ```bash
   gcloud auth application-default login
   ```
4. Launch the notebook:

   ```bash
   jupyter notebook scin_demo.ipynb
   ```

## Notebook Sections

### 1. Setup

Install required libraries and configure GCS parameters.

### 2. Authenticate

Authenticate with Google Cloud Storage for accessing dataset files.

### 3. Load Dataset

Load metadata and labels into Pandas DataFrames for further processing.

### 4. Explore Data

Analyze metadata and label distributions with descriptive statistics and visualizations.

### 5. Visualize Images

Display associated images for selected cases along with their metadata and labels.

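The "Load Dataset" and "Explore Data" steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `load_tables` and `label_distribution`, the column names (`case_id`, `age_group`, `condition`), and the inline toy CSVs are all hypothetical stand-ins so the example runs without network access.

```python
import io

import pandas as pd

# Toy stand-ins for the metadata and label CSVs.
# Column names and values are hypothetical, not the real dataset schema.
CASES_CSV = """case_id,age_group
1,group_a
2,group_b
3,group_a
"""

LABELS_CSV = """case_id,condition
1,condition_a
2,condition_b
3,condition_a
"""


def load_tables(cases_source, labels_source, key="case_id"):
    """Read both CSVs and join them on a shared key column."""
    # Read the key as a string so identifiers are not coerced to numbers.
    cases = pd.read_csv(cases_source, dtype={key: str})
    labels = pd.read_csv(labels_source, dtype={key: str})
    return cases.merge(labels, on=key, how="inner")


def label_distribution(df, column):
    """Count each distinct value in `column`, most frequent first."""
    return df[column].value_counts()


merged = load_tables(io.StringIO(CASES_CSV), io.StringIO(LABELS_CSV))
condition_counts = label_distribution(merged, "condition")
print(condition_counts)
```

`pd.read_csv` also accepts file paths, so the same helpers should work on CSV files downloaded with `gsutil`; the real file and column names need to be taken from the dataset itself.
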
## Configuration

Key configuration parameters are defined in the `Globals` class, allowing customization of:

- GCP Project ID
- GCS Bucket Name
- Paths to dataset files and directories

## Example Outputs

- Metadata statistics
- Race/ethnicity distribution
- Skin condition label distribution
- Associated images and attributes

## Licensing

This notebook and associated dataset are licensed under the Apache License, Version 2.0. For more details, see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file.

## Contact

For questions or issues, reach out to the dataset maintainers or create an issue in the [GitHub repository](https://github.com/google-research-datasets/scin).
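
As an illustration of the Configuration section, a settings holder along these lines could group the project ID, bucket name, and paths in one place. This is a sketch under assumptions, not the notebook's actual `Globals` class: the field names, the `gcs_uri` helper, the project ID placeholder, and the file paths are hypothetical. Only the bucket name `dx-scin-public-data` comes from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Globals:
    """Illustrative configuration holder; all defaults except the bucket are placeholders."""

    gcp_project: str = "your-gcp-project-id"      # placeholder, replace with your own
    gcs_bucket_name: str = "dx-scin-public-data"  # bucket named in this README
    cases_csv: str = "dataset/cases.csv"          # hypothetical path
    labels_csv: str = "dataset/labels.csv"        # hypothetical path
    images_dir: str = "dataset/images/"           # hypothetical path

    def gcs_uri(self, relative_path: str) -> str:
        """Build a gs:// URI for a path inside the configured bucket."""
        return f"gs://{self.gcs_bucket_name}/{relative_path.lstrip('/')}"


config = Globals()
print(config.gcs_uri(config.cases_csv))

# Override a single setting without editing the class itself.
custom = Globals(gcs_bucket_name="my-own-bucket")
print(custom.gcs_uri("/dataset/cases.csv"))
```

Keeping the settings in one frozen object means a changed bucket or path is passed in at construction time instead of being edited across several cells.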