Overview

Sinkove datasets contain synthetic medical images generated with diffusion models. Each dataset is configurable and intended for research, AI model training, and education.

Dataset Types

Medical Imaging Modalities

Sinkove supports generation of various medical imaging types:

  • Chest X-rays - Most common modality with multiple specialized models
  • Other Modalities - Additional imaging types available based on model selection

Dataset Formats

All datasets are delivered as ZIP archives containing:

  • Images - High-resolution synthetic medical images (typically 512x512 pixels)
  • Metadata - JSON files with image annotations and generation parameters
  • Model Cards - Documentation describing the AI models used for generation
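
A quick inventory of a downloaded archive can confirm that it contains the three kinds of content listed above. The sketch below uses only Python's standard `zipfile` module to count archive members by file extension. The function name `summarize_archive` is illustrative, and the folder layout and file names inside a Sinkove archive are not specified here, so the code makes no assumption about them.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Count the files in a ZIP archive by lowercase extension."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # ZIP member names always use forward slashes
            suffix = PurePosixPath(info.filename).suffix.lower() or "(none)"
            counts[suffix] += 1
    return dict(counts)
```

Calling `summarize_archive("dataset.zip")` returns a dictionary mapping each extension to a file count, which makes a missing metadata or model card file easy to spot.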

Creating Datasets

Configuration Options

When creating a dataset, you can specify:

Model Selection

  • Choose from available generative models optimized for different scenarios
  • Each model may specialize in specific conditions or imaging characteristics

Prompts and Descriptions

  • Use clinical terminology for accurate generation
  • Examples: “Severe cardiomegaly”, “Bilateral pneumonia”, “Normal chest X-ray”
  • Multiple prompts can be used to create diverse datasets

Dataset Size

  • Currently limited to approximately 1000 images per dataset
  • Larger datasets require more generation time

Quality Settings

  • Configure generation parameters for your specific research needs
  • Balance image quality against generation speed

Start with smaller datasets (10-50 images) to test your configuration before generating larger collections.
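
One way to apply this tip is to plan how many images each prompt receives before submitting a request. The helper below is a hypothetical planning sketch and not part of the Sinkove SDK. It splits a requested total as evenly as possible across prompts and rejects totals above the approximate 1000-image limit, which is exposed as the `max_images` parameter because the exact cap may differ.

```python
def plan_prompt_counts(prompts, total_images, max_images=1000):
    """Split total_images as evenly as possible across prompts."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    if not 0 < total_images <= max_images:
        raise ValueError(f"total_images must be between 1 and {max_images}")
    base, extra = divmod(total_images, len(prompts))
    # The first `extra` prompts each receive one additional image.
    return {p: base + (1 if i < extra else 0) for i, p in enumerate(prompts)}


# A 20-image pilot run using the example prompts from this page
pilot = plan_prompt_counts(
    ["Severe cardiomegaly", "Bilateral pneumonia", "Normal chest X-ray"],
    total_images=20,
)
print(pilot)
```

For the pilot above, the first two prompts receive 7 images each and the third receives 6.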

Dataset Status Lifecycle

Status Phases

  1. Pending - Dataset creation request is queued
  2. Processing - AI models are actively generating images
  3. Ready - Dataset is complete and available for download
  4. Failed - Generation encountered an error
  5. Expired - Dataset is no longer available (after the retention period)
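
The five phases can be modeled in client code so that polling logic knows when to stop. The following is a minimal sketch: the lowercase string values are an assumption, since this page does not state how the API or SDK spells each status, and `DatasetStatus` and `is_done_polling` are illustrative names.

```python
from enum import Enum


class DatasetStatus(Enum):
    # String values are assumed; check what the API actually returns.
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"
    EXPIRED = "expired"


# Statuses after which waiting longer will not produce a download.
DONE_POLLING = {DatasetStatus.READY, DatasetStatus.FAILED, DatasetStatus.EXPIRED}


def is_done_polling(raw_status):
    """Return True if polling can stop for this status string."""
    return DatasetStatus(raw_status.strip().lower()) in DONE_POLLING
```

An unrecognized status string raises `ValueError`, which surfaces a mismatch between the assumed values and the real ones immediately.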

Monitoring Progress

Track your dataset status through:

  • Web Dashboard - Real-time status updates in the platform
  • Python SDK - Programmatic status checking
  • API Endpoints - Direct REST API status queries
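
import time

# `client` is assumed to be an initialized Sinkove SDK client.
# Check dataset status
dataset = client.dataset("dataset_id")
print(f"Status: {dataset.status}")

# Wait for completion, re-fetching so the status stays current.
# Note: this loop does not exit on its own if generation fails.
while not dataset.ready():
    time.sleep(30)
    dataset = client.dataset("dataset_id")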
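
A polling loop with no upper bound can wait forever if a dataset fails or stalls. The sketch below shows one way to add a timeout using only the standard library. `wait_until` is a hypothetical helper, not an SDK function, and the 30-minute default is an arbitrary assumption. The `sleep` and `clock` parameters exist only so the function can be tested without real delays.

```python
import time


def wait_until(check, timeout=1800.0, interval=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True.

    Returns True on success, or False once another wait would
    exceed `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

With the SDK calls shown on this page, a call might look like `wait_until(lambda: client.dataset("dataset_id").ready())`, with a `False` result treated as a signal to check the dashboard.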

Best Practices

Research Applications

Model Training

  • Use diverse prompts to create balanced training sets
  • Include both normal and pathological cases
  • Consider augmenting real datasets with synthetic data

Validation Studies

  • Generate datasets with known ground truth for algorithm testing
  • Create edge cases that are rare in real clinical data
  • Test model robustness with synthetic variations

Educational Use

  • Create teaching datasets with clear diagnostic features
  • Generate rare conditions for student training
  • Provide consistent, high-quality examples

Technical Considerations

Data Management

  • Download datasets promptly to avoid expiration
  • Organize datasets with clear naming conventions
  • Maintain metadata for reproducibility
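
To make the reproducibility point concrete, a small manifest can be stored next to each downloaded archive. This is a sketch under stated assumptions: the manifest fields, the `.manifest.json` naming, and the `write_manifest` function are illustrative choices and not a Sinkove format. The SHA-256 checksum lets you confirm later that an archive has not changed.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(zip_path, dataset_id, prompts, model_name):
    """Write <archive>.manifest.json next to a downloaded archive."""
    zip_path = Path(zip_path)
    digest = hashlib.sha256()
    with zip_path.open("rb") as fh:
        # Hash in 1 MiB chunks so large archives are not loaded into memory
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    manifest = {
        "dataset_id": dataset_id,
        "model": model_name,
        "prompts": list(prompts),
        "sha256": digest.hexdigest(),
        "size_bytes": zip_path.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "synthetic": True,  # explicit label for downstream users
    }
    out_path = zip_path.with_name(zip_path.name + ".manifest.json")
    out_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out_path
```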

Quality Assurance

  • Review sample images before generating large datasets
  • Validate that synthetic images meet your research requirements
  • Consider expert review for clinical accuracy
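
Reviewing a sample is easier to repeat and audit when the sample is drawn reproducibly. The sketch below picks a fixed-seed random subset of image files from an extracted dataset folder. The set of image extensions is an assumption, because this page does not state which image format the archives use, and `sample_for_review` is an illustrative name.

```python
import random
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed formats


def sample_for_review(image_dir, k=10, seed=0):
    """Return up to k image paths chosen reproducibly from image_dir."""
    images = sorted(
        p for p in Path(image_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # Sorting first makes the draw independent of filesystem ordering
    rng = random.Random(seed)
    return rng.sample(images, min(k, len(images)))
```

Using the same `seed` on the same folder returns the same files, so a reviewer's notes can be matched to the images later.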

Ethics and Compliance

  • Use synthetic data responsibly in research contexts
  • Clearly label synthetic data in publications
  • Follow institutional guidelines for AI-generated content

Storage and Retention

Download Requirements

  • Immediate Download - Download datasets upon completion
  • Retention Period - Datasets may be automatically deleted after a specified time
  • Storage Limits - Account storage quotas may apply

Local Management

# Organize downloaded datasets
# `client` is assumed to be an initialized Sinkove SDK client.
import os
from datetime import datetime

def organize_dataset(dataset_id, download_path="./datasets"):
    """Download a ready dataset into a timestamped directory.

    Returns the directory path, or None if the dataset is not ready.
    """
    dataset = client.dataset(dataset_id)
    if not dataset.ready():
        # Avoid leaving an empty directory behind
        print(f"Dataset {dataset_id} is not ready; nothing downloaded")
        return None

    # Create organized directory structure
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dataset_dir = os.path.join(download_path, f"{dataset_id}_{timestamp}")
    os.makedirs(dataset_dir, exist_ok=True)

    # Download dataset
    filepath = os.path.join(dataset_dir, "dataset.zip")
    dataset.save(filepath)
    print(f"Dataset saved to: {filepath}")
    return dataset_dir

Troubleshooting

Common Issues

Generation Failures

  • Check prompt clarity and clinical accuracy
  • Verify model compatibility with requested content
  • Reduce dataset size if memory limitations occur

Download Problems

  • Ensure stable internet connection for large files
  • Verify sufficient local storage space
  • Check API key permissions

Quality Issues

  • Refine prompts with more specific medical terminology
  • Try different models for better results
  • Adjust generation parameters if available

Getting Support

For dataset-related issues:

  • Review generation logs in your dashboard
  • Contact support with specific dataset IDs
  • Provide detailed descriptions of quality concerns